

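Entry: Modern Information Retrieval (English-language 2nd edition)


Original title: Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition)

Original publisher: Addison-Wesley Professional

Authors: Ricardo Baeza-Yates (Spain), Berthier Ribeiro-Neto (Brazil)

Series: 经典原版书库 (Classic Original Editions Library)




Publication date: March 2011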










preface to the second edition v

preface to the first edition vii

authors’ acknowledgements to the second edition viii

authors’ acknowledgements to the first edition x

publishers’ acknowledgements xii

contents xvii

1 introduction 1

1.1 information retrieval 1

1.1.1 early developments 1

1.1.2 information retrieval in libraries and digital libraries 3

1.1.3 ir at the center of the stage 3

1.2 the ir problem 3

1.2.1 the user’s task 4

1.2.2 information versus data retrieval 5

1.3 the ir system 5

1.3.1 software architecture of the ir system 5

1.3.2 the retrieval and ranking processes 7

1.4 the web 8

1.4.1 a brief history 8

1.4.2 the e-publishing era 9

1.4.3 how the web changed search 10

1.4.4 practical issues on the web 12

1.5 organization of the book 12

1.5.1 focus of the book 12

1.5.2 book contents 13

1.6 the book web site: a teaching resource 16

1.7 bibliographic discussion 17

2 user interfaces for search 21

by marti hearst

2.1 introduction 21

2.2 how people search 21

2.2.1 information lookup versus exploratory search 22

2.2.2 classic versus dynamic model of information seeking 23

2.2.3 navigation versus search 24

2.2.4 observations of the search process 24

2.3 search interfaces today 25

2.3.1 getting started 25

2.3.2 query specification 26

2.3.3 query specification interfaces 27

2.3.4 retrieval results display 29

2.3.5 query reformulation 32

2.3.6 organizing search results 35

2.4 visualization in search interfaces 40

2.4.1 visualizing boolean syntax 42

2.4.2 visualizing query terms within retrieval results 43

2.4.3 visualizing relationships among words and documents 47

2.4.4 visualization for text mining 49

2.5 design and evaluation of search interfaces 50

2.6 trends and research issues 54

2.7 bibliographic discussion 54

3 modeling 57

3.1 ir models 57

3.1.1 modeling and ranking 57

3.1.2 characterization of an ir model 58

3.1.3 a taxonomy of ir models 59

3.2 classic information retrieval 61

3.2.1 basic concepts 61

3.2.2 the boolean model 64

3.2.3 term weighting 66

3.2.4 tf-idf weights 68

3.2.5 document length normalization 75

3.2.6 the vector model 77

3.2.7 the probabilistic model 79

3.2.8 brief comparison of classic models 86

3.3 alternative set theoretic models 87

3.3.1 set-based model 87

3.3.2 extended boolean model 92

3.3.3 fuzzy set model 95

3.4 alternative algebraic models 98

3.4.1 generalized vector space model 98

3.4.2 latent semantic indexing model 101

3.4.3 neural network model 102

3.5 alternative probabilistic models 104

3.5.1 bm25 104

3.5.2 language models 107

3.5.3 divergence from randomness 113

3.5.4 bayesian network models 116

3.6 other models 124

3.6.1 the hypertext model 124

3.6.2 web based models 125

3.6.3 structured text retrieval 126

3.6.4 multimedia retrieval 126

3.6.5 enterprise and vertical search 126

3.7 trends and research issues 127

3.8 bibliographic discussion 128

4 retrieval evaluation 131

4.1 introduction 131

4.2 the cranfield paradigm 132

4.2.1 a brief history 132

4.2.2 reference collections 134

4.3 retrieval metrics 134

4.3.1 precision and recall 135

4.3.2 single value summaries: p@n, map, mrr, f 139

4.3.3 user-oriented measures 144

4.3.4 dcg: discounted cumulated gain 145

4.3.5 bpref: binary preferences 150

4.3.6 rank correlation metrics 153

4.4 reference collections 158

4.4.1 the trec collections 159

4.4.2 other reference collections 166

4.4.3 other small test collections 167

4.5 user-based evaluation 168

4.5.1 human experimentation in the lab 168

4.5.2 side-by-side panels 168

4.5.3 a/b testing 169

4.5.4 crowdsourcing 170

4.5.5 evaluation using clickthrough data 171

4.6 practical caveats 173

4.7 trends and research issues 174

4.8 bibliographic discussion 174

5 relevance feedback and query expansion 177

5.1 introduction 177

5.2 a framework for feedback methods 178

5.3 explicit relevance feedback 180

5.3.1 relevance feedback for the vector model: rocchio method 181

5.3.2 relevance feedback for the probabilistic model 183

5.3.3 evaluation of relevance feedback 184

5.4 explicit feedback through clicks 185

5.4.1 eye tracking and relevance judgements 185

5.4.2 user behavior 186

5.4.3 clicks as a metric of user preferences 187

5.5 implicit feedback through local analysis 190

5.5.1 implicit feedback through local clustering 190

5.5.2 implicit feedback through local context analysis 193

5.6 implicit feedback through global analysis 195

5.6.1 query expansion based on a similarity thesaurus 195

5.6.2 query expansion based on a statistical thesaurus 198

5.7 trends and research issues 200

5.8 bibliographic discussion 200

6 documents: languages & properties 203

with gonzalo navarro and nivio ziviani

6.1 introduction 203

6.2 metadata 205

6.3 document formats 206

6.3.1 text 206

6.3.2 multimedia 207

6.3.3 graphics and virtual reality 208

6.4 markup languages 208

6.4.1 sgml 209

6.4.2 html 211

6.4.3 xml 214

6.4.4 rdf: resource description framework 216

6.4.5 hytime 217

6.5 text properties 218

6.5.1 information theory 218

6.5.2 modeling natural language 219

6.5.3 text similarity 222

6.6 document preprocessing 223

6.6.1 lexical analysis of the text 224

6.6.2 elimination of stopwords 226

6.6.3 stemming 226

6.6.4 keyword selection 227

6.6.5 thesauri 228

6.7 organizing documents 231

6.7.1 taxonomies 231

6.7.2 folksonomies 232

6.8 text compression 233

6.8.1 basic concepts 234

6.8.2 statistical methods 234

6.8.3 statistical methods: modeling 235

6.8.4 statistical methods: coding 238

6.8.5 dictionary methods 245

6.8.6 preprocessing for compression 246

6.8.7 comparing text compression techniques 248

6.8.8 structured text compression 249

6.9 trends and research issues 250

6.10 bibliographical discussion 253

7 queries: languages & properties 255

with gonzalo navarro

7.1 query languages 255

7.1.1 keyword-based querying 256

7.1.2 beyond keywords 259

7.1.3 structural queries 262

7.1.4 query protocols 265

7.2 query properties 267

7.2.1 characterizing web queries 267

7.2.2 user search behavior 269

7.2.3 query intent 270

7.2.4 query topic 272

7.2.5 query sessions and missions 273

7.2.6 query difficulty 274

7.3 trends and research issues 278

7.4 bibliographical discussion 279

8 text classification 281

with marcos gonçalves

8.1 introduction 281

8.2 a characterization of text classification 282

8.2.1 machine learning 282

8.2.2 the text classification problem 283

8.2.3 text classification algorithms 284

8.3 unsupervised algorithms 286

8.3.1 clustering 286

8.3.2 naive text classification 290

8.4 supervised algorithms 291

8.4.1 decision trees 294

8.4.2 the k-nn classifier 299

8.4.3 the rocchio classifier 300

8.4.4 probabilistic naive bayes document classification 303

8.4.5 the svm classifier 306

8.4.6 ensemble classifiers 316

8.4.7 final remarks on supervised algorithms 319

8.5 feature selection or dimensionality reduction 320

8.5.1 term–class incidence table 321

8.5.2 term document frequency 322

8.5.3 tf-idf weights 322

8.5.4 mutual information 323

8.5.5 information gain 323

8.5.6 chi square 324

8.5.7 impact of feature selection 325

8.6 evaluation metrics 325

8.6.1 contingency table 325

8.6.2 accuracy and error 326

8.6.3 precision and recall 327

8.6.4 f-measure and f1 327

8.6.5 cross-validation 329

8.6.6 standard collections 329

8.7 organizing the classes – building taxonomies 330

8.8 trends and research issues 333

8.9 bibliographic discussion 334

9 indexing and searching 337

with gonzalo navarro

9.1 introduction 337

9.2 inverted indexes 340

9.2.1 basic concepts 340

9.2.2 full inverted indexes 341

9.2.3 searching 345

9.2.4 ranking 348

9.2.5 construction 351

9.2.6 compressed inverted indexes 354

9.2.7 structural queries 357

9.3 signature files 357

9.4 suffix trees and suffix arrays 360

9.4.1 structure: tries and suffix trees 361

9.4.2 searching for simple strings 362

9.4.3 searching for complex patterns 363

9.4.4 construction 365

9.4.5 compressed suffix arrays 367

9.5 sequential searching 372

9.5.1 simple strings: horspool 373

9.5.2 complex patterns: automata and bit-parallelism 375

9.5.3 faster bit-parallel algorithms 379

9.5.4 regular expressions 382

9.5.5 multiple patterns 384

9.5.6 approximate searching 385

9.5.7 searching compressed text 389

9.6 multi-dimensional indexing 391

9.7 trends and research issues 393

9.8 bibliographic discussion 394

10 parallel and distributed ir 399

with eric brown

10.1 introduction 399

10.2 a taxonomy of distributed ir systems 402

10.3 data partitioning 404

10.3.1 collection partitioning 405

10.3.2 collection selection 407

10.3.3 inverted index partitioning 409

10.3.4 partitioning other indexes 413

10.4 parallel ir 414

10.4.1 introduction 414

10.4.2 parallel ir on mimd architectures 416

10.4.3 parallel ir on simd architectures 418

10.5 cluster-based ir 423

10.6 distributed ir 424

10.6.1 introduction 424

10.6.2 indexing 428

10.6.3 query processing 431

10.6.4 web issues 437

10.7 federated search 438

10.8 retrieval in peer-to-peer networks 440

10.9 trends and research issues 444

10.10 bibliographic discussion 445

11 web retrieval 447

with yoelle maarek

11.1 introduction 447

11.2 a challenging problem 449

11.3 the web 451

11.3.1 characteristics 451

11.3.2 structure of the web graph 452

11.3.3 modeling the web 454

11.3.4 link analysis 456

11.4 search engine architectures 458

11.4.1 basic architecture 458

11.4.2 cluster-based architecture 459

11.4.3 caching 462

11.4.4 multiple indexes 464

11.4.5 distributed architectures 466

11.5 search engine ranking 468

11.5.1 ranking signals 469

11.5.2 link-based ranking 470

11.5.3 simple ranking functions 473

11.5.4 learning to rank 473

11.5.5 learning the ranking function 474

11.5.6 quality evaluation 475

11.5.7 web spam 476

11.6 managing web data 477

11.6.1 assigning identifiers to documents 477

11.6.2 metadata 478

11.6.3 compressing the web graph 478

11.6.4 handling duplicated data 479

11.7 search engine user interaction 480

11.7.1 the search rectangle paradigm 481

11.7.2 the search engine result page 488

11.7.3 educating the user 497

11.8 browsing 498

11.8.1 flat browsing 499

11.8.2 structure guided browsing and web directories 499

11.9 beyond browsing 501

11.9.1 hypertext and the web 501

11.9.2 combining searching with browsing 501

11.9.3 web query languages 503

11.9.4 dynamic search 503

11.10 related problems 504

11.10.1 computational advertising 504

11.10.2 web mining 506

11.10.3 metasearch 508

11.11 trends and research issues 509

11.11.1 beyond static text data 509

11.11.2 current challenges 511

11.12 bibliographical discussion 513

12 web crawling 515

with carlos castillo

12.1 introduction 515

12.2 applications of a web crawler 517

12.2.1 general web search 517

12.2.2 topical crawling 518

12.2.3 web characterization 518

12.2.4 mirroring 518

12.2.5 web site analysis 519

12.3 a taxonomy of crawlers 519

12.3.1 types of web pages 520

12.4 architecture and implementation 521

12.4.1 crawler architecture 521

12.4.2 practical issues 523

12.4.3 parallel crawling 526

12.5 scheduling algorithms 527

12.5.1 selection policy 528

12.5.2 revisit policy 530

12.5.3 politeness policy 535

12.5.4 combining policies 538

12.6 evaluation 539

12.6.1 evaluating network usage 539

12.6.2 evaluating long-term scheduling 540

12.7 trends and research issues 541

12.7.1 crawling the “hidden” web 541

12.7.2 crawling with the help of web sites 542

12.7.3 distributed crawling 543

12.8 bibliographic discussion 543

13 structured text retrieval 545

with mounia lalmas

13.1 introduction 545

13.2 structuring power 546

13.2.1 explicit vs. implicit structure 546

13.2.2 static vs. dynamic structure 547

13.2.3 single hierarchy vs. multiple hierarchies 548

13.3 early text retrieval models 549

13.3.1 model based on non-overlapping lists 549

13.3.2 model based on proximal nodes 550

13.3.3 ranking structured text results 551

13.4 xml retrieval 551

13.4.1 challenges in xml retrieval 551

13.4.2 indexing strategies 553

13.4.3 ranking strategies 554

13.4.4 removing overlaps 565

13.5 xml retrieval evaluation 566

13.5.1 document collections 566

13.5.2 topics 567

13.5.3 retrieval tasks 568

13.5.4 relevance 569

13.5.5 measures 571

13.6 query languages 573

13.6.1 characteristics 574

13.6.2 classification of xml query languages 575

13.6.3 examples of xml query languages 577

13.7 trends and research issues 582

13.8 bibliographic discussion 585

14 multimedia information retrieval 587

by dulce ponceleón and malcolm slaney

14.1 introduction 587

14.1.1 what is multimedia? 587

14.1.2 multimedia ir 588

14.1.3 text ir versus multimedia ir 589

14.2 the challenges 589

14.2.1 the semantic gap 589

14.2.2 feature ambiguity 591

14.2.3 machine-generated data 591

14.3 content-based image retrieval 592

14.3.1 color-based retrieval 593

14.3.2 texture 593

14.3.3 salient points 596

14.4 audio and music retrieval 597

14.4.1 fingerprinting 598

14.4.2 speech recognition 599

14.4.3 speaker identification 601

14.4.4 spoken document retrieval 602

14.4.5 audio basics 602

14.5 retrieving and browsing video 606

14.5.1 video abstracts 606

14.5.2 static summaries 607

14.5.3 mosaics and salient stills 608

14.5.4 dynamic summaries 609

14.5.5 interactive summaries 611

14.5.6 visual vs. audio browsing 612

14.5.7 evaluating summaries 613

14.6 fusion models: combining it all 614

14.6.1 naming faces 614

14.6.2 naming images 615

14.6.3 naming audio 616

14.6.4 combining audio and video for avsr 617

14.6.5 combining audio and video for multimedia 620

14.7 segmentation 620

14.7.1 a video segmentation example 620

14.7.2 segmentation schemes for video 622

14.7.3 video segmentation with edges 623

14.7.4 speech segmentation 624

14.7.5 segmentation evaluation 625

14.8 compression and mpeg standards 625

14.8.1 intensity and sampling 626

14.8.2 color 626

14.8.3 lossy compression 628

14.8.4 lossless compression 628

14.8.5 temporal redundancy 630

14.8.6 motion prediction 631

14.8.7 mpeg standards 633

14.9 trends and research issues 636

14.10 bibliographic discussion 637

15 enterprise search 641

by david hawking

15.1 introduction 641

15.1.1 characteristics and applications of enterprise search 642

15.1.2 enterprise search software 643

15.1.3 workplace search 644

15.2 enterprise search tasks 644

15.2.1 examples of search-supported tasks 644

15.2.2 search types 647

15.2.3 studying enterprise search 647

15.3 architecture of enterprise search systems 648

15.3.1 gathering 648

15.3.2 extracting 651

15.3.3 indexing 652

15.3.4 indexing textual annotations 653

15.3.5 query processing 654

15.3.6 presentation of search results 655

15.3.7 security models 657

15.3.8 federation/metasearch 659

15.4 enterprise search evaluation 662

15.4.1 published test collections for enterprise search 662

15.4.2 internal enterprise search evaluations 663

15.4.3 enterprise search tuning 665

15.4.4 what is it reasonable to expect? 666

15.5 potential reasons for dissatisfaction 667

15.6 context and personalization 668

15.6.1 controls and levers for contextualization 671

15.6.2 contextualization: local, enterprise or global? 675

15.6.3 privacy of profiles 676

15.6.4 defining, creating and maintaining a profile 677

15.6.5 user modeling 677

15.6.6 implicit measures 679

15.6.7 information filtering 679

15.6.8 social recommender systems 680

15.7 trends and research issues 681

15.8 bibliographic discussion 681

16 library systems 685

by edie rasmussen

16.1 the information environment in the library 685

16.2 online public access catalogues 687

16.2.1 opacs and bibliographic records 689

16.2.2 information retrieval from the ils 691

16.2.3 integrating the hybrid library 693

16.2.4 opacs and end users 694

16.2.5 ils: vendors and products 695

16.3 ir systems and document databases 697

16.3.1 bibliographic and full-text databases 698

16.3.2 content of database records 698

16.3.3 the online industry: database vendors 701

16.3.4 information retrieval from document databases 702

16.4 information retrieval in organizations 706

16.5 trends and research issues 708

16.6 bibliographic discussion 709

17 digital libraries 711

by marcos gonçalves

17.1 introduction 711

17.2 defining digital libraries 712

17.3 a general architecture 713

17.4 fundamentals 714

17.4.1 digital objects and collections 714

17.4.2 metadata and catalogs 716

17.4.3 repositories/archives 719

17.4.4 services 723

17.5 social-economical issues 725

17.5.1 social issues 725

17.5.2 economical issues 726

17.6 software systems 727

17.6.1 greenstone 728

17.6.2 eprints 728

17.6.3 dspace 728

17.6.4 fedora 729

17.6.5 open digital libraries 729

17.6.6 the 5s suite 730

17.7 dl case studies 731

17.7.1 the networked dl of theses and dissertations 731

17.7.2 the national science digital library 732

17.7.3 the etana-dl archaeological digital library 732

17.8 trends and research issues 733

17.8.1 evaluation 733

17.8.2 integration 733

17.8.3 other research challenges 734

17.9 bibliographic discussion 735

a open source search engines 737

with christian middleton

a.1 introduction 737

a.2 search engines 738

a.2.1 preliminary selection of search engines 738

a.2.2 features 741

a.2.3 evaluation 742

a.3 methodology 743

a.3.1 document collections 743

a.3.2 evaluation tests 744

a.3.3 experimental setup 744

a.4 experimental results 745

a.4.1 test a – indexing 745

a.4.2 test b – incremental indexing 749

a.4.3 test c – search performance 749

a.4.4 global evaluation 752

a.5 conclusions 753

b biographies 755

references 761

index 893

