Introduction to Information Retrieval: Reading Notes, Chapter 1

2024-03-04 17:33:47

Preface

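After I got in touch with my prospective graduate advisor, my advisor decided that I should start learning information retrieval this summer, sent me an English-language IR classic, "Introduction to Information Retrieval", and asked me to write an English report after finishing each chapter. The book is said to be the go-to introduction to IR, so my first reaction on seeing it was ~~sheer dread~~ (excitement). Starting today I will also post my reading summaries (a.k.a. my saga of gnawing through this book) on this blog. I hope I can keep it up!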

If you want the PDF of the original book, you can download it from the netdisk:
Link: https://pan.baidu.com/s/1PkJ-I-HNfyNHgyb5aZGNaA
Extraction code: w8h8

Chapter 1: Boolean retrieval

In the first chapter, the authors use an example of information retrieval to introduce Boolean retrieval. They then define the binary term-document incidence matrix and the inverted index and show how each is used in Boolean retrieval. The chapter also introduces other major retrieval methods and some of the difficulties of information retrieval.

definitions

Here are definitions of terms in this chapter:

Information retrieval: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Term-document incidence matrix: A matrix that shows whether each term is contained in each document (my own wording). In the book's example, which uses Shakespeare's plays as documents, matrix element (t, d) is 1 if the play in column d contains the word in row t, and 0 otherwise.
Inverted index: It is composed of two parts: a dictionary kept in memory and postings kept on disk. The dictionary contains all the terms that appear in the documents. For each term, it holds a pointer to that term's postings list on disk, which records the IDs of the documents in which the term appears. For example, in the book's example index, we can see that the term “Calpurnia” is contained in documents 2, 31, 54, and 101.
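To make the two definitions concrete, here is a minimal sketch that builds both structures for a tiny collection. The three documents, the whitespace tokenization, and the function names are my own assumptions for illustration, not the book's data or code, and everything is kept in memory rather than split between memory and disk.

```python
from collections import defaultdict

# Hypothetical toy collection: document ID -> text (not from the book).
docs = {
    1: "brutus killed caesar",
    2: "caesar and calpurnia",
    3: "brutus and calpurnia and caesar",
}

def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def build_incidence_matrix(docs):
    """Map each term to a 0/1 row with one entry per document, in doc-ID order."""
    doc_ids = sorted(docs)
    tokens = {d: set(docs[d].split()) for d in doc_ids}
    vocab = sorted(set().union(*tokens.values()))
    return {t: [1 if t in tokens[d] else 0 for d in doc_ids] for t in vocab}

index = build_inverted_index(docs)
matrix = build_incidence_matrix(docs)
print(index["calpurnia"])  # [2, 3]
print(matrix["brutus"])    # [1, 0, 1]
```

The matrix row for a term always has one entry per document, while the postings list only stores the documents that actually contain the term.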

detailed summary

Here is my summary and some thoughts:
In general, the first chapter is about how to use Boolean retrieval to find, within a large collection, the documents that satisfy a given combination of word conditions. By definition, information retrieval works on unstructured data, not structured data. “Unstructured data” refers to data that does not have a clear, semantically overt, easy-for-a-computer structure. By contrast, the data stored in a relational database is structured data, so we can query it directly with SQL statements. Even unstructured data is usually not completely unstructured: text has latent linguistic structure that we can try to exploit.
For a large number of web documents, we must first index all the documents. When users retrieve documents with specific keywords, we only need to look up the index and carry out certain operations on it to get the results. Building the index in advance greatly reduces retrieval time and avoids repeating the same work for every query. The first chapter introduces two indexing methods: the term-document incidence matrix and the inverted index.
For the term-document incidence matrix, we get the result vector by extracting the row vectors of the terms given by the user and then combining them with AND, OR, and NOT operations. In the result vector, a “1” means that the corresponding document matches. Taking the example in the book, if the query is “Brutus AND Caesar AND NOT Calpurnia”, we take the vectors for Brutus and Caesar, complement the vector for Calpurnia, and AND the three together bitwise.
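The bit-vector computation for this query can be sketched as follows. The three rows and the order of the plays follow the book's Shakespeare example; representing each row as a Python integer and the helper name `matching_plays` are my own choices for illustration.

```python
# Incidence rows from the book's Shakespeare example; the leftmost bit is the first play.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
rows = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

def matching_plays(bits, names):
    """Return the names whose bit is set, reading the bits from left to right."""
    n = len(names)
    return [name for i, name in enumerate(names) if (bits >> (n - 1 - i)) & 1]

# Python integers are unbounded, so mask the complement down to one bit per play.
mask = (1 << len(plays)) - 1
result = rows["Brutus"] & rows["Caesar"] & (~rows["Calpurnia"] & mask)

print(format(result, "06b"))          # 100100
print(matching_plays(result, plays))  # ['Antony and Cleopatra', 'Hamlet']
```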

As for the inverted index, the authors introduce it because the term-document incidence matrix contains a great many “0” entries; that is, it is a sparse matrix. In this situation, a list of the non-zero entries is commonly used in place of the full matrix. The inverted index still has terms, but each term no longer has a vector of equal length. Instead, a linked list or variable-length array of document IDs records the documents in which the term appears. Thus, the logical operations between bit vectors become logical operations between sets, but the essence remains the same. To carry out these operations quickly on sets represented as sorted lists, the book introduces an efficient merge algorithm with time complexity O(x+y), where x and y are the lengths of the two lists. The idea of the algorithm is as follows:
AND operation: Maintain a pointer to the current element in each of the two sorted lists. If the two current elements are equal, add the element to the result and advance both pointers. If they differ, advance the pointer at the smaller element. The algorithm ends when either list has been fully traversed.
OR operation: Maintain a pointer to the current element in each of the two sorted lists. If the two current elements are equal, add the element to the result once and advance both pointers. If they differ, add the smaller element to the result and advance its pointer. When one list has been fully traversed, append the remaining elements of the other list; the algorithm ends when both lists have been traversed.
AND NOT: Take “A AND NOT B” as an example. Maintain a pointer to the current element in each of the two sorted lists. If the two current elements are equal, advance both pointers without adding anything. If they differ and the current element of B is smaller, advance the pointer of B. If they differ and the current element of A is smaller, add the current element of A to the result and advance the pointer of A. If B is exhausted first, append the remaining elements of A; the algorithm ends when A has been fully traversed.
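The three merges described above can be sketched as follows. This is a minimal illustration that assumes each postings list is a sorted Python list of distinct document IDs (indices stand in for linked-list pointers); the function names and the sample lists are mine, not the book's pseudocode.

```python
def intersect(a, b):
    """A AND B: document IDs present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """A OR B: document IDs present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out

def and_not(a, b):
    """A AND NOT B: document IDs in a that do not appear in b."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            j += 1
    out.extend(a[i:])  # b is exhausted, so the rest of a survives
    return out

# Illustrative sorted postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]

print(intersect(brutus, calpurnia))  # [2, 31]
print(union(brutus, calpurnia))      # [1, 2, 4, 11, 31, 45, 54, 101, 173, 174]
print(and_not(brutus, calpurnia))    # [1, 4, 11, 45, 173, 174]
```

Each loop iteration advances at least one index, so every function does at most x+y steps, which matches the O(x+y) bound.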

optimization

Some algorithmic optimizations for Boolean retrieval:

Process the terms of a conjunctive query in order of increasing postings-list length, starting with the shortest list. Every intermediate result is then no longer than the shortest list, which reduces the number of comparisons.
For two terms whose postings lists differ greatly in length, we can use other strategies to speed things up, such as looking up each item of the short list in the long list by binary search. Because the postings are kept sorted, this can be much faster than a linear merge.
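Both optimizations can be sketched together. This is a rough illustration, assuming sorted Python lists of distinct document IDs; the function names are hypothetical, and using binary search for every step (rather than only when the lengths differ greatly) is a simplification on my part.

```python
from bisect import bisect_left

def intersect_by_binary_search(short, long):
    """Intersect two sorted lists by binary-searching `long` for each ID in `short`."""
    out = []
    lo = 0  # matches can only occur at or after the previous search position
    for doc_id in short:
        lo = bisect_left(long, doc_id, lo)
        if lo == len(long):
            break  # every remaining ID in `short` is larger than all of `long`
        if long[lo] == doc_id:
            out.append(doc_id)
    return out

def intersect_many(postings_lists):
    """AND together several postings lists, processing the shortest list first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # the intersection is already empty
        result = intersect_by_binary_search(result, postings)
    return result

# Illustrative sorted postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]

print(intersect_many([brutus, caesar, calpurnia]))  # [2]
```

With a short list of length x and a long list of length y, this costs roughly x·log(y) comparisons, so it only beats the linear merge when x is much smaller than y.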

Limitations

Limitations of the retrieval methods in Chapter 1:

It cannot recognize spelling mistakes, so query words must be spelled exactly.
It ignores synonyms. A query word may have synonyms, and the searcher may also want the system to return documents that contain those synonyms.
It cannot rank the returned documents. Boolean retrieval only decides whether a document matches; it has no way to compute how well a document matches, so the results cannot be sorted by relevance.
It cannot handle queries with proximity requirements. Say I want documents that contain both A and B, with A and B in the same sentence. Because the postings do not record word positions, this query cannot be answered.

My questions:

The book says that the dictionary in memory holds pointers to postings on the hard disk. How is this implemented? C++ has pointers, but those are limited to memory. How can I build pointers that cross from memory to the hard disk?

Since the index is built in advance, how is it kept current for the ever-changing collection of documents on the Internet? Is the index updated continuously as new or modified documents arrive?
