This lecture is about the implementation of text retrieval systems. In this lecture, we will discuss how to implement a text retrieval method to build a search engine. The main challenge is to manage a large amount of text data, to answer each query very quickly, and to handle many queries at once.

This is a typical text retrieval system architecture. The documents are first processed by a tokenizer to obtain tokenized units, for example, words. These words, or tokens, are then processed by an indexer, which creates an index, a data structure the search engine can use to answer a query quickly. The query goes through a similar processing step: the tokenizer is applied to the query as well, so that the text is processed in the same way and the same units can be matched with each other. The query's representation is then given to the scorer, which uses the index to quickly answer the user's query by scoring the documents and ranking them. The results are given to the user. The user can then look at the results and provide some feedback. That feedback can be explicit judgments of which documents are good and which are bad, or implicit feedback, where the user does not have to do anything extra: the user simply looks at the results, skips some, and clicks on others to view them. These interaction signals can be used by the system to improve ranking accuracy, for example by assuming that viewed documents are better than skipped ones.

A search engine system can thus be divided into three parts. The first part is the indexer, the second part is the scorer that responds to a user's query, and the third part is a feedback mechanism. Typically, indexing is done offline, so we can pre-process the document collection and build the inverted index, which we will introduce in a moment. This data structure can then be used by the online module, the scorer, to process a user's query dynamically and quickly generate search results. The feedback mechanism can be done online or offline, depending on the method. The implementation of the indexer and the scorer is fairly standard, and this is the main topic of this lecture and the next few lectures. The feedback mechanism, on the other hand, has many variations depending on which retrieval method is used, so it is usually implemented in an algorithm-specific way.

Let's first talk about the tokenizer. Tokenization normalizes lexical units into the same form, so that semantically similar words can be matched with each other. In a language like English, stemming is often used, and this maps all the inflectional forms of a word into the same root form. For example, computer, computation, and computing can all be mapped to the root form compute. This way, all these different forms can be matched with each other. Normally this is a good idea, as it increases the coverage of documents that are matched with the query. But it is not always beneficial, because sometimes the subtle difference between, say, computer and computation may still signal a difference in content. In most cases, though, stemming seems to be beneficial. When we tokenize text in some other languages, for example Chinese, we may face the special challenge of segmenting the text to find word boundaries, because it is not obvious where one word ends and the next begins: there are no spaces to separate them.
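To make the tokenization step concrete, here is a minimal sketch in Python, assuming a toy suffix-stripping stemmer invented purely for illustration; a real system would use an established algorithm such as the Porter stemmer.

```python
import re

def stem(word):
    # Toy suffix-stripping rule for illustration only; a real system
    # would use an established stemmer such as the Porter stemmer.
    for suffix in ("ation", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokenize(text):
    # Lowercase, split on non-letter characters, and normalize each
    # token to a root form so semantically similar words can match.
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]

print(tokenize("Computing and computation"))  # ['comput', 'and', 'comput']
```

Applying the same tokenize function to both documents and queries guarantees that the two sides produce comparable units.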
So here, of course, we have to use some language-specific processing techniques.

Once we have done tokenization, we can index the text documents, that is, convert the documents into a data structure that enables fast search. The basic idea is to precompute as much as we can. The most commonly used index is called an inverted index, and it has been used in many search engines to support basic search algorithms. Sometimes other indices, for example a document index, may be needed to support feedback, as I mentioned. Those techniques are not really standard, in that they vary a lot depending on the feedback method.

To understand why we want to use an inverted index, it is useful to think about how you would respond to a single-term query quickly. If you want more time to think about that, pause the video. Think about how you can pre-process the text data so that you can quickly respond to a query with just one word. If you have thought about this question, you may have realized that the best strategy is simply to precompute, for every term in the vocabulary, the list of documents that match that term. In this way, you essentially pre-construct the answers: when you see a single-term query, you simply fetch the precomputed list of documents for that term and return the list to the user. That is the fastest way to respond to a single-term query. The idea of the inverted index is basically exactly that: we pre-construct an index that allows us to quickly find all the documents that match a particular term.

So let's take a look at this example. We have three documents here, the documents you have seen in some previous lectures. Suppose we want to create an inverted index for these documents. We maintain a dictionary with one entry for each term, and we store some basic statistics about the term: for example, the number of documents that contain the term, and the total count, or frequency, of the term, that is, the total number of its occurrences in the collection. For example, the term "news" occurs in all three documents, so its document count is three. You may also realize that we need this document count, or document frequency, for computing a statistic used in the vector space model. Can you think of which one? Which weighting heuristic needs this count? That's right: inverse document frequency. IDF is a property of a term, and with the document count stored here it is easy to compute, either at indexing time or at query time when we see a query.

In addition to these basic statistics, we also store all the documents that match "news", and these entries are stored in a file called the postings. In this case, the term matched three documents, and we store information about these three documents here: the document ID and the frequency. In document 1 the term frequency of "news" is 1; in the second document it is also 1, et cetera. From this list, we can get all the documents that match the term "news", and we also know the frequency of "news" in each of these documents. So if the query has just one word, "news", we can simply look up this table to find the entry, follow the pointer to the postings, and fetch all the documents that match "news".

So, let's take a look at another term.
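To make this structure concrete, here is a minimal sketch in Python of building an inverted index with a dictionary of per-term statistics (document frequency and total collection frequency) and postings of (document ID, term frequency) pairs. The three toy documents are placeholders loosely modeled on the lecture's example, not the actual slides.

```python
from collections import Counter, defaultdict

docs = {  # toy documents standing in for the lecture's examples
    1: "news about",
    2: "news about organic food campaign",
    3: "news of presidential campaign presidential candidate",
}

dictionary = {}               # term -> {"df": ..., "cf": ...}
postings = defaultdict(list)  # term -> [(doc_id, term_frequency), ...]

for doc_id, text in sorted(docs.items()):
    for term, tf in Counter(text.split()).items():
        stats = dictionary.setdefault(term, {"df": 0, "cf": 0})
        stats["df"] += 1   # document frequency: # of docs containing term
        stats["cf"] += tf  # collection frequency: total # of occurrences
        postings[term].append((doc_id, tf))

print(dictionary["news"])  # {'df': 3, 'cf': 3}
print(postings["news"])    # [(1, 1), (2, 1), (3, 1)]
```

Because documents are scanned in order of their IDs, each postings list comes out sorted by document ID, which is what makes the fast query processing below possible.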
This time, let's take a look at the word "presidential". This term occurs in only one document, document 3, so the document frequency is 1, but it occurs twice in that document, so the total frequency count is 2. This total frequency count is used in some other retrieval methods, where we might use it to assess the popularity of a term in the collection. Similarly, we have a pointer to the postings here, and in this case there is only one entry, because the term occurs in just one document: the document ID is 3, and the term occurs twice.

So this is the basic idea of an inverted index. It is actually pretty simple, right? With this structure we can easily fetch all the documents that match a term, and this will be the basis for scoring documents against a query.

Sometimes we also want to store the positions of these terms. In many of these cases, the term occurs just once in the document, so there is only one position, as in this case. But in this other case, the term occurs twice, so there are two positions. The position information is very useful for checking whether the matched query terms occur within a small window of, let's say, five or ten words, or whether two matched query terms in fact form a two-word phrase. All of this can be checked quickly by using the position information.

So, why is an inverted index good for fast search? Well, we just talked about using it to answer a single-term query, and that is very easy. What about multi-term queries? Let's first look at the special case of Boolean queries. A Boolean query is basically a Boolean expression: for example, I want the documents to match both term A and term B, which is a conjunctive query; or I want the documents to match term A or term B, which is a disjunctive query. How can we answer such a query using an inverted index? If you think about it for a moment, it becomes obvious: we simply fetch all the documents that match term A, fetch all the documents that match term B, and then take the intersection to answer the query "A AND B", or the union to answer the query "A OR B". So this is all very easy to answer, and it is going to be very quick.

Now, what about a multi-term keyword query? We talked about the vector space model, for example, where we match such a query with a document and generate a score based on aggregated term weights. In this case it is not a Boolean query, but the scoring can be done in a similar way. It is essentially like a disjunctive Boolean query, like "A OR B": we take the union of all the documents that match at least one query term, and then we aggregate the term weights. This is the basic idea of using an inverted index for scoring documents in general, and we will talk about it in more detail later.

But for now, let's look at the question: why is an inverted index a good idea? Basically, why is it more efficient than sequentially scanning the documents? Sequential scanning is the obvious approach: you compute a score for each document and then sort them. This is a straightforward method, but it is going to be very slow; imagine the Web, where there are a huge number of documents. If you do this, it will take a long time to answer a query. So the question is: why would the inverted index be much faster? Well, it has to do with the word distribution in text.
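As an illustration of how postings support Boolean queries, here is a minimal sketch of answering "A AND B" by a linear-time merge of two postings lists sorted by document ID; a union for "A OR B" works the same way. The postings used below are hypothetical.

```python
def intersect(postings_a, postings_b):
    # Answer "A AND B": walk two postings lists sorted by doc ID in
    # parallel, keeping only doc IDs that appear in both. Linear time.
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        doc_a, doc_b = postings_a[i][0], postings_b[j][0]
        if doc_a == doc_b:
            result.append(doc_a)
            i += 1
            j += 1
        elif doc_a < doc_b:
            i += 1
        else:
            j += 1
    return result

# Hypothetical postings: (doc_id, term_frequency) pairs sorted by doc_id.
print(intersect([(1, 1), (2, 1), (3, 1)], [(3, 2)]))  # [3]
```

For a general multi-term keyword query, one would instead traverse the union of the query terms' postings and accumulate each document's term weights in a score accumulator before sorting by score.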
So, here are some common phenomena of word distribution in text. There are some language-independent patterns that seem to be stable, and they can be characterized as follows. A few common words, like "the" or "a", occur very, very frequently in text, so they account for a large percentage of all word occurrences. But most words occur only rarely: there are many words that occur just once in a document, or once in the whole collection, and there are many such words. It is also true that the most frequent words in one corpus may be rare in another; that is, although the general pattern holds broadly, the exact words that are common may vary from context to context.

This phenomenon is characterized by what is called Zipf's law. The law says that the rank of a word multiplied by the frequency of the word is roughly constant. Formally, if we use F(w) to denote the frequency of a word and r(w) to denote its rank, then the formula is F(w) = C / r(w)^α. It says the same thing in mathematical terms, where C is a constant and α is a parameter that can be adjusted to better fit empirical observations. If we plot the word frequencies in sorted order, we can see this more easily: the x-axis is the word rank, r(w), and the y-axis is the word frequency, F(w). The curve shows that the product of the two is roughly constant.

Now, if you look at these words, you can see they can be separated into three groups. In the middle are the intermediate-frequency words. These words tend to occur in quite a few documents, but they are not like the most frequent words, and they are also not very rare. They tend to be the words used in queries, and they also tend to have high TF-IDF weights. If you look at the left part of the curve, these are the highest-frequency words. They occur very frequently; they are usually words like "the", "we", "of", et cetera. Those words are very, very frequent, and they are in fact too frequent to be discriminative, so they are generally not very useful for retrieval. They are therefore often removed, and this is called stop word removal. You can pretty much just use the word counts in the collection to infer which words might be stop words: those are basically the highest-frequency words. They also occupy a lot of space in the inverted index; you can imagine that the postings entries for such a word would be very long. Therefore, removing such words saves a lot of space in the inverted index. The right part of the curve is the tail, which has a lot of rare words. These words do not occur frequently, and there are many of them. They are actually very useful for search, if a user happens to be interested in such a topic; even though they are rare, retaining them allows us to match such documents accurately, and they generally have very high IDF.

So, what kind of data structures should we use to store an inverted index? Well, it has two parts, right? If you recall, we have a dictionary and we also have the postings. The dictionary has a modest size; although for the Web it is still going to be very large, compared with the postings it is much smaller. We also need fast random access to its entries, because we will look up query terms very quickly.
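To see the Zipf pattern on a corpus of your own, here is a minimal sketch: count word frequencies, list words by descending frequency, and print rank times frequency, which should stay roughly constant. The file name is a placeholder for whatever corpus you run it on.

```python
from collections import Counter

def zipf_table(text, top=10):
    # Under Zipf's law (alpha ~ 1), rank * frequency stays roughly
    # constant; the top-ranked words are natural stop-word candidates.
    counts = Counter(text.lower().split())
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(rank, word, freq, rank * freq)

# Run on any sizable corpus, e.g. the text of a long book:
# zipf_table(open("corpus.txt").read())
```

On a sizable English corpus, the top of the table is dominated by words like "the" and "of", exactly the stop-word candidates the lecture describes.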
Therefore, we would prefer to keep such a dictionary in memory, if possible. If the collection is not very large, this is feasible; but if the collection, and hence the vocabulary, is very large, obviously we cannot do that. So the data structures we often use for storing the dictionary are structures that support fast random access, such as a hash table or a B-tree; and if we cannot store everything in memory, we use the disk and build a structure that still allows us to look up entries quickly.

The postings, on the other hand, are huge, and in general we do not need direct access to a specific entry. We generally just look up the sequence of document IDs and frequencies for all the documents that match a query term, so we read those entries sequentially. Therefore, because they are so large, we generally store the postings on disk; they have to stay on disk, and they contain information such as document IDs, term frequencies, term positions, et cetera.

Because the postings are very large, compression is often desirable. This is not only to save disk space, although that is of course one benefit of compression: the index will not occupy as much space. It also helps improve speed. Can you see why? Well, we know that input and output cost a lot of time in comparison with the time taken by the CPU. The CPU is much faster, but IO takes time. So by compressing the inverted index, the posting files become smaller, the entries that we have to read into memory to process a query term become smaller, and we reduce the amount of disk IO, which saves a lot of time. Of course, we then have to do more processing to decompress the data in memory, but as I said, the CPU is fast, so overall we can still save time. So compression here serves both to save disk space and to speed up the loading of the index.
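To make the compression idea concrete, here is a minimal sketch of one common scheme, chosen as an assumption for illustration since the lecture does not prescribe a specific method: store the gaps between sorted document IDs instead of the IDs themselves, then encode each gap with variable-byte coding, so that the small gaps typical of frequent terms take only one byte. Decoding simply reverses the process while scanning the bytes.

```python
def vbyte_encode(numbers):
    # Variable-byte encoding: 7 data bits per byte; the high bit marks
    # the last byte of each number. Small numbers fit in a single byte.
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 0x80  # flag the final byte of this number
        out.extend(chunk)
    return bytes(out)

def compress_postings(doc_ids):
    # Store gaps between sorted doc IDs; for frequent terms the gaps
    # are small, so they compress well under variable-byte coding.
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

print(len(compress_postings([3, 7, 154, 155, 1000017])))
# 8 bytes, versus 20 bytes for five raw 32-bit document IDs
```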