So that was pretty complex, and I hope you learned some SQL and had a little bit of fun, knowing that you did not have to depend on Postgres for your inverted, language-aware, natural language text index with GIN and all those other things. But the reality is, as we well know, the easy way to do this is to just let Postgres do all the hard work. Postgres already knows all these languages, and away it goes. So let's talk about the concepts you need to understand, and then how you make these indexes work.

There are two really important functions, and they are actually data types as well. The first is this notion of a ts_vector, a text search vector. I wasn't sure whether I should explain why they call it a vector, but I guess I will. If this next bit doesn't make sense to you, don't get too excited about it; I'm not going to ask you about it on an exam or anything. The idea is that you place a document in an n-dimensional space. Here I've drawn a little three-dimensional space, and you can put things at various places in that space, and then you look for closeness in that three-dimensional space. That's how you do it, except it's not just three dimensions, it's n dimensions. The vector is the vector from the origin to the particular document you've got, and you're trying to cluster documents in an n-dimensional space. If none of that made any sense to you, don't worry about it; it really doesn't matter. But that's why they call them vectors. You could also think of a ts_vector as a fancy array, because a vector is sort of a fancy array.

So there is this function in Postgres called to_tsvector, and you'll notice that it takes two parameters. The first parameter is the language we're working in. Remember, I told you how important language is, and I showed you that there are many languages Postgres already knows; I just happen to be using English. The second parameter is the string. So that basically says: hey, that's an English string, extract its features, the nature of that string: stemming, case folding, eliminating stop words. That's all in there; right there is all the work we did by hand with stemming and the rest of it. It also keeps track of the positions of the words, and there's a whole weighting mechanism that we don't have to worry about. Weighting might mean you want to weight the title more than the description, and you can even concatenate columns together to create metacolumns. But really, all to_tsvector does is extract the essential features of a phrase or document of text, along with the positions. You'll see later, when we get to ranking, that position matters a little bit. You can even ask for things like Python followed by SQL, or SQL followed by Python; in this particular case that's probably a bad search, but ordering can matter, and these ts_vectors keep track of the order of the words in the document. If you watched the Matt Cutts video, you'll know that if you search for Python SQL and the words are closer together, that actually means more than if they're farther apart. So the distance between the words matters when you're searching for more than one word. Okay? That's what the ts_vector does, and the ts_vector is generally the thing that we index on.
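Just so you can see what that looks like, here is a quick sketch of to_tsvector in action. The exact output can vary a little with your Postgres version and dictionaries, so treat this as an illustration:

SELECT to_tsvector('english', 'UMSI also teaches Python and SQL');
-- roughly:  'also':2 'python':4 'sql':6 'teach':3 'umsi':1

Notice that the stop word "and" is gone, "teaches" has been stemmed down to "teach", everything is lowercased, and each word carries its position in the original string.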
So if we go back to what we did before, we indexed on a string array. Now we're indexing on this ts_vector, and Postgres knows everything there is to know about ts_vectors, because Postgres invented them, and the indexes are very smart about ts_vectors, okay? Then, when we want to do a WHERE clause, we create a thing called a ts_query. The ts_query basically says: take this query string, break it up, apply the stemming rules, apply the language rules, apply the stop word rules. You can have more than one word; this is a really simple one. You can see in this case that when I do SELECT to_tsquery('english', 'teaching'), the result is a stemmed version of the word teaching: teaching stems down to teach. And you can see the same stemming happened on the document side: in "UMSI also teaches Python and SQL", teaches stemmed down to teach. So in the query, we've got to stem it down as well. It turns out there's a technical term for this whole idea of stemming the queries, and it's in one of those Wikipedia articles: it's called conflation. I don't exactly know why they picked that; it's a cool word, and I've never used conflation in a sentence. Applying the stemming rules to the input of a ts_query, so that the output of the ts_query matches the stemmed document, is conflation. So now I get to use conflation in a sentence.

Then, to make a WHERE clause, the most typical thing is to combine a ts_query and a ts_vector. You could actually ask whether this ts_vector is equal to that ts_vector; that works too. But this double at-sign operator is the cool thing. You basically say: here's a query, to_tsquery('english', 'teaching'); is that sort of contained within to_tsvector('english', 'UMSI also teaches Python and SQL')? In that case, it's true. The query matches that text, even though teaching is not in the string; teaches is in the string. So you see how this softer but clever matching works. I think that's pretty cool. Just to summarize: ts_vector takes any document and reduces it to its essence, including positions, and ts_query takes any string and reduces it to its essence in a way that can be matched against a ts_vector.

You did this the hard way in SQL; now we're going to do it the easy way in Postgres. We're going to create our same table, docs, with just one column holding a document, and then we're going to create a GIN index using to_tsvector, that function that converts a column into a smart array with all the stemming, etc., and we have to tell it which language we're dealing with. That's why you need to know which language these things are in. You'll notice that this is actually simpler than the string array index, because I didn't have to specify the operators. That's because GIN and ts_vectors are good friends and they know what's going on. There is an operator class for it, but it's like: oh, it's a ts_vector, I know what you're going to be doing, you're going to be doing the double at-sign operator, I know exactly what you want, here you go, I'm taking care of it. So in a way, the natural language index is simpler than the pure string generalized index. And we're just using GIN; you could use GiST, but going back to what I said before, just use GIN unless you have a reason to use GiST. So we throw some things in there, and they fill up the index, and then we do a WHERE clause, and we can say to_tsquery('english', 'learn') @@ to_tsvector('english', doc).
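Going back to conflation for a second, here is a small sketch of what that looks like in isolation, using the same example string as above:

SELECT to_tsquery('english', 'teaching');
-- 'teach'

SELECT to_tsquery('english', 'teaching') @@
       to_tsvector('english', 'UMSI also teaches Python and SQL');
-- t  (true: "teaching" and "teaches" both reduce to "teach")

Both the document side and the query side get stemmed, so the match succeeds even though the exact word "teaching" never appears in the text.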
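And here is the whole flow as a sketch. The table name, column name, and sample rows here are just made up for illustration; the point is the GIN index built on to_tsvector and the matching WHERE clause:

CREATE TABLE docs (id SERIAL, doc TEXT, PRIMARY KEY(id));
CREATE INDEX gin1 ON docs USING gin(to_tsvector('english', doc));

INSERT INTO docs (doc) VALUES
  ('This is SQL and Python and other fun teaching stuff'),
  ('More people should learn SQL from UMSI'),
  ('UMSI also teaches Python and also SQL');

SELECT id, doc FROM docs
 WHERE to_tsquery('english', 'learn') @@ to_tsvector('english', doc);
-- returns row 2, the one containing "learn"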
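To check that Postgres will actually use that index, you can put EXPLAIN in front of the same query. The exact plan depends on your data size, settings, and Postgres version, but you are hoping to see something along these lines:

EXPLAIN SELECT id, doc FROM docs
 WHERE to_tsquery('english', 'learn') @@ to_tsvector('english', doc);
--  Bitmap Heap Scan on docs
--    Recheck Cond: (...)
--    ->  Bitmap Index Scan on gin1
--          Index Cond: (...)

The thing you do not want to see there is Seq Scan.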
Now again, if this didn't have an index, Postgres would simply retrieve every row, compute the to_tsvector, and then check whether it matched the ts_query with the double at-sign. But that's not what we want. And of course we find the row: row 2 is the one that matches learn, right? Then if we do an EXPLAIN on it, as in the sketch above, EXPLAIN SELECT id, doc FROM docs WHERE to_tsquery('english', 'learn') @@ to_tsvector('english', doc), Postgres looks at the WHERE clause, matches the expression in the WHERE clause against the expression that was used to build the index, and says: you know what, I can use that index. Again, if you fail, you get a sequential scan, and sequential scan is not our friend. Here there's no sequential scan; instead we get a nice little index scan and heap scan, and in general anything but a sequential scan is a success. So again, you're going to do this over and over in little sample examples until you get it right, and then you're going to want to run an EXPLAIN on every single query you're using, to make sure things didn't fall apart somewhere, right? Maybe you added an AND clause, or who knows what it is that you're going to do, okay? So that pretty much gets us to the point where we have created an index, and we can query that index, and it was pretty easy. Up next, we're going to see how to rank the results, again taking a cue from Google and the other search engines. [MUSIC]