So I hope you sort of found it a little bit fun to do the inverted indexes the hard way using only SQL statements and inventing our own inverted indexes. But in reality, you don't want to do that. Really the whole purpose of that was so that you understood inverted indexes better. Meaning that you could, if left alone on a desert island with nothing but a many-to-many table, you could make this all work. But now I want to talk about the inverted index that we have built into Postgres and they can do so much more optimization than what we could do with the joins that we were using before and so it's always better to use these indexes. And there's basically two primary inverted indexes. There's a third one that's kind of like SP-GIST which is for spatial data. I wanted it to be like the Special GIST, but it wasn't, it's Spatial for things like data points in a three-dimensional place or latitude and longitude, things like that. So there's two of them, it's called the GIN, the Generalized Index and then there's the GIST, Generalized Search Tree. And just like in B-trees for normal indexes, the prefered text search-type index is just GIN. It has exact matches, it never looks at more rows than it needs to, just like B-trees, it's like this, very like look, find, got a list of rows, pull them in, those are the rows you need. They are very efficient on lookup and search and they can be costly when inserting or updating data, and so sometimes you sort of put all your data in and then you build the GIN later. That's a common technique in these inverted indexes. But if you're just inserting once in a great while, it's not that bad and so it's not terrible. So the GIN is like, if you ever no deep understanding of how you want your indexes to work, just use the GIN, be done with it. The GIST is a sort of reduced size, it uses hashing. And if you remember hashing, hashing throws away information but it makes the calculation simpler, but then more than one thing can hash to the same place and so then you've got extra work when the hash function doesn't spread stuff out the right way. And so it all depends on the nature of the data being hashed, how well the hash function works, but GIST is smaller and it's quicker to update. And so if you're grinding, if you're very update heavy, insert or update heavy, and the query performance is less important than the insert performance, because a lot like, say you're building a blog, right? How often do they put in new blog posts? Like once or twice a day? Fine. The querying it, the searching it, you're going to do over and over again. So the GIN is the fact that the insert is a little more expensive, the GIN. But if you're doing something like watching sort of a Twitter feed or something and pulling data in, and pulling data in, pulling data in, and then once in a while you do a query, well then a GIST is probably the better idea. So it's all really a balance of insert versus query. And a lot of the things in database indexes are a balance between insert and query. The downside of the GIST is that once it passes a value through it, like a word through it, it will get a list of blocks to read and then find rows in that block and it might read more blocks than necessary. The GIN will not read more blocks than necessary, it will read blocks that have the rows, that have the thing in it, that you're looking for. But what the GIST does is it, if it's got extra blocks, it doesn't know that but then it reads the blocks and checks the rows and says "Yeah, because of the hash I pulled in two of them but it's really only in one of them." And so it doesn't actually give you extra rows. You don't have to like add an extra WHERE clause, because within the WHERE clause, it still discards them. So you could think all that the WHERE clause is like, pull stuff in and discard stuff that we don't need. And so it just happens to read a bit more extra from the disk, it doesn't change it. So I'm not really going to spend a lot of time showing you examples of how you use GIST because it works exactly the same as the GIN. You just use it and it's all under the covers and it's all about efficiency and so we don't worry too much about that. So as we'll see later this concept of operations. The GIST and the GIN index know a lot about the data that they're indexing. And again, just like anything in databases, the more precision with which we can describe to the database engine the things that we want to do, the more efficient it can be. And so one of the things that the inverted indexes do is they have what are called operations and they need to know what kind of things they are going to operate and they will build subtly different indexes depending on how it works. Both of them work on arrays, which is sort of like stuff that's long and has pieces within it. That's the best way to think of arrays. And they use this operator, this less than at sign which is like contained within. Is this array overlap this array? So if this array has one thing, two things, or three things. This other array can be very large and if it has those three things among any of those all the things if there's an intersection, that's when it matches. And so let's take a look at how this works and you also tell it, oh, this is an array of integers, this is an array of text, etc. And the thing we're going to do is all based on text. So let's take a little tiny example here that we did in that whole long demonstration before that we did it by hand and it all basically. So we're going to create the exact same table. You can just drop these indexes or drop the tables. I just start each one. I would say DROP TABLE docs, DROP INDEX gin1 if you wanted to when you re-run these over and over and over again. So my examples have those actual drop commands in there. But here's the important thing, right? CREATE INDEX gin1 ON docs USING gin parentheses. And so there's an expression. Now the key here is this expression inside this index creation has got to to match the expression that we're going to use in the WHERE clause, because the WHERE clause is the essential part where we're trying to give a hint to Postgres as to how to throw rows away more quickly. I mean, how to throw blocks away more quickly. Okay, there's 10,000 blocks in here, but because I've used the index, I'm going to pull in three. And so this thing is like thousands of times faster. Right? But I'll admit, it took me a while to get all these things exactly right. But the thing that's important is that when you're making an index and you're making a WHERE clause, that this expressions match perfectly, right? It turns out in the SELECT, there's other things you can put as expressions in the SELECT. They're less significant and you can do more complex things. When we see ranking we're going to see this. The WHERE clause is where the cost of SQL queries happens. So the WHERE clause is very, very, very important. So we're making a generalized index, string_to_array on the column doc split by spaces and we're going to use text operators, okay? And that's just matching strings, okay? At this point, we're not doing natural language. We're going to do natural language in a bit, okay? And so this is really a very crude, very pure string-based, kind like a split. What you might do in Python to turn a stuff with blanks in it into literal strings. There's no sense that this is actual language, even though the examples I'm using are language, there is no understanding of language. We'll see that when we do this next. So we're saying okay, let's make an index and let's insert some stuff. And like always, the index is just being built while the inserts are happening and the index magically happens. And then we can do a SELECT and the key thing to this SELECT is it's using this less than at sign operator. So we have to pass our constant, as it were, to say "oh, here's an array." So that curly brace, learn, curly brace is a one-element array with the string learn in it, that's what that syntax is. I mean, technically that's curly brace, learn, double colon, text, open bracket, close bracket, which is converting it to a text array, but it's an array of one item. So that's kind of the serialized form of a text array. And then string_to_array of doc, this is to say, "Oh, remember that index I just made? You could run this on every row and then do the string_to_array on every row and check the WHERE clause. Or the fact that you've done an index and it gives you a clue to what rows you're dealing with." Right? So we can do an EXPLAIN on this same query. So if you do a query, it just happens and gives you the rows you're supposed to get with the word learn in it, The EXPLAIN, and it doesn't, let's see, how many, do we have any other? No. So you can do a EXPLAIN and that tells you the query plan and the key thing when you're looking at these query plans is again, it takes me a while to debug these. Just because I have these done in an instant in these lectures doesn't mean that it didn't take me a little while and some Googling and some Stack Overflowing to figure this out. But what you don't want to see is sequential scan. When you see sequential scan, that means that you failed. That means that you made this cool index, but whatever your WHERE clause was, it didn't activate the index. Because that's what you're trying to get. Now Heap Scan, Recheck, Index Scan, all those are good things and that's what you want to see that basically says that you have successfully made an index and you've successfully made a query that uses those index. And if this thing matters, you will take every query that you were ever going to use that you expect to use that index and you'll do an EXPLAIN on it to make sure that you're getting it right. So that's a lot of talk about very few lines. I mean, if you go through this, it's just like two lines that matter. It's the creating of the index and the array expression inside that index, and then there's the WHERE clause that uses the less than at sign. And so it's very simple, it's very elegant. You will find this so easy to use that you'll like use it whenever you need it and just say "Thank you very much, Postgres, for putting all this work in." So up next. Up next, clear all that. Up next, we're going to start switching from pure string indexes to natural language indexes. Okay? [MUSIC]