Okay. The last lecture, here we are. You have made it to the end of the line, as the Traveling Wilburys sing. As you know, my favorite Beatle is in the Traveling Wilburys, so that is a great way to end the semester.

Now, we're going to import a lot of stuff this time because we've got a lot that we want to do semantically, a lot of the usual suspects. We've got Matplotlib. We've got PyVis, which we worked with last time. We're going to import the Natural Language Toolkit, and we're going to import NLTK's stopwords. There's a lot we can do in natural language processing to get down to those key features. Up until this point, we've really only done preprocessing using tmtoolkit. While that's a great package, we can do natural language processing outside of it, and that's what we're going to do here, just to show you another way. It's the same dataset, so if you've already got it downloaded, you don't need to download it again; it's that same tweet JSON dataset.

Now, what I've done for you here is create a set of functions that each do something different to the data. What do you think the remove-URL function does? Of course, it removes URLs from a tweet. What do you think the tokenize function does? Of course, it lowercases the text, splits it into individual tokens, and returns those words to you so you can keep processing. You can stem words using the stem function here. You can remove stop words from your data using the stopwords function here. You can use the lemmatizer function to lemmatize text, you can remove punctuation, and you can remove words below a certain length. I have a check here that says: if the length of a feature is one character or less, so something like "a" or "I", remove it. If you want to remove all features with two characters or fewer, just change the 1 to 2. These cleaning functions rely on a few different packages, but this is a really good cell to hold onto, because if you're ever doing natural language processing and want to quickly take a sentence and process it, this is the way to go.

We're going to load in the JSON file, enumerate through the tweets just as we did before, pull out the text of each tweet, and then run it through a series of processing steps. The first thing I think we should do is remove the URLs from the text, because they don't tend to provide any real meaningful information and they can hurt the quality of a network analysis. Then we tokenize the text, which again means breaking a sentence down into individual tokens. Then we remove stop words from our tokens, lemmatize the text to get the words down to their roots, and then remove any extra punctuation that might be lying around after all of that. There's some additional stuff we could do. We could print these things out as we go to see how they change, and that's why I leave those statements here. As you manipulate and preprocess data, it's nice to watch the stream come through and ask, "Okay, this is what the text looks like now. Is this actually filtering or removing what I want from the data?" and so on. Once I've gotten that feature set down to something that's really nice, I need to create a list of all the words I'm going to put into my network analysis.
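To make that cell concrete, here's a minimal sketch of what those cleaning functions and the step-by-step pipeline might look like using NLTK. The function names, the URL regex, and the sample tweet are my assumptions for illustration, not necessarily the notebook's exact code:

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Run nltk.download("stopwords") and nltk.download("wordnet") once if needed.
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
wnl = WordNetLemmatizer()

def remove_url(text):
    # Strip anything that looks like a URL from the raw tweet text.
    return re.sub(r"https?://\S+", "", text)

def tokenize(text):
    # Lowercase the text and split it into individual word tokens.
    return text.lower().split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in stop_words]

def stem(tokens):
    return [stemmer.stem(t) for t in tokens]

def lemmatize(tokens):
    return [wnl.lemmatize(t) for t in tokens]

def remove_punctuation(tokens):
    # Strip leading/trailing punctuation, then drop features of one
    # character or less (change the 1 to a 2 to be stricter).
    stripped = [t.strip(string.punctuation) for t in tokens]
    return [t for t in stripped if len(t) > 1]

# The step-by-step pipeline described above, for a single tweet's text.
text = "Check out my new shoes https://example.com"
tokens = remove_punctuation(lemmatize(remove_stopwords(tokenize(remove_url(text)))))
print(tokens)  # -> ['check', 'new', 'shoe']
```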
This dictionary, called unique_words, is going to tell us not only which words could go into the network analysis, that would just be the keys of the dictionary, but also, through its values, the number of times each word is mentioned. We've got 98,000 unique words, and that's way too many to put into a semantic network analysis. Even with this preprocessing, we have more to go.

What I really recommend you do is start playing with the number of unique words you want to include by removing uncommon words. Of course, you can remove the most common words instead by reversing this logic. But it's pretty simple: you go through the unique_words dictionary, and if the value stored there is greater than 300, we're going to use that word. That's just a heuristic; you could set that threshold to whatever you want. You just don't want more than 1,000 words in your semantic network analysis, and it will already be crowded enough with 923 words.

Again, any time we iterate through a JSON file we're going to open it back up, so we open it back up. Remember, a semantic network is not directed by default, so we're going to create a regular Graph, not a DiGraph, because there is no direction implied, and we're going to iterate through our data again. We open up our data, we've already done that, and we go through the text and process it step by step just as we did. Remember, on your project this is where you should be spending some time, going through and setting the parameters to get only the features that you want. Bonus points if you're able to use part-of-speech filtering, because that would be another way to remove even more features.

So we're left with a set of edges. How do we get to those edges? We know the no-punctuation text is going to be a list of tokens, a list of things. Now we need to take that list of things and turn it into semantic relationships. How do we do that? First, for each tweet we make a list of the words we're ultimately going to build semantic relationships from. For each word that comes out of the preprocessing above, we check whether it's a word we decided to include in that prior step, because it was used more than 300 times, and if so we append it to the good-ones list. Now we have a limited vocabulary for each tweet that corresponds to the words we want to map semantically into our network analysis.

Then we have a really nice one-liner that comes up with all the possible pairs of words in the good-ones list. If we have two words in a list, it's easy: if it's cat and dog, there's only one pair. If we have cat, dog, and octopus, we have cat-dog, cat-octopus, and dog-octopus. With three words it's still manageable, but with four or more it quickly becomes a mind bender to list every pair by hand; itertools.combinations takes a set of words and gives you all possible combinations.
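Here's a rough sketch of those pieces, counting how often each word appears, keeping only the common ones, and generating the word pairs with itertools.combinations. The names processed_tweets, unique_words, words_to_use, and good_ones are illustrative assumptions, and the toy data forces a tiny cutoff where the lecture uses 300:

```python
from itertools import combinations

# Stand-in for your list of cleaned, tokenized tweets.
processed_tweets = [["cat", "dog", "octopus"], ["cat", "dog"], ["dog", "octopus"]]

# Count how many times each word appears across all tweets.
unique_words = {}
for tokens in processed_tweets:
    for word in tokens:
        unique_words[word] = unique_words.get(word, 0) + 1

# Keep only frequently used words (the lecture uses > 300 on the real data;
# reverse the comparison to drop the most common words instead).
words_to_use = {w for w, count in unique_words.items() if count > 1}

# For one tweet, keep only the "good" words, then list every unordered pair.
good_ones = [w for w in processed_tweets[0] if w in words_to_use]
pairs = list(combinations(good_ones, 2))
print(pairs)  # -> [('cat', 'dog'), ('cat', 'octopus'), ('dog', 'octopus')]
```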
We're specifying the 2 parameter here because we just want pairs; if we wanted combinations of three, we would give it a 3, and so on. Now, for every combo in our combinations, we simply make a row: for each node in the combo we append it to a list, and then we add that pair as an edge in our graph, Word 1 and Word 2. That's it. That's all we have to do to prepare our semantic network analysis. You're going to see here that we have a graph with 923 nodes and a ton of edges. With that ton of edges, we're left with the understanding that even though there are only 923 nodes in this network, these words are highly related and frequently used with each other.

The plotting and the data format look exactly the same as before. I'm going to get some warnings here about some characters it doesn't like, and it's not actually going to display the image here in this notebook. The workaround is to simply save it as a figure. If we save this semantic network as a figure, it's going to look exactly like our last network, only now the nodes are words and the edges represent the connections between those words. We can also use our PyVis interactive network, but be aware that this is pretty slow with 923 nodes. If you're going to run this, click "Run", go get a cup of coffee, and come back in 30 minutes or so, and you might be able to get the interactive visualization to complete. Because this is buggy in its current state, we're not going to ask you to do that as part of the assignment; we just want you to use the basic NetworkX drawing functionality to create your networks.

But this is pretty much it. Creating the network here is just a matter of coming up with those word pairs. The magic, or where you should be spending your time to make things good, is in the preprocessing. You should prune as much as possible in preprocessing; 923 nodes, as you'll see, is unwieldy as is, and you're still going to want to filter further to get down to the more essential features. That's something I really think is so important in making a good data viz of a network: you've got to be intentional about how you filter it.
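Putting it together, here's a minimal sketch of that graph construction and the save-it-as-a-figure workaround, continuing the toy example from above; the variable names, layout arguments, and file name are placeholders rather than the notebook's exact settings:

```python
from itertools import combinations

import matplotlib.pyplot as plt
import networkx as nx

# Stand-ins for your cleaned tweets and frequency-filtered vocabulary.
processed_tweets = [["cat", "dog", "octopus"], ["cat", "dog"], ["dog", "octopus"]]
words_to_use = {"cat", "dog", "octopus"}

G = nx.Graph()  # undirected, since semantic relationships imply no direction

# For each tweet, keep the good words and add every word pair as an edge.
for tokens in processed_tweets:
    good_ones = [w for w in tokens if w in words_to_use]
    for word1, word2 in combinations(good_ones, 2):
        G.add_edge(word1, word2)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Basic NetworkX drawing, saved to a file instead of displayed inline.
plt.figure(figsize=(20, 20))
nx.draw_networkx(G, node_size=50, font_size=8, edge_color="lightgray")
plt.axis("off")
plt.savefig("semantic_network.png", dpi=200, bbox_inches="tight")
```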
That's pretty much it. That's Semantic Network Analysis 101. I'm so excited for you to dive into this, and remember: completing a network analysis is just the first step in creating something good. The next step is interpreting it. I've created a set of guidelines for interpreting these networks; it's in the rubric for the project. Take a look at it and try to answer as many of those questions as you can in your presentation. You're going to have a really informative piece on how people mention Nike, Adidas, and Lululemon on Twitter.

It has been a total joy having you in the class, and I am so grateful that I get to teach this material to you. I am here if you ever want to reach out. We have a facilitator for the course, as you know; start there if you have any general questions. But if you want to talk, or reach out, or tell me something you liked or didn't like, or something you want to see more of in another class or in a future version of this class, please do let me know. It has been a true joy and a privilege to be able to offer this set of courses to you. I hope you find them helpful. Please do reach out and give me some feedback on how all this went.

For now, I'm going to sign off and say: please enjoy the rest of your marketing data science coursework as you finish up your projects, and please enjoy the rest of your classes here in the MSDS at CU Boulder.