Topic modeling is a second technique for extracting meaning from text data. While sentiment analysis just gives you a single value, positive versus negative, topic modeling is about trying to figure out what different themes are being talked about in large bodies of text, and being able to code each piece of text to tell you which themes are present. How do we do this? Well, I think, again, it's useful to imagine explaining it to a friend, say an academic with zero emotional intelligence. How do you do this? How do you go through all this text and extract the key themes? What might you do?
As a nice example, in one of my classes I asked my students to write a very short description of the culture of the last organization that they worked at. I've got a few of them here. Imagine that our task was to figure out the main dimensions along which people describe organizational culture. What are the things that come up frequently when they describe the culture? What do we do? We can read through these, and after reading through them, we can start to see some common themes emerge. For example, a couple of them talk explicitly about work hard, play hard. That seems to be a dimension. Another thing that came up a lot was collegiality. People talk about being collegial, collaborative, team-based, all of those sorts of things. Maybe impact would be another one. Some talk about big societal challenges and being impact-oriented. So you start to see some themes emerge, and what we do is start to list those themes.
Now, our friend is wondering: I see those in a few, but I've got to go through 500 or 1,000 of these and figure out how many of them talk about collaborativeness and how many talk about work hard, play hard. How do I do that? Well, let's take each of these themes.
For each of these themes, we can build a little dictionary. We can identify the words that are associated with it. For the work hard, play hard one, work is going to appear a lot anyway, but if a description has play and it has hard and it has the word work in there as well, it's probably about work hard, play hard. For collaborative, we're going to use words like collaborative, collaborate, team, collegial, all of those sorts of things. For each theme, we can create a set of words that are associated with it.
That's the basics of what topic modeling does. It assumes that when we've got lots of different documents, lots of descriptions of the culture or the work or whatever it is people are describing, each of those documents will contain a small number of topics, so there are a few themes that occur in each of them. One person describing the culture will talk about work-life balance and impact. Somebody else might talk about it being collegial and hard work. Each document has only a couple of topics, or a small number of topics, and each of those topics is associated with a number of words. Again, if it's collegial, those words are things like team, collaborate, collegial, and so on.
Now, we don't actually see the topics. If you think about what the computer sees, it sees all the text. It sees each document, and each of those documents has a set of words associated with it. It then tries to figure out what set of associations between topics and documents, and between words and topics, would be most likely to lead to this distribution of words across documents. It just uses all the words in the documents to try and figure out which words belong together in topics, and which of those topics are present in each text. That's all that it's really doing.
It does have a couple of important limitations. I said that the computer doesn't see the topics; it figures them out.
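To make the hand-built dictionary idea from this passage concrete, here is a minimal sketch in Python. The theme names and word lists follow the examples given in the lecture, but the thresholds, the `code_document` helper, and the sample descriptions are all assumptions made up for illustration, not something taken from the class data.

```python
import re
from collections import Counter

# Each theme maps to (associated words, how many of them must appear).
# The word lists echo the lecture's examples; the thresholds are assumptions.
THEMES = {
    "work hard, play hard": ({"work", "play", "hard"}, 3),
    "collegiality": ({"collaborative", "collaborate", "team", "collegial"}, 1),
    "impact": ({"impact", "societal"}, 1),
}


def code_document(text):
    """Return the themes whose dictionary words appear often enough in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [
        theme
        for theme, (words, min_hits) in THEMES.items()
        if len(tokens & words) >= min_hits
    ]


# Hypothetical culture descriptions, invented for this example.
descriptions = [
    "Very collegial, team-based place where people collaborate.",
    "Classic work hard, play hard culture focused on societal impact.",
    "Long hours and hard deadlines.",
]

coded = [code_document(d) for d in descriptions]
theme_counts = Counter(theme for themes in coded for theme in themes)

for description, themes in zip(descriptions, coded):
    print(themes, "<-", description)
print(theme_counts)
```

The point of the sketch is only that, once each theme has a word list, coding 500 or 1,000 descriptions is mechanical. Topic modeling differs in that it tries to work out the word lists themselves from the text, rather than having us write them by hand.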
Actually, it can do that for any number of topics. What the computer doesn't do a good job of is telling us how many topics are present. So one thing we usually have to do is tell it: okay, let's say there are 15 ways of describing culture. If there were 15 ways of describing culture, what words would be associated with each of those themes, and which documents would contain each of them? That's one limitation: we have to tell it how many topics to look for. The second limitation is that it doesn't actually tell us what those topics are about. It just tells us there is a topic that appears in these documents and has these words associated with it. What we end up doing is looking at all of the words associated with a topic and saying, okay, given these words, this has to be what the topic is about. It's a little fussy to do in practice. But I have to say, I've tried it a few times, and sometimes it works well, particularly with large documents. So it's a way of extracting key themes from large volumes of text and then being able to code each of those documents, so that if we want to pull out all the answers that talk a lot about work-life balance, we can say it's these ones. It can be very impressive.
There's a nice example of this in a study by some people at Stanford and Berkeley, where they tried to measure culture across organizations. I don't know how many of you have come across the website Glassdoor. Glassdoor is a website where you can go and basically write about your company's culture and what it's like to be employed there. The idea is that it's mainly for job seekers, so you can find out about lots of different organizations. But in order to learn about other organizations, it encourages you to describe your own employer. Which is really cool, because it means that suddenly we have millions of people who've written about their employers, and so we can start seeing what they like.
In this study, what they did was basically pull out every sentence that contained the word culture, to try and understand, again, when people write about culture, what are they writing about? They used about 100 topics, so there could be 100 different dimensions of culture. What I've got here is some examples of the topics that they came up with and the words associated with them. We can see that some of them worked really well. For example, we have a topic on hostile management. What are the words associated with it? Well, obviously management and employee appear quite a lot, but then there's hostile, unprofessional, abusive, favoritism, bullying, bad, horrible, rude, disrespect. It seems like this is picking up a very clear theme. For work-life balance: work-life balance, good, healthy, flexible, personal. Some of the others seem a little strange, but again, you see that throughout it's picking up similar words. So you are able to start to pull out a bunch of these themes from different texts.
This is how we can use topic modeling in practice. We can ask our employees in pulse surveys and other short questions: just tell us a little bit about what you're liking and what you're not liking. What would you like us to know? We get fairly short answers from them, and we can end up with multiple thousands of these in an organization. By running topic modeling, we can very quickly identify, across all of those thousands, what are the themes that appear? How common are those themes? By looking at it over time, we can look at how themes are changing, which themes are becoming more common and which less common. Which themes do we see being more common in each department? Not only can we talk about the sentiment, but we can even get a sense of what people are worried about, and do this at scale and over time. Yes, we could pay somebody to go through and read all of them, but this way we can do it so much more quickly and so much more effectively.
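As a rough illustration of what "running topic modeling" on short survey answers involves, here is a small sketch. The lecture does not name a specific algorithm, so the use of latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling is an assumption, as are the made-up answers, the stop-word list, the choice of two topics, and every function and parameter name. It uses only the Python standard library so that the mechanics are visible; in practice one would more likely reach for an established library.

```python
import random
from collections import Counter

# Hypothetical pulse-survey answers, invented for illustration.
ANSWERS = [
    "flexible hours and healthy work life balance",
    "good balance between work and personal life",
    "management is rude and unprofessional",
    "hostile management and favoritism toward some employees",
    "flexible schedule helps my personal life",
    "managers are abusive and rude to employees",
]
STOP_WORDS = {"and", "is", "are", "to", "my", "some", "between", "toward"}
NUM_TOPICS = 2  # we have to choose this; the model does not tell us


def tokenize(text):
    """Lowercase, split on whitespace, and drop a few common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (doc_topic, topic_word) counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # words per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total words per topic
    assignments = []
    # Start by assigning every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assignments[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this word out of the counts, then resample its topic.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # A topic is more likely if it is common in this document
                # and if this word is common in that topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


docs = [tokenize(a) for a in ANSWERS]
doc_topic, topic_word = fit_lda(docs, NUM_TOPICS)

# The words most associated with each topic; naming the topic is up to us.
top_words = [[w for w, c in counts.most_common(5) if c > 0] for counts in topic_word]
# Share of each answer's words assigned to each topic.
shares = [[c / sum(counts) for c in counts] for counts in doc_topic]
# How many answers have each topic as their largest share.
dominant = Counter(max(range(NUM_TOPICS), key=row.__getitem__) for row in shares)

for k, words in enumerate(top_words):
    print(f"topic {k}: {words} (dominant in {dominant[k]} answers)")
```

Both limitations mentioned earlier show up directly: `NUM_TOPICS` has to be supplied by us, and the output is only a list of words per topic, which a person still has to read and label. With so few short answers the topics will not be very stable; the counts of dominant topics are the kind of number one could then track over time or compare across departments.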
When it comes to how we track engagement, machine learning has opened up some really interesting possibilities, particularly when it comes to finding different ways to get a sense of what people are feeling. With sentiment analysis, we have a great way of taking any form of text that people are writing and getting a very quick pulse check on the overall level of positivity. With topic modeling, we can go way beyond that. When people are filling out surveys or otherwise giving us open text, it gives us a great way to analyze that text, very quickly pick out the key themes that are most common, and see how those are changing. These applications, I would say, are still in their early days. We're seeing a number of companies starting to employ them, particularly topic modeling, to analyze text. But in the future, I think these can be a very valuable tool when it comes to tracking engagement, and a nice complement, I would say, to some of the more arduous annual surveys that we continue to see.