Latent Dirichlet allocation (LDA) is an unsupervised machine learning algorithm that attempts to automatically divide a corpus of documents into groups of logical topics. The applications of this algorithm are numerous. Properly applied, LDA can identify the various topics discussed in any single document, categorize individual documents into their most dominant topic, or suggest documents that are intuitively related to each other.
LDA was originally proposed in a paper by David Blei, Andrew Ng (my hero) and Michael Jordan (not that one) back in 2003. It has been deployed in all sorts of academic and commercial projects since then. LDA is supported in R via the "topicmodels" package and in Python via the "gensim" package, making it a welcome addition to the toolbox of every machine learning engineer working today.
In this article we attempt to provide the "big picture": visualize what LDA does, look at examples where it succeeds, and identify situations where it struggles. In a future article, we may dig deeper into the mathematics behind LDA to better understand its nuances, but here we are focused on the "40,000 foot view".
Reddit, "the front page of the Internet", is a popular website that allows its users to share content into a series of subreddits. Each subreddit focuses on a specific niche of content, making it an ideal place to study the categorization of topics. The data analyzed below was collected between December 14, 2017 and January 12, 2018.
We collected 123,121 headlines from 29 different subreddits, as shown above. We gathered content from the "new" section of each subreddit, which includes both popular and unpopular content. Below we have visualized the 15 words that appear most often in the headlines of each of these 29 subreddits. The size of each "bubble" corresponds to the word counts in each subreddit.
Although adequate for the purposes of this article, this data is still in rough form. Before doing our final LDA analysis, we would need to spend more time on our text preprocessing. For example, the "relationships" subreddit contains odd words (e.g. "21f", "22f") due to the syntax of submissions (which includes age and gender). Beyond unnecessary stop words, the raw data contain all sorts of odd text fragments (e.g. "----------") that should also be removed.
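As a small sketch of that cleanup step, here is one way those fragments might be filtered out. The stop word list and the patterns below are illustrative, not exhaustive:

```python
import re

# Tiny illustrative subset of a real stop word list.
STOP_WORDS = {"the", "and", "be", "to", "of", "a", "in"}

def clean_tokens(headline):
    """Drop stop words, age/gender fragments like '21f', and punctuation-only tokens."""
    kept = []
    for tok in headline.lower().split():
        if tok in STOP_WORDS:
            continue
        if re.fullmatch(r"\d+[mf]", tok):   # "21f", "22f", "30m", ...
            continue
        if not re.search(r"[a-z]", tok):    # "----------" and similar junk
            continue
        kept.append(tok)
    return kept

print(clean_tokens("i 21f need advice ---------- about the situation"))
# -> ['i', 'need', 'advice', 'about', 'situation']
```

Real preprocessing pipelines usually go further (stemming or lemmatization, a full stop word list), but the principle is the same: remove tokens that carry no topical signal.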
Most important to consider: the amount of data we've collected from subreddits such as "business" and "science" is dwarfed by the amount of data collected from subreddits such as "politics" and "gaming". Given such an imbalanced source of data, the topics selected by LDA may result in some odd choices (more on this later).
Below we have graphed the most common 30 words that appear in the headlines of the "Bitcoin", "gaming" and "politics" subreddits, along with their corresponding word counts. We have more than 10,000 headlines from each of these subreddits, and the topics of each subreddit are very different, as evidenced by the most common words in each.
In the /r/Bitcoin subreddit you will find discussions of "coinbase" and "blockchain". In the /r/gaming subreddit you will find discussions of "xbox" and "mario". In the /r/politics subreddit you will find discussions of "trump" and "mueller". This makes intuitive sense. Our human brains can easily understand that these groups of words clearly represent three distinct topics.
This is essentially how the LDA algorithm works: analyzing word distributions to divide a corpus into k distinct topics. What makes LDA so powerful is that it can perform this task even without any prior categorization by humans.
One of the assumptions made by Latent Dirichlet allocation is that words which often appear together in the same document likely belong to the same topic. We must remain mindful of this when deciding how to define our "documents". For example, when it comes to this Reddit data, we might be tempted to treat each individual headline as an individual "document".
However, most headlines are extremely short, and they rarely repeat words of significance. That is, aside from common stop words (e.g. "the", "and", "be"), most of the words used in a given headline are unique. Consider the headline:
Durbin calls on White House to release tapes of Trump's remarks about African countries
Because each word is used only once, no word has a count greater than any other (they all equal 1). Without any unique word distribution, no useful pattern can be detected, and the LDA algorithm will fail to provide useful results.
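We can see this directly by counting the words in that headline with a few lines of Python: every count is stuck at 1, so there is no distribution for LDA to work with:

```python
from collections import Counter

headline = ("Durbin calls on White House to release tapes "
            "of Trump's remarks about African countries")
counts = Counter(headline.lower().split())

print(counts.most_common(3))
# Every word appears exactly once; no word count rises above the rest.
```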
In section 6.2 of "Text Mining with R: A Tidy Approach", authors Silge & Robinson offer a clever idea for testing the accuracy of LDA. They took the complete text of four books, divided into chapters, and treated each chapter as a unique "document". They then trained an LDA model to define 4 topics, categorized each "document" (chapter) into its dominant topic, and measured how many "documents" (chapters) were correctly matched to the same book they originally came from.
Their LDA model only misclassified 2 chapters. It correctly classified all of the remaining chapters from all four books.
Inspired by the work of Silge & Robinson, we treated all of the headlines from each subreddit as a single "document" when training the following LDA models. This provides a word distribution unique enough for LDA to perform properly.
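A minimal sketch of that pooling step, using a hypothetical list of (subreddit, headline) pairs in place of the real scraped data:

```python
# Invented sample data; the real corpus has 123,121 headlines.
headlines = [
    ("Bitcoin", "coinbase adds new blockchain feature"),
    ("Bitcoin", "bitcoin price hits new high"),
    ("gaming", "mario kart coming to xbox never"),
    ("politics", "mueller subpoenas trump organization"),
]

# Group every headline under its subreddit...
grouped = {}
for subreddit, text in headlines:
    grouped.setdefault(subreddit, []).append(text)

# ...then join each group into one long token list: the "document".
documents = {sub: " ".join(texts).split() for sub, texts in grouped.items()}
print({sub: len(toks) for sub, toks in documents.items()})
```

Each subreddit now contributes a single long "document" whose repeated words give LDA a usable distribution.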
We trained our first LDA model to define 3 unique topics, based on the 3 unique "documents" we defined above. Given these ideal training circumstances, our LDA model correctly identified 3 topics to match our 3 subreddits. We can see that the word distributions (β) of our LDA topics are a near-perfect match when compared to the word counts of the original subreddits.
When we sort the top 30 words in our LDA topics according to word distribution (β), we see the same top 30 words in our subreddits according to word count. Here we compare /r/gaming and topic 1 side-by-side:
It is worth noting that the topics returned by LDA have no meaningful names. Here we can see that topic 1 corresponds to "gaming", but that topic number is completely arbitrary, assigned based on the order in which LDA defined each topic, which is based on a random initialization. In other words, "gaming" could have been assigned to topic 2 or 3 just as easily. And depending on the quality of your LDA model, the contents of the topic itself may be arbitrary too.
This is where data science becomes more art than science.
Deciding on the right number of LDA topics, and deciding if these resulting topics make any sense to a human reader, still requires a significant amount of human intuition, and failing that, trial and error. We avoided those issues here by supplying 3 distinct, pre-existing, human-defined categories, and asking for 3 distinct LDA topics in return. This makes for a clear demonstration, but the real-world applications of LDA are rarely this simple. In some cases we want a more nuanced view. For example, we may wish to look at the top 2 or 3 topics within each "document".
Our goal here is to divide each subreddit into one of our 3 corresponding topics, and we do not want these topics to overlap. Once the LDA model has been trained, our gamma value (Γ) measures the distribution of LDA topics within each "document".
Here we find that nearly 100% of the words in our /r/Bitcoin, /r/gaming and /r/politics "documents" belong to LDA topics 3, 1 and 2 respectively. In other words, this LDA model has successfully matched its topics to our original subreddits, just as we intended.
In addition to looking at topics-per-document, LDA also allows us to look at topics-per-word:
The word "mario" comes from /r/gaming, "mueller" belongs in /r/politics, and "segwit" is a term from /r/Bitcoin. This makes intuitive sense. We can also see how confident this distribution is. The only word even remotely contested is "mario", which has a minuscule chance of belonging in /r/Bitcoin according to this LDA model.
Recall the example headline from above:
Durbin calls on White House to release tapes of Trump's remarks about African countries
Although this headline was not used in training, our LDA model has already encountered many of these words:
Durbin calls White House release tapes Trump remarks African countries
With the exception of the word "release", which has been wrongly classified to /r/gaming, all of these words clearly belong to /r/politics.
Not only can we see which words are classified into which topic, we can also see the distribution of each word. We can measure the relative significance of each classification. Note the different scales of word distribution (β) in the above visualization. We can see that the word "release", the most confused word in this example headline, is insignificant when compared to the scale of the word "trump".
What happens when we give the above LDA model a headline that has nothing to do with Bitcoin, gaming or politics?
Consider a headline from the /r/StarWars subreddit:
Liam Neeson seems open to returning as Qui-Gon Jinn in Obi-Wan Kenobi movie
Our current LDA model only recognizes two of these words:
Consider a headline from the /r/personalfinance subreddit:
Is there any valid reason to not max out retirement contributions if you can afford it?
Our current LDA model only recognizes four of these words:
reason max retirement afford
Our current LDA model isn't quite sure what to do with these, and none of the three answers it might provide would be correct anyway. Obviously we need more than 3 topics to encapsulate the wide range of headlines found on Reddit.
Recall that we collected 123,121 headlines from 29 different subreddits, but not in equal number.
Below we have visualized the 5 words that appear most often in the headlines of each of these 29 subreddits:
Once we include all 29 subreddits, our topics are not as clearly divided. For example, the word "trump" is dominant in several disparate subreddits, including /r/conspiracy, /r/environment and /r/worldnews. The word "christmas" is significant in /r/comics, /r/frugal and /r/gaming. Although these subreddits are not necessarily related, given the time at which this data was collected (December 14 through January 12) we can understand why the word "christmas" is appearing so frequently here.
We requested 29 topics from our new LDA model, the results of which have been visualized below. As with our first model, we treated all headlines from a single subreddit as a single "document" while training this LDA model.
Recall that once the LDA model has been trained, our gamma value (Γ) can be used to determine the dominant LDA topic of each subreddit. Below we have visualized each subreddit along with its dominant LDA topic number and gamma value (Γ):
Unlike our first model, we no longer see a perfect assignment of LDA topics to subreddits. Here our 29 subreddits were grouped into only 22 LDA topics, leaving 7 unused. Below we have visualized these LDA topics, which include their assigned subreddits, which in turn include the 15 words that appear most often in their headlines. The size of each "bubble" corresponds to the word counts in each subreddit.
Some of these groupings make intuitive sense. For example, topic 5 contains both /r/food and /r/cooking. Other groups aren't as clear. For example, topic 14 contains /r/Fitness, /r/environment and /r/science. We can understand why /r/environment and /r/science might be grouped together, but what is /r/Fitness doing in there? This can be partially explained by the low number of headlines from these subreddits relative to the others.
Let us take a closer look at the LDA topics that were not assigned as the dominant topic of any subreddits.
It looks like LDA topic 4 is a perfect fit for /r/StarWars. Why did our algorithm pick LDA topic 20 instead?
According to our model, a bit more than half of the words in /r/StarWars belong in topic 20, while a bit less than half belong in topic 4. This uncertainty is reflected by the relatively low gamma (Γ) of our /r/StarWars LDA topic assignment:
Remember: it is not always our intent to find a single dominant topic for a single given "document". When using this sort of "winner take all" approach to LDA topic assignment, we must keep an eye on gamma (Γ). Not all topic assignments are created equal. To help better visualize this, below we compare the 30 most common words in /r/StarWars headlines to their relative distributions (β) in LDA topics 4 and 20.
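That caution can be expressed as a simple guard: pick the dominant topic, but flag any assignment whose gamma falls below a chosen threshold. The gamma rows and the 0.7 cutoff below are hypothetical, echoing the /r/StarWars case where two topics split the document roughly in half:

```python
# Hypothetical per-document gamma rows (one probability per topic).
doc_gammas = {
    "politics": [0.01, 0.98, 0.01],
    "StarWars": [0.46, 0.00, 0.54],  # two near-equal topics
}

THRESHOLD = 0.7  # arbitrary cutoff for a "confident" assignment

for doc, gammas in doc_gammas.items():
    topic = max(range(len(gammas)), key=lambda t: gammas[t])
    flag = "" if gammas[topic] >= THRESHOLD else "  <- weak dominant topic, inspect"
    print(f"{doc}: topic {topic} (gamma = {gammas[topic]:.2f}){flag}")
```

A low winning gamma does not mean the model failed; it means "winner take all" is discarding real information about the document's mixture.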
As we can see in the above visualization, LDA topics 4 and 20 are redundant.
Why did our LDA model create two topics about Star Wars here? We supplied this model with 123,121 headlines as training examples. Of these, 8,118 came from the /r/StarWars subreddit (roughly 7%). We asked our LDA model for 29 topics, and it provided 2 based on Star Wars (roughly 7%).
The subreddits with the largest number of headlines (/r/gaming, /r/politics and /r/Bitcoin) also resulted in redundant, unused LDA topics. We can see 2 unused LDA topics that would be a good match for /r/gaming, which is where 16,560 (roughly 13%) of our total headlines came from.
How we define our "documents", and the distribution of our training data, will ultimately determine the topics found by Latent Dirichlet allocation. The value chosen for k decides the number of topics created. If k is too high, we will find redundant topics. If k is too low, topics will merge together into messy jumbles of unrelated words.
Use the interactive visualizations below to break these 29 subreddits into different LDA topics, using the same methodology above. As we choose fewer and fewer LDA topics, our subreddits get clumped together into fewer and fewer groups. Pay attention to changes in gamma (Γ) as you try different k values.
Use the below visualization to take a closer look. Click on topics to see which subreddits have been grouped together. Click on subreddits to see their top 15 words. Which of these groupings make intuitive sense? Which of these groupings are clearly wrong? Do the corresponding gamma values (Γ) in the above visualization match your human intuition?
How many topics should we use to categorize these 29 subreddits? When it comes to Latent Dirichlet allocation, there is no perfect answer to this question. Successfully implementing an LDA algorithm in a real-world application requires some human intuition, and a lot of trial and error.
Your real-world solutions will depend largely on the context of your problems. Are you seeking a few, clearly distinguished topics? Would you prefer a more nuanced approach, where each document is viewed as a combination of different topics? How are you defining "documents"? What is the existing balance of topics in these "documents"? Have you spent enough time on your text preprocessing? Are there unusual stop words or other HTML artifacts to consider?
Most importantly: do your LDA results make any sense?
Whatever your application, the choices made in training will have a profound impact on the ultimate success of your project. A deeper understanding of Latent Dirichlet allocation will help you get to a better model, faster.
Blei, Ng, Jordan. (2003) Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
Blei, et al. (2017) Package 'topicmodels'. https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf
 Řehůřek. (2017) gensim: topic modeling for humans. https://radimrehurek.com/gensim/index.html
Robinson, Silge. (2017) Example: the great library heist. In Text Mining with R: A Tidy Approach (section 6.2). Sebastopol, CA: O'Reilly. (ISBN-13: 978-1491981658; ISBN-10: 1491981652) https://www.tidytextmining.com/topicmodeling.html#library-heist