Introduction

Latent Dirichlet allocation (LDA) is an unsupervised machine learning algorithm that attempts to automatically organize a corpus of documents into a set of coherent topics. The applications of this algorithm are numerous. Properly applied, LDA can identify the various topics discussed in any single document, categorize individual documents by their most dominant topic, or suggest documents that are intuitively related to each other.

LDA was originally proposed in a paper by David Blei, Michael Jordan (not that one) and Andrew Ng (my hero) back in 2003[1]. It has been used in all sorts of academic and commercial projects since then. LDA is supported in R via the "topicmodels" package[2] and in Python via the "gensim" package[3], making it a welcome addition to the toolbox of every machine learning engineer working today.

In this article we attempt to provide the "big picture": we visualize what LDA does, look at examples where it succeeds, and identify situations where it struggles. In a future article, we may dig deeper into the mathematics behind LDA to better understand its nuances, but here we are focused on the "40,000 foot view".

Our Data

Reddit, "the front page of the Internet", is a popular website that allows its users to share content into a series of subreddits. Each subreddit focuses on a specific niche of content, making it an ideal place to study the categorization of topics. The data analyzed below was collected between December 14, 2017 and January 12, 2018.

Data collected from Reddit

We collected 123,121 headlines from 29 different subreddits, as shown above. We gathered content from the "new" section of each subreddit, which includes both popular and unpopular content. Below we have visualized the 15 words that appear most often in the headlines of each of these 29 subreddits. The size of each "bubble" corresponds to how often that word appears in the subreddit's headlines.

Interactive: Click on subreddits below to see the 15 most common words in each
Visualization in D3.js: Zoomable Circle Packing by Mike Bostock

Although adequate for the purposes of this article, this data is still in rough form. Before doing a final LDA analysis, we would need to spend more time on our text preprocessing. For example, the "relationships" subreddit contains odd words (e.g. "21f", "22f") due to the syntax of submissions (which includes age and gender). Beyond unnecessary stop words, the raw data contain all sorts of odd text fragments (e.g. "----------") that should also be removed.
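To make that cleanup concrete, here is a minimal preprocessing sketch in Python using the gensim package mentioned above. The clean_headline helper and its regular expressions are our own illustrative assumptions, not the exact pipeline behind this article:

```python
import re

from gensim.parsing.preprocessing import STOPWORDS  # gensim's built-in English stop word list


def clean_headline(headline):
    """Lowercase and tokenize a headline, dropping stop words and odd fragments."""
    tokens = re.findall(r"[a-z0-9]+", headline.lower())               # crude tokenizer; skips "----------"
    tokens = [t for t in tokens if t not in STOPWORDS]                # remove stop words ("the", "and", ...)
    tokens = [t for t in tokens if not re.fullmatch(r"\d+[mf]?", t)]  # drop "21f", "22m" and bare numbers
    return tokens


# clean_headline("My boyfriend [22M] dumped me [21F] ----------")
# -> ['boyfriend', 'dumped']
```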

Most important to consider: the amount of data we've collected from subreddits such as "business" and "science" is dwarfed by the amount collected from subreddits such as "politics" and "gaming". Given such an imbalanced source of data, the topics selected by LDA may include some odd choices (more on this later).

How LDA Works

Below we have graphed the most common 30 words that appear in the headlines of the "Bitcoin", "gaming" and "politics" subreddits, along with their corresponding word counts. We have more than 10,000 headlines from each of these subreddits, and the topics of each subreddit are very different, as evidenced by the most common words in each.

Top words per subreddit

In the /r/Bitcoin subreddit you will find discussions of "coinbase" and "blockchain". In the /r/gaming subreddit you will find discussions of "xbox" and "mario". In the /r/politics subreddit you will find discussions of "trump" and "mueller". This makes intuitive sense. Our human brains can easily understand that these groups of words clearly represent three distinct topics.

This is essentially how the LDA algorithm works: it analyzes word distributions to divide a corpus into k distinct topics. What makes LDA so powerful is that it can perform this task without any prior categorization by humans.

Defining our documents

One of the assumptions made by Latent Dirichlet allocation is that words which often appear together in the same document likely belong to the same topic. We must remain mindful of this when deciding how to define our "documents". For example, when it comes to this Reddit data, we might be tempted to treat each individual headline as an individual "document".

However, most headlines are extremely short, and they rarely repeat words of significance. That is, aside from common stop words (e.g. "the", "and", "be"), most of the words used in a given headline are unique. Consider the headline:

Durbin calls on White House to release tapes of Trump's remarks about African countries

Because each word is used only once, no word has a count greater than any other (they all equal 1). With no distinctive word distribution to work from, LDA cannot detect any useful pattern, and it will fail to provide useful results.

Summary of the LDA algorithm
Graphical model representation of LDA. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. [1]

In section 6.2 of "Text Mining with R: A Tidy Approach"[4], authors Silge & Robinson offer a clever idea for testing the accuracy of LDA. They took the complete text of four books, divided each into chapters, and treated each chapter as a unique "document". They then trained an LDA model to define 4 topics, categorized each "document" (chapter) into its dominant topic, and measured how many "documents" (chapters) were correctly matched to the book they originally came from.

Their LDA model only misclassified 2 chapters. It correctly classified all of the remaining chapters from all four books.

Inspired by the work of Silge & Robinson, we treated all of the headlines from each subreddit as a single "document" when training the following LDA models. This provides a word distribution unique enough for LDA to perform properly.
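As a rough sketch of that setup in Python with gensim: the headlines_by_subreddit mapping, the clean_headline helper from the earlier snippet, and the hyperparameters below are all illustrative assumptions, not the exact configuration used for this article.

```python
from gensim import corpora, models

# One "document" per subreddit: all of its headlines flattened into a single token list.
# headlines_by_subreddit is assumed to map each subreddit name to its list of raw headlines.
documents = {
    sub: [token for headline in headlines for token in clean_headline(headline)]
    for sub, headlines in headlines_by_subreddit.items()
}

dictionary = corpora.Dictionary(documents.values())
corpus = [dictionary.doc2bow(tokens) for tokens in documents.values()]

# For the first model: 3 subreddit-level documents, 3 requested topics.
lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=3, passes=10, random_state=42)

for topic_id, top_words in lda.print_topics(num_words=10):
    print(topic_id, top_words)
```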

Understanding our LDA model

We trained our first LDA model to define 3 unique topics, one for each of the 3 subreddit-level "documents" we defined above. Given these ideal training circumstances, our LDA model produced 3 topics that map neatly onto our 3 subreddits. We can see that the word distributions (β) of our LDA topics are a near-perfect match for the word counts of the original subreddits.

First LDA model matches perfectly

When we sort the top 30 words in our LDA topics according to word distribution (β), we see the same words that top the corresponding subreddits' word counts. Here we compare /r/gaming and topic 1 side by side:

/r/gaming and LDA topic 1

It is worth noting that the topics returned by LDA have no meaningful names. Here we can see that topic 1 corresponds to "gaming", but that topic number is completely arbitrary, assigned based on the order in which LDA defined each topic, which in turn depends on a random initialization. In other words, "gaming" could just as easily have been assigned to topic 2 or 3. And depending on the quality of your LDA model, the contents of the topics themselves may be somewhat arbitrary too.

This is where data science becomes more art than science.

Deciding on the right number of LDA topics, and deciding whether the resulting topics make any sense to a human reader, still requires a significant amount of human intuition, or failing that, trial and error. We avoided those issues here by supplying 3 distinct, pre-existing, human-defined categories and asking for 3 distinct LDA topics in return. This makes for a clear demonstration, but real-world applications of LDA are rarely this simple. In some cases we want a more nuanced view; for example, we may wish to look at the top 2 or 3 topics within each "document".

Our goal here is to divide each subreddit into one of our 3 corresponding topics, and we do not want these topics to overlap. Once the LDA model has been trained, our gamma value (Γ) measures the distribution of LDA topics within each "document".

The gamma of each LDA topic as it corresponds to each subreddit

Here we find that nearly 100% of the words in our /r/Bitcoin, /r/gaming and /r/politics "documents" belong to LDA topics 3, 1 and 2 respectively. In other words, this LDA model has successfully matched its topics to our original subreddits, just as we intended.
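In gensim, the analogous per-document topic distribution can be read directly off the trained model. A sketch, reusing the illustrative documents, corpus and lda objects from the earlier snippet:

```python
# Per-document topic distribution: the gensim analogue of the gamma (Γ) values above.
for subreddit, bow in zip(documents.keys(), corpus):
    topic_probs = lda.get_document_topics(bow, minimum_probability=0)
    dominant_topic, gamma = max(topic_probs, key=lambda tp: tp[1])
    print(f"{subreddit}: topic {dominant_topic} ({gamma:.1%} of words)")
```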

In addition to looking at topics-per-document, LDA also allows us to look at topics-per-word:

The beta LDA clearly identifies which topic certain words belong to

The word "mario" comes from /r/gaming, "mueller" belongs in /r/politics, and "segwit" is a term from /r/Bitcoin. This makes intuitive sense. We can also see how confident this distribution is. The only word even remotely contested is "mario", which has a minuscule chance of belonging in /r/Bitcoin according to this LDA model.

Categorizing new headlines with our LDA model

Recall the example headline from above:

Durbin calls on White House to release tapes of Trump's remarks about African countries

Although this headline was not used in training, our LDA model has already encountered many of these words:

Durbin calls White House release tapes Trump remarks African countries

With the exception of the word "release", which has been wrongly classified to /r/gaming, all of these words clearly belong to /r/politics.

Not only can we see which words are classified into which topic, we can also see the distribution of each word, which lets us measure the relative significance of each classification. Note the different scales of word distribution (β) in the above visualization. We can see that the word "release", the most confused word in this example headline, is insignificant when compared to the scale of the word "trump".
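To score an unseen headline like this one against a trained gensim model, we can reuse the existing dictionary; clean_headline, dictionary and lda are the illustrative objects from the earlier sketches:

```python
headline = ("Durbin calls on White House to release tapes "
            "of Trump's remarks about African countries")

bow = dictionary.doc2bow(clean_headline(headline))

# Topic mixture for the headline as a whole (words unknown to the dictionary are simply ignored).
print(lda.get_document_topics(bow, minimum_probability=0))

# Per-word topic membership (the beta view) for a couple of individual words.
for word in ["trump", "release"]:
    if word in dictionary.token2id:
        print(word, lda.get_term_topics(dictionary.token2id[word], minimum_probability=0))
```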

Where LDA Struggles

What happens when we give the above LDA model a headline that has nothing to do with Bitcoin, gaming or politics?

Consider a headline from the /r/StarWars subreddit:

Liam Neeson seems open to returning as Qui-Gon Jinn in Obi-Wan Kenobi movie

Our current LDA model only recognizes two of these words:

returning movie

Liam Neeson seems open to returning as Qui-Gon Jinn in Obi-Wan Kenobi movie

Consider a headline from the /r/personalfinance subreddit:

Is there any valid reason to not max out retirement contributions if you can afford it?

Our current LDA model only recognizes four of these words:

reason max retirement afford

Is there any valid reason to not max out retirement contributions if you can afford it?

Our current LDA model isn't quite sure what to do with these, and none of the three answers it might provide would be correct anyway. Obviously we need more than 3 topics to encapsulate the wide range of headlines found on Reddit.

Defining 29 LDA topics from 29 subreddits

Recall that we collected 123,121 headlines from 29 different subreddits, but not in equal number.

Data collected from Reddit

Below we have visualized the 5 words that appear most often in the headlines of each of these 29 subreddits:

Top 5 Words by Subreddit

Once we include all 29 subreddits, our topics are not as clearly divided. For example, the word "trump" is dominant in several disparate subreddits, including /r/conspiracy, /r/environment and /r/worldnews. The word "christmas" is significant in /r/comics, /r/frugal and /r/gaming. Although these subreddits are not necessarily related, given the time at which this data was collected (December 14 through January 12), we can understand why the word "christmas" appears so frequently here.

We requested 29 topics from our new LDA model, the results of which have been visualized below. As with our first model, we treated all headlines from a single subreddit as a single "document" while training this LDA model.

Top 5 Words by LDA Topic

Which subreddit matches which LDA topic?

Recall that once the LDA model has been trained, our gamma value (Γ) can be used to determine the dominant LDA topic of each subreddit. Below we have visualized each subreddit along with its dominant LDA topic number and gamma value (Γ):

Gamma of Subreddit LDA Topic

Unlike our first model, we no longer see a perfect one-to-one assignment of LDA topics to subreddits. Here our 29 subreddits were grouped into only 22 LDA topics, leaving 7 unused. Below we have visualized these LDA topics, along with their assigned subreddits and the 15 words that appear most often in each subreddit's headlines. The size of each "bubble" corresponds to how often that word appears in the subreddit's headlines.

Interactive: Click on LDA topics below to see the subreddits assigned to them
Visualization in D3.js: Zoomable Circle Packing by Mike Bostock

Some of these groupings make intuitive sense. For example, topic 5 contains both /r/food and /r/cooking. Other groups aren't as clear. For example, topic 14 contains /r/Fitness, /r/environment and /r/science. We can understand why /r/environment and /r/science might be grouped together, but what is /r/Fitness doing in there? This can be partially explained by the low number of headlines from these subreddits relative to the others.

What happened to the unused LDA topics?

Let us take a closer look at the LDA topics that were not assigned as the dominant topic of any subreddit.
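One way to find them programmatically, sketched with gensim and assuming a 29-topic model (lda29) trained over the same subreddit-level documents (corpus29), both names being illustrative:

```python
# Dominant LDA topic for each subreddit-level document, and the topics never chosen as dominant.
dominant_topics = set()
for subreddit, bow in zip(documents.keys(), corpus29):
    topic_probs = lda29.get_document_topics(bow, minimum_probability=0)
    best_topic, gamma = max(topic_probs, key=lambda tp: tp[1])
    dominant_topics.add(best_topic)

unused = sorted(set(range(lda29.num_topics)) - dominant_topics)
print(f"{len(unused)} unused topics: {unused}")
```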

Top 5 Words of Unused LDA Topics

It looks like LDA topic 4 is a perfect fit for /r/StarWars. Why did our algorithm pick LDA topic 20 instead?

Gamma of /r/StarWars

According to our model, a bit more than half of the words in /r/StarWars belong in topic 20, while a bit less than half belong in topic 4. This uncertainty is reflected by the relatively low gamma (Γ) of our /r/StarWars LDA topic assignment:

Gamma of /r/StarWars in context

Remember: it is not always our intent to find a single dominant topic for a single given "document". When using this sort of "winner take all" approach to LDA topic assignment, we must keep an eye on gamma (Γ). Not all topic assignments are created equal. To help better visualize this, below we compare the 30 most common words in /r/StarWars headlines to their relative distributions (β) in LDA topics 4 and 20.

/r/StarWars v. Topic 4 v. Topic 20

As we can see in the above visualization, LDA topics 4 and 20 are redundant.

Why did our LDA model create two topics about Star Wars here? We supplied this model with 123,121 headlines as training examples. Of these, 8,118 came from the /r/StarWars subreddit (roughly 7%). We asked our LDA model for 29 topics, and it provided 2 based on Star Wars (roughly 7%). In other words, the share of topics devoted to Star Wars roughly mirrors the share of training data that came from /r/StarWars.

The subreddits with the largest number of headlines (/r/gaming, /r/politics and /r/Bitcoin) also resulted in redundant, unused LDA topics. We can see 2 unused LDA topics that would be a good match for /r/gaming, which is where 16,560 (roughly 13%) of our total headlines came from.

Unused /r/gaming, /r/politics and /r/Bitcoin LDA topics
Data collected from Reddit

How we define our "documents", and the distribution of our training data, will ultimately determine the topics found by Latent Dirichlet allocation. The value chosen for k decides the number of topics created. If k is too high, we will find redundant topics. If k is too low, topics will merge together into messy jumbles of unrelated words.
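Picking k is ultimately a judgment call, but one common heuristic is to train models at several values of k and compare a topic-coherence score. Below is a sketch using gensim's CoherenceModel, reusing the illustrative documents, dictionary and corpus objects from earlier; the candidate k values are arbitrary:

```python
from gensim.models import CoherenceModel, LdaModel

texts = list(documents.values())  # tokenized subreddit-level documents

for k in (10, 20, 29, 40):
    model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=k, passes=10, random_state=42)
    score = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    print(f"k={k}: coherence={score:.3f}")
```

Coherence scores are only a rough guide; as discussed below, the final sanity check is whether the resulting topics make sense to a human reader.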

How Many Topics?

Use the interactive visualizations below to break these 29 subreddits into different LDA topics, using the same methodology above. As we choose fewer and fewer LDA topics, our subreddits get clumped together into fewer and fewer groups. Pay attention to changes in gamma (Γ) as you try different k values.

Interactive: Use the slider below to split subreddits into different LDA topics

Number of LDA Topics: 29

Gamma of Subreddit - 29 LDA Topics

Use the visualization below to take a closer look. Click on topics to see which subreddits have been grouped together. Click on subreddits to see their top 15 words. Which of these groupings make intuitive sense? Which of these groupings are clearly wrong? Do the corresponding gamma values (Γ) in the above visualization match your human intuition?

Interactive: Use the slider below to split subreddits into different LDA topics

Number of LDA Topics: 29

Visualization in D3.js: Zoomable Circle Packing by Mike Bostock

Conclusion

How many topics should we use to categorize these 29 subreddits? When it comes to Latent Dirichlet allocation, there is no perfect answer to this question. Successfully implementing an LDA algorithm in a real-world application requires some human intuition, and a lot of trial and error.

Your real-world solutions will depend largely on the context of your problems. Are you seeking a few, clearly distinguished topics? Would you prefer a more nuanced approach, where each document is viewed as a combination of different topics? How are you defining "documents"? What is the existing balance of topics in these "documents"? Have you spent enough time on your text preprocessing? Are there unusual stop words or other HTML artifacts to consider?

Most importantly: do your LDA results make any sense?

Whatever your application, the choices made in training will have a profound impact on the ultimate success of your project. A deeper understanding of Latent Dirichlet allocation will help you get to a better model, faster.

Works Cited

[1] Blei, Ng, Jordan. (2003) Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

[2] Blei, et al. (2017) Package ‘topicmodels’. https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf

[3] Řehůřek. (2017) gensim: topic modeling for humans. https://radimrehurek.com/gensim/index.html

[4] Silge, Robinson. (2017) Example: the great library heist. In Text Mining with R: A Tidy Approach (section 6.2). Sebastopol, CA: O'Reilly. (ISBN-13: 978-1491981658; ISBN-10: 1491981652) https://www.tidytextmining.com/topicmodeling.html#library-heist
