Introduction

Decision trees are awesome. Simple yet effective, they are easily visualized, intuitively understood, and a great place to start when trying to understand what this artificial intelligence stuff is all about.

Randall Munroe explains how it's done
In all fairness, it requires some calculus too...
(Comic by Randall Munroe at XKCD)

Right now, out in the real-world, decision trees are being used to predict which customers will default on a loan, which credit card transactions are fraudulent, and which stocks are a good buy this week. This technology is already embedded all around us. Smart corporations have been using this stuff for years, and now governments are getting in on the action. As these systems become more sophisticated, and more embedded in every aspect of our daily lives, being an informed citizen means having at least a basic understanding of this stuff.

In this article we are going to try to build a machine that can successfully predict whether or not a person agrees with White Nationalists based on their responses to a simple poll.

Spoiler Alert: it doesn't work

We're also going to dig deeper into the logic behind decision trees, so we can better understand what they can do, and just as importantly, what they can't do.

Why White Nationalists?

For those unfamiliar with the term "White Nationalists", I'll say this: I hesitate to call them Nazis outright in this article, but based on what I have seen after a cursory Google Image search, they seem to be fond of swastikas. Half of my family is Jewish. Ergo, I tend to get a little nervous whenever I see a swastika. I have a very strong anti-swastika bias, and I figure it's only fair to get that out of the way before we begin.

A Thought Experiment

Imagine it is the year 1995. A terrorist who identifies as a "White Nationalist" has just killed 168 people and injured another 680. The President proposes using machine learning to identify other White Nationalists, so that we can prevent further attacks.

Do you agree or disagree with this proposal?

And now we get to the real heart of the subject: beyond math, beyond code, this speaks directly to the very questions of right and wrong. As this technology becomes an essential (and valuable) part of our Democracy, it will require well informed and active citizens to keep this train on the tracks. And if you yourself are working as a machine learning engineer, it is even more critical for you to be aware of the ethical issues surrounding what we do.

As usual, we begin with the data:

Our Data

While looking around for some interesting data, I stumbled upon Cards Against Humanity's Pulse of the Nation. In October 2017 they polled 1,000 people with some fascinating questions, including:

  • Have you lost any friendships or other relationships as a result of the 2016 presidential election?
  • Do you think it is likely or unlikely that there will be a Civil War in the United States within the next decade?
  • Who would you prefer as president of the United States, Darth Vader or Donald Trump?
    If you think 2016 was rough, you ain't seen nothin' yet...

Here is the question that inspired me to write this article:

  • From what you have heard or seen, do you mostly agree or mostly disagree with the beliefs of White Nationalists?

I was shocked to find that 95 out of the 1,000 people polled (9.5%) answered "Agree" to the above question. Who are these people? Which characteristics do they share? Is it possible to predict whether or not a person "agrees with the beliefs of White Nationalists" based on their other responses to this survey?

The raw data has been provided via Kaggle, and all of the code used to create the visualizations and other analysis in this article is available via our GitHub repository.

Exploratory Data Analysis

This data is mostly comprised of answers to multiple-choice questions. Most features are categorical, a.k.a. discrete, which allows for an intuitive understanding of how a decision tree works. However, decision trees can also handle continuous features, such as age and income.

Below we have visualized this relationship:

Interactive: Hover over data for details or use Bokeh tools to manipulate the plot

The goal of our decision tree is to separate the red dots (those who "mostly agree with the beliefs of White Nationalists") from the blue dots (everyone else). Looking at age (the x-axis) we can see the red group spans from 18 to 83. There is no obvious way to distinguish our red dots based on age alone. However, when we look at income (the y-axis) we notice something interesting. There is not a single member of the red group who reports an income above $100,000 per year.

Reduced to a categorical feature, this is made nice and clear:

Interactive: Use tabs to switch between plots

When we visualize the results of a poll, we are usually interested in the number of people in each group. This is what we see in the first tab (Count). In terms of number of people the "No Answer" group is the largest by far (700 out of 1,000). However, our decision tree isn't interested in which group has the most responses; our decision tree is interested in how those groups split between our red and blue dots.

If we switch to the second tab (Purity) the red in these graphs now represents the relative percentage of red dots in each group. This is essentially how a decision tree decides how to classify something:

Example Decision Tree Visualization

In the above example we ask: "Do you earn $100,000 or more per year?". If the person responds "Yes" to that question (green arrow) we find there is a 0% probability of them belonging to the red group (0 out of 82) and can therefore classify them as part of the blue group. If they say "No" or refuse to answer (red arrow) we must continue on to the next question (to the next branch of our decision tree). We find that 95 of the 918 people in this branch (~10%) belong to the red group. We must look at additional features to distinguish them further.

Additional Demographics

When we look at our other demographic features we find some odd things:

Interactive: Use tabs to switch between plots

One might assume a higher percentage of Republicans would "agree with White Nationalists", but we find a near identical portion of Democrats, and a larger portion among Independents. On the "Gender" tab we find the highest percentage of our red dots among "Other" and "No Answer". On the "Education" tab we find the clearest split yet. Those who have attended some college, gotten degrees, or gone to grad school, are far less likely to belong to the red group. When we look at those who didn't go to college (or didn't answer) we find more than 1 in 5 of them agree with White Nationalists.

Perhaps most baffling is the "Race" tab. I assumed the "White" group was a no-brainer, and yet, only 8.49% of those who identified as "White" also "Agree" with White Nationalists. Compare this to ~13% of "Latino" and 16% of "Asian" respondents.

This raises an obvious question: why are Asians, Latinos and those who reject the gender binary more likely to agree with White Nationalists? Do these groups have a different interpretation of "the beliefs of White Nationalists"? Are people just giving us bogus answers? Data is never perfect, and this set is no exception.

Questions and More Questions

This is a fun data set because it offers us categorical features beyond the typical demographic stuff like age, gender and race. Remember: when we talk about "more red dots" in a given group we are speaking to the relative percentage in each group. Hover over the graph to compare these percentages with exact counts.

Interactive: Use tabs to switch between plots

Those who approve of Donald Trump's performance as President (Q1) are more likely to agree with White Nationalists than those who disapprove, but the difference between these groups is far smaller than one might expect. More than 95% of respondents reported that they love America (Q2), but the remaining 5% are less likely to be red dots. When it comes to government policies to help the poor (Q3) we find a nearly equal split.

If you think most white people in America are racist (Q4), you have a significantly greater probability of agreeing with White Nationalists. If you lost friends over the 2016 presidential election (Q5) you are less likely to be a red dot.

Interactive: Use tabs to switch between plots

If you think it is likely there will be another Civil War within the next decade (Q6) you are significantly more likely to agree with White Nationalists. When it comes to hunting (Q7) we find another nearly perfect split. If you've ever eaten a kale salad (Q8) you are less likely to be a red dot, compared to the anti-kale group.

When asked if they would vote for Dwayne "The Rock" Johnson, assuming he ran for President (Q9), we find that those who answered "Yes" are considerably more likely to agree with White Nationalists. This raises the question: are these 36 people aware that Mr. Johnson isn't white? Those with no opinion on the matter are least likely to be red dots.

On the subject of hypothetical politicians, these same people were asked if they would prefer President Donald J. Trump or President Darth Vader (Q10). Those who prefer Darth Vader are slightly more likely to agree with White Nationalists. One wonders: is this because they prefer those snazzy white costumes he makes his storm troopers wear?

The Decision Tree

In the above visualizations we talked about the relative percentages of those who agree with White Nationalists as we split them into different groups. This allowed us to intuitively understand the way things split in terms of probability. When it comes to decision trees, we can measure the value of these splits in terms of entropy and information gain.

To help better understand this let's look at a simplified example:

Example Decision Tree Split

We begin with 905 people who do not agree with White Nationalists, and 95 who do. We split them into two perfect groups, with all of our blue dots on the left, and all of our red dots on the right. We see that both of these groups have zero entropy because all members of these nodes belong to the same class. We define our information gain as the difference between our starting entropy (0.453) and the weighted average of the entropy of all child nodes (in this case zero).
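To make this arithmetic concrete, here is a minimal sketch (plain Python, not from the original repository) that reproduces the numbers above for the 905/95 split:

import math

def binary_entropy(p):
    # entropy of a two-class node, where p is the fraction of red dots
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

parent = binary_entropy(95 / 1000)     # ~0.453
# perfect split: one child is all blue (p=0), the other all red (p=1)
children = (905 / 1000) * binary_entropy(0) + (95 / 1000) * binary_entropy(1)   # 0.0
information_gain = parent - children   # ~0.453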

Using sklearn

Warning: We are about to dig into some nitty-gritty details about Python and scikit-learn

I am generating these decision trees using the scikit-learn library. The code is elegant and simple:

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

We import the library, pass our features (X) and labels (Y), and fit our model. What could be simpler?

As usual, the devil is in the details.

The sklearn library is extremely flexible. It can handle a list of lists (as seen in the sample code above), or a Pandas data frame, or even a NumPy matrix. However, sklearn's DecisionTreeClassifier cannot handle anything that is not a number. Even a single NaN will force it to choke harder than a chain smoker with pneumonia.
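As a minimal sketch of one common workaround (using a hypothetical toy data frame, not the actual survey data), missing values can be dealt with before fitting:

import numpy as np
import pandas as pd

# hypothetical toy frame with one missing income value
df = pd.DataFrame({"age": [43, 26, 61], "income": [52000, np.nan, 110000]})

# option 1: replace NaN with a sentinel value so DecisionTreeClassifier won't choke
clean = df.fillna(0)

# option 2: drop the offending rows entirely
# clean = df.dropna()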

Consider this: most of the data we have here is not numeric at all. It consists of multiple-choice questions with categorical (discrete) features. How do we translate this text into numbers so that DecisionTreeClassifier can work its magic?

The key is to think in boolean terms.

Recall the question: "Do you earn more than $100,000?"

In order to translate this into a numeric representation, I simply assigned 1 to those in the "Yes" group, and 0 to everyone else.
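Here is a sketch of that translation, using a hypothetical toy frame (the real construction lives in the repository; this produces something like the earn100k feature mentioned later):

import pandas as pd

# hypothetical frame holding the raw income responses
df = pd.DataFrame({"income": [52000, 110000, 0, 250000]})

# 1 for the "yes, $100,000 or more" group, 0 for everyone else
df["earn100k"] = (df["income"] >= 100000).astype(int)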

Example Decision Tree Split

In the perfect example above, "Feature X" is simply 1 for those who "agree with White Nationalists" and 0 for everyone else. This numeric vector is identical to our labels. Ergo our decision tree can see a perfect 1:1 correlation between "Feature X" and the target label, and split perfectly. The question asks: "Is Feature X ≤ 0.5?". Those with "Feature X" = 0 are "True" (blue) and those with "Feature X" = 1 are "False" (red).

Warning: beware of the default DecisionTreeClassifier parameters

Most classes I have seen teach decision trees based on entropy and information gain, which is why I have done the same here. However, the default DecisionTreeClassifier criterion is Gini impurity, which is mathematically different from entropy (follow the link for the formulas). When it comes to Python, we can specify entropy as follows:

clf = tree.DecisionTreeClassifier(criterion="entropy")
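For a rough sense of how the two criteria differ, here is a small sketch (again, not from the original code) computing both impurity measures for the 95-in-1,000 root node. Both are zero for a pure node; they simply weigh mixed nodes differently:

import math

p = 95 / 1000   # fraction of red dots at the root

gini_impurity = 1 - p**2 - (1 - p)**2                       # ~0.172
entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)    # ~0.453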

There are a few other "gotchas" to be aware of when it comes to tuning these parameters, which we will discuss below.

Defining Our Features

In the above example we saw a perfect split based on "Feature X". However, we are cheating. The infamous "Feature X" is simply our labels. In the real-world we need to use other data to predict these classifications automatically, without being told the correct answer ahead of time.

If you look around line 50 in 002_decision_tree.py you will see I defined 20 boolean features from our poll data, and kept the two continuous numerical features the data already contained (age and income). I was extremely selective in my definitions. I relied on the above exploratory data analysis, and my own intuition, to define features I believed would do the best job at classifying those who "agree with White Nationalists".

Take the time to dig into the code. If you do, you will see how data science can quickly become more art than science. To offer a specific example: I created only 4 boolean features based on race: "White", "Latino", "Black" and "Asian". I intentionally ignored other groups with the intent of making the resulting decision tree as simple and intuitive as possible, while forfeiting the least amount of relevant data. As we add features, the complexity of the model increases geometrically. As complexity increases, we run the risk of losing ourselves to a "black box", unable to clearly explain what our own models are doing anymore.

One Hot Encoding

It is not necessary to manually define these boolean features as I have done above. For example: imagine we had a Pandas data frame (dirty) with three categorical columns, i.e. three multiple-choice questions, each of which accepted three discrete answers: "Yes", "No" and "No_Answer".

The following two lines of code are all you need:

import pandas as pd
clean = pd.get_dummies(dirty, columns=["Column1","Column2","Column3"])

The result of this code will be 9 new features, all boolean. "Column1_Yes" will be set to 1 for those who answered "Yes" to the question in "Column1", "Column3_No_Answer" will be set to 1 for those who answered "No_Answer" to the question in "Column3", and so on. We can then pass our clean data frame directly to our DecisionTreeClassifier, and it will use the selected criterion to automatically determine which features are most important.
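A quick sketch with a hypothetical toy frame standing in for dirty shows the shape of the output:

import pandas as pd

dirty = pd.DataFrame({
    "Column1": ["Yes", "No", "No_Answer"],
    "Column2": ["No", "No", "Yes"],
    "Column3": ["No_Answer", "Yes", "No"],
})

clean = pd.get_dummies(dirty, columns=["Column1", "Column2", "Column3"])
print(list(clean.columns))
# ['Column1_No', 'Column1_No_Answer', 'Column1_Yes', ..., 'Column3_No_Answer', 'Column3_Yes']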

Growing A Full Tree

Let's begin with two features we discussed above: (a) Does a person earn $100,000 or more? (b) Have they gone to college?

Partial Decision Tree

The boxes above are still shades of blue because the red dots remain a minority in all of them. We can see two boxes with perfect purity (entropy = 0) where we have successfully separated all red dots. When we use this model to find members of our existing data who "agree with White Nationalists" we find it can predict them with a 90.5% accuracy, given just these two features. That sounds pretty good, doesn't it?

Spoiler Alert: it's not

Remember: out of 1,000 people, only 95 "agree with White Nationalists". This is the challenge of imbalanced classes. Consider the following function, which can also predict with a 90.5% accuracy:

def my_brilliant_function():
    # do absolutely nothing
    return(0)

So our model needs more features in order to improve accuracy. But which features are most important? And what is the best accuracy we can achieve?

Warning: income and earn100k are redundant features so we kept the continuous one (income)
Interactive: Use tabs to switch between plots

In the above visualization (Information Gain) we see the importance of each feature in our model. Because these features are being used in tandem, their value is very different here compared to our above exploratory data analysis, where we viewed each feature independently.

In the second tab (Cumulative Accuracy) we see how our overall accuracy peaks at 99.8% after adding the best 12 features. Try as we might, we continue to misclassify the last 2 people, no matter how many of our features we add. However, despite these shortcomings, the following decision tree is able to correctly classify 998 out of 1,000 poll respondents. Click on the image below to view the full decision tree as a PDF:

Full Decision Tree

Getting 998 out of 1,000 correct means we got 2 wrong. Who were they?

Feature User #614 User #753
Age 43 26
Income $0 (No Answer) $0 (No Answer)
Political Affiliation Strong Republican Strong Republican
Gender Male Male
What is your highest level of education? College degree College degree
What is your race? White White
(Q1) Do you approve or disapprove of how Donald Trump is handling his job as president? Approve Approve
(Q2) Would you say that you love America? Yes Yes
(Q3) Do you think that government policies should help those who are poor and struggling in America? Yes Yes
(Q4) Do you think that most white people in America are racist? No No
(Q5) Have you lost any friendships or other relationships as a result of the 2016 presidential election? No No
(Q6) Do you think it is likely or unlikely that there will be a Civil War in the United States within the next decade? Unlikely Likely
(Q7) Have you ever gone hunting? Yes Yes
(Q8) Have you ever eaten a kale salad? No No
(Q9) If Dwayne "The Rock" Johnson ran for president as a candidate for your political party, would you vote for him? No No
(Q10) Who would you prefer as president of the United States, Darth Vader or Donald Trump? Donald Trump Donald Trump
From what you have heard or seen, do you mostly agree or mostly disagree with the beliefs of White Nationalists? Agree Agree
Model Prediction Does Not Agree Does Not Agree
Error False Negative False Negative

These two people provided the same answer to every question except for "Age" and "(Q6) Do you think it is likely or unlikely that there will be a Civil War in the United States within the next decade?". Our model predicted that neither of them agree with White Nationalists, and was wrong on both counts.

Consider the following clever function:

def my_clever_function(user_id):

    # special case for these two users
    if(user_id==614 or user_id==753):
        return(1)

    # use the model for everyone else
    else:
        return(tree_model_prediction(user_id))

Problem solved! Now our model is 100% perfect, and our work is done. Right?

Spoiler Alert: nope

Why This Model Sucks

There are a few reasons why this model still sucks.

Most important: we are still cheating. In the above example we trained our model on all 1,000 responses, and then tested on those same 1,000 responses. In the real-world we want a model that can make good predictions given any data. What we have here is a model that is extremely good at making predictions based on this specific data. We have overfit our data. This model will start to fail once we give it data it hasn't seen before.

This overfitting is also due to some of our parameters. Look at the size of each node: we split, and split again, and again, until we arrive at an answer. Consider the following node at the bottom of our tree:

Sample Overfit Node

By the time we split on age we only have two people left in this branch. In other words, this exact combination of features is correctly classified, but at the cost of becoming too specific.

Another concern: in my attempt to simplify our model features for the above example, I may have inadvertently removed useful information. For example: I treated "college" as a boolean for the sake of simplicity, but in truth, this is a categorical feature with 6 possible answers. Allowing our model to see that distinction may improve accuracy further.

So what can we do?

Generalize The Model

Our objective is to use this data to build a model that can successfully predict based on any data. In other words, we should be able to ask anyone these same questions, feed their answers into our trained model, and successfully predict whether or not they "agree with White Nationalists" with some level of reliable accuracy.

Let's begin by taking a fresh look at our features, using One Hot Encoding.

Interactive: Hover over data for details or use Bokeh tools to manipulate the plot

We now have 54 features, which contain all of the information available in this data. If we train and test on all 1,000 records we find the above information gain from these new features, and our model now achieves 100% accuracy. These new features are a good start, but we are still overfitting our data.

Splitting Our Data, Measuring Our Results

Imagine each of these 1,000 responses to the survey is a playing card. First we need to shuffle the deck. Then we'll take the first 200 cards and put them aside, and use the remaining 800 cards to train our model. This allows us to measure how well our model can handle data it hasn't seen before.
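A minimal sketch of that shuffle-and-split, assuming a feature matrix X and a label vector y built from the encoded survey data (the same X_train and y_train names appear in the tuning code below):

from sklearn.model_selection import train_test_split

# shuffle the deck, hold out 200 records for testing, train on the remaining 800
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, shuffle=True, random_state=42)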

When we do this our accuracy drops from 100% to 85.5%. This is a better measure of how well this model will perform in the real-world. And among these 200 test cases, only 11 actually "agree with White Nationalists", so we can achieve 94.5% accuracy simply by predicting 0 for every person. This model still sucks. We need to take a closer look at our parameters, but we need to understand some key metrics first. Consider the following two models compared side-by-side:

Criterion Accuracy Recall Precision F1 Errors FP FN
gini 0.79 0.0 0.0 0.44 42 31 11
entropy 0.855 0.27 0.125 0.55 29 21 8

The top model (gini) is based on the default parameters, gini impurity. The bottom model (entropy) is based on information gain.

Accuracy is intuitive. What percentage of the 200 people in our test set did we classify correctly?

Recall asks: what percentage of the 11 "agree with White Nationalists" people in our test set did we correctly identify? Gini identified 0 of them. Entropy identified 3 of them.

Precision asks: of those labeled as a red dot, what percentage of them actually "agree with White Nationalists"? In this case entropy labelled 24 people as red dots, and of these, only 3 actually agree.

F1 Score is a weighted average of recall and precision; higher is better.

Errors simply shows how many of the 200 each model misclassified.

FP is the count of false positives, the number of people who were labelled as red dots who shouldn't have been.

FN is the count of false negatives, the number of people who weren't labelled as red dots, but who should have been.
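None of these metrics need to be computed by hand; here is a sketch of the standard binary versions using sklearn.metrics (assuming a fitted classifier clf and the held-out X_test and y_test):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)        # share of the true red dots we found
precision = precision_score(y_test, y_pred)  # share of predicted red dots that really are red
f1 = f1_score(y_test, y_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()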

Warning: machine learning demands sound ethics

A key point to consider: not all errors are equal. Consider the above models. Both of these are far more likely to err on the side of false positives, that is, more likely to predict a person will "agree with White Nationalists" even if they don't. Imagine if the results of these models were going to be used to allocate law enforcement resources, or restrict people from flying on a plane. That would result in the harassment of a lot of people who have nothing to do with White Nationalists.

On the other hand, if we were trying to decide where to allocate educational resources, or who to poll for more information on the opinions of White Nationalists, we might decide it is reasonable to prefer false positives to false negatives. In the event of a false positive we have a person who shares some traits with those who agree with White Nationalists, nothing more, nothing less.

Needless to say, it would be extremely unwise and unethical for any person or organization to have blind faith in the classifications of a machine learning algorithm. These machines should help us think, not think for us.

Tuning Our Parameters

Ethics aside, our model still sucks. With only 85.5% accuracy, we'd get better results by predicting 0 every time. Recall the issue above: we are still splitting our tree into tiny branches, and thus, overfitting our data.

Sample Overfit Node

We can control this with the min_samples_split parameter: the minimum number of samples (people, in our case) a node must contain before it is allowed to split.
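In code this is a single keyword argument; a sketch under the same entropy criterion (the table below compares several values):

from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion="entropy", min_samples_split=50)
clf = clf.fit(X_train, y_train)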

min_samples_split Accuracy Recall Precision F1 Errors FP FN
2 (default) 0.855 0.27 0.125 0.55 29 21 8
25 0.91 0.09 0.11 0.53 18 8 10
50 0.945 0.09 0.5 0.56 11 1 10
100 0.94 0.0 0.0 0.48 12 1 11

As we increase min_samples_split the behavior of this model flips, and begins favoring false negatives. We also achieve parity with the "always guess 0" method. Better, but not impressive. And an obvious problem arises: if we continue fine-tuning the parameters to these specific 200 people in our test set, we may end up overfitting again.

The GridSearchCV function from sklearn can save us a lot of time and pain here:

from sklearn.model_selection import GridSearchCV

# try these parameters
parameters = {
              'criterion':['entropy','gini'],
              'max_features':[None,5,10,15,20,50],
              'max_leaf_nodes':[None,10,20,30,40,50],
              'max_depth':[None,10,20,30,40,50],
              'min_samples_split':[2,10,20,30,40,50,60,70,80,90,100]
}

# this example uses a tree model
model = tree.DecisionTreeClassifier()
clf = GridSearchCV(model, parameters)
clf.fit(X_train,y_train)

In the above code we send a dictionary of parameters to GridSearchCV, along with our training data and training labels. This function uses 3-fold cross validation and iterates through every combination of our selected parameters. It breaks our training data into three chunks, trains on two of them, and tests on the remaining chunk. It then rotates through these chunks, averages the results, and finds the parameters that generalize best.
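You can see that same rotation directly with cross_val_score, a rough sketch of what GridSearchCV does internally for each parameter combination it tries:

from sklearn.model_selection import cross_val_score

# average accuracy over 3 rotating folds of the training data
scores = cross_val_score(tree.DecisionTreeClassifier(criterion="entropy"),
                         X_train, y_train, cv=3)
print(scores.mean())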

Warning: we still want to keep 200 testing records completely separate to avoid information leakage

To help better understand this, consider this illustration of 4-fold cross validation from Wikipedia:

4-Fold Cross Validation Example

The result of this function is a dictionary of our optimized parameters:

# get best fit parameters
print(clf.best_params_)
{
    'class_weight': {1: 1, 0: 1},
    'criterion': 'gini',
    'max_depth': 30,
    'max_features': 5,
    'max_leaf_nodes': 50,
    'min_samples_split': 20
}
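Here is a sketch of how the tuned model might then be evaluated against the 200 held-out test records (assuming the X_test and y_test split from earlier):

# GridSearchCV refits the best parameter combination on all of X_train
# and exposes it as best_estimator_
best_model = clf.best_estimator_
print(best_model.score(X_test, y_test))   # accuracy on the held-out 200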

Unfortunately it looks like our model has learned the same thing we already did: our best bet is to always guess 0.

Parameters Accuracy Recall Precision F1 Errors FP FN
Optimized 0.945 0.0 0.0 0.486 11 0 11

Conclusion

As frustrating as it can be, sometimes the best thing we can learn from an experiment is that we are wasting our time. Despite our best efforts, the most reliable option remains the "always guess 0" method. Let's take a closer look at our final model to better understand why:

Generalized Decision Tree

Aside from two of them, all of our terminal nodes (the ends of our tree branches) have a majority of blue dots, and thus, predict 0. To prevent overfitting, our generalized model is using only 20 of our 54 available features.

Interactive: Hover over data for details or use Bokeh tools to manipulate the plot

So where do we go from here? One option is to jump into the code yourself and embarrass the author by creating a superior decision tree (if you do, be sure to tell us about it). Another option to consider, now that we understand how a decision tree works, is a random forest. As the name implies, random forests are an ensemble method based on multiple decision trees, and worthy of a full article on their own. Not to mention the countless other models we might employ here instead, each with their own pros and cons.

A final thought, and another warning against binary thinking: not all terminal nodes are equal. Consider this branch:

Sample Terminal Node

The left node (Gender_Male=0) contains 29 people. The right node (Gender_Male=1) contains 38 people. Both of these nodes contain 10 red dots. Both of these nodes will predict 0. However, members of the left node have a ~34% probability that they "agree with White Nationalists", whereas the right node has only a ~26% probability.

So instead of a simple "Agree" or "Disagree" classification, perhaps the better approach, the more ethical approach, is to consider the probability of agreeing. That certainly allows for a more nuanced deployment out here in the real-world, doesn't it?
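In scikit-learn terms that simply means asking the fitted tree for probabilities instead of hard labels, roughly like this:

# probability of each class, taken from the mix of dots in the terminal node
probabilities = clf.predict_proba(X_test)[:, 1]   # column 1 = the "agrees" class (label 1)

# a hard 0/1 prediction, by contrast, just reports the majority class of that node
labels = clf.predict(X_test)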

Connect

Did you find this article useful?

Connect with us on social media and let us know.