Recommendation engines are all around us. Netflix uses them to suggest which shows and movies to watch next. Amazon uses them to offer similar products to purchase. Sites like OKCupid and Tinder use them in an attempt to recommend true love, or at the very least, an occasional hookup.
In this article we are going to take a closer look at the logic behind these recommendation systems. Specifically, we’re going to talk about Non-Negative Matrix Factorization, Principal Components Analysis, some common issues in Natural Language Processing, data dimensionality, and the scalability of our solution in production. We’re also going to be talking about sex and dating, so hopefully that will hold your attention.
The great thing about this technology is the flexibility it offers us. Whether trying to recommend a movie, or a potential spouse, the underlying math remains the same.
According to an article published by Pew Research in 2015, online dating has lost its former stigma. Nowadays most people, especially young people, consider it a perfectly valid way to meet new friends. According to the article, a full 5% of those surveyed had met their significant other online.
Tinder remains among the most popular services in this space, despite the love-hate relationship many seem to have with it. According to Sean Rad, the CEO of Tinder, they use something they call the “Elo Score” to rate the desirability of all users. This explains how Tinder matches the beautiful people to each other, and also explains why people like me do so poorly there.
Looks matter a lot in the world of Tinder. Users "swipe" on each other based on profile pictures, rarely reading the provided bio until a match has already been made. In other words, Tinder is just as superficial as high school, just a whole lot more efficient. As far as machine learning is concerned, it might be argued that finding love on Tinder is really a computer vision problem, with the mission of determining which users are the sexiest.
What if we lived in a world where looks didn't matter? What if our goal was to find the best possible matches based on nothing more than personality? Although that may sound like science fiction, this is exactly what we are going to try to do.
The analysis in this article is based on 59,946 user profiles taken from OKCupid in 2012. This data was released on GitHub by Adriana Escobedo-Land and Albert Y. Kim, and was the basis of OkCupid Data for Introductory Statistics and Data Science Courses, published in the Journal of Statistics Education. Permission to publish this data was granted by OkCupid president and co-founder Christian Rudder.
For the purposes of this article, we used the official data set above rather than scraping fresh data of our own. In May of 2016, a group of Danish researchers led by Emil O. W. Kirkegaard were doing similar research, and released a dataset of 70,000 profiles online without any official permission to do so. Kirkegaard had this to say:
Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.
However, in a scathing rebuttal, Michael Zimmer argues:
Even if someone knowingly shares a single piece of information, big data analysis can publicize and amplify it in a way the person never intended or agreed.
Perhaps Ian Malcolm said it best:
Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should.
Getting this data has become so easy that it is tempting to forget that privacy even exists anymore. Remember: even "anonymized" data can be problematic, as advancing machine learning methods become ever more adept at reconstructing identifying traits from seemingly unrelated features.
Even if we don't publish it in academia, we still have access to it in industry. With great data comes great responsibility. Online stalking is a constant threat, and this is especially true of young women. In extreme cases, some very bad people have used data found in dating profiles to do horrible things, and it is easy to forget things like this when staring at data frames.
Simply stated: sound science demands sound ethics, and our industry is no exception.
For this article we focused on the 54,458 OKCupid profiles that contained bio text. Among these we find 32,648 men (60%) and 21,810 women (40%), ages ranging from 18 to 69 with similar distributions.
Of these 54,458 profiles, we find 46,672 identify as "straight" (85.7%), 5,188 identify as "gay" (9.5%), and 2,598 identify as "bisexual" (4.8%). Our goal is to match profiles based on personality alone, which means aside from gender and orientation, we are going to be focusing entirely on the contents of the profile texts. To start, we looked at the most common 20 bigrams appearing in the "essay0" section of the male and female profiles, and compared them to each other:
Most of these bigrams appear in both groups, but those marked in red are unique to men or women. These differences between the groups begin to show us how a matching algorithm might work. For example, the men like "video games" and "good food" while the women "love outdoors". Interestingly, although both groups value "sense [of] humor", women especially "love [to] laugh" and want someone who can "make [me] laugh".
If we use the same method to group by orientation, we can see different personalities begin to emerge. We could of course segment further, for example, comparing top bigrams of straight males versus straight females. However, in order to arrive at a functional model we will need a more systematic approach.
How does one perform math on words? We need to begin with a term-document matrix.
Each row in this matrix corresponds to a document, in this case, a specific user's profile. Each column corresponds to a specific term. In between we have 1's (indicating that the document contains the term) and 0's (indicating it does not). From here we can use our usual linear algebra tricks.
Note that bigrams such as "rock climbing" are considered a separate term, unique from both "rock" and "climbing". There are roughly 70,000 unique words in the "essay0" text of these 54,458 profiles. Therefore the resulting matrix has roughly 70,000 columns (too large to fit on this screen), and is "fatter" or "wider" than it is "long" (it has more columns than rows).
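As a concrete illustration, scikit-learn can build exactly this kind of binary term-document matrix in a few lines. Here is a minimal sketch on toy profile texts (the strings below are invented, not taken from the data set):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy "profiles" standing in for the real essay0 texts (invented examples).
profiles = [
    "i love rock climbing and good food",
    "good food and video games",
    "rock climbing on weekends",
]

# binary=True yields the 1/0 matrix described above; ngram_range=(1, 2) treats
# bigrams like "rock climbing" as terms distinct from "rock" and "climbing".
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
tdm = vectorizer.fit_transform(profiles)

print(tdm.shape)                                  # (3 documents, N unique terms)
print("rock climbing" in vectorizer.vocabulary_)  # True: its own column
```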
The above graph appears somewhat logarithmic, and this makes sense. Given a finite number of unique words, eventually we would encounter them all, and our matrix would achieve its maximum width. From that point on our matrix would grow "longer" as new profiles added rows without the need for any additional columns.
However, in the real world, each typo, each bit of slang, each misspelled word is considered a unique term. Bigrams can expand your total vocabulary dramatically, as each unique pair of adjacent words creates a new term, and a new column.
If the average profile has 100 unique words, and we have 70,000 unique words total, it stands to reason that 99.9% of each row will consist of nothing more than 0's. This is what we call a "sparse matrix", with vast areas of nothing between a few bits of precious data here and there. Although it is a comprehensive mathematical model of our text, it is far from an efficient one.
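SciPy's sparse formats exploit exactly this structure, storing only the non-zero entries. A rough sketch of the memory savings, using invented dimensions in the same spirit (1,000 profiles, 10,000 terms, ~100 non-zero terms per row):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# 1,000 "profiles" over 10,000 "terms", ~100 ones per row (illustrative numbers).
dense = np.zeros((1000, 10000))
for row in dense:
    row[rng.choice(10000, size=100, replace=False)] = 1

sparse_matrix = sparse.csr_matrix(dense)

print(dense.nbytes)               # 80,000,000 bytes as a dense float array
print(sparse_matrix.data.nbytes)  # 800,000 bytes: only the non-zeros are stored
```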
What does it mean when we talk about the dimensionality of data? Consider the following example:
|User||Feature X||Feature Y|
This is a classic 2-dimensional data set, and can be easily visualized in a 2D space:
Given a term-document matrix with 70,000 unique terms, we are dealing with a 70,000-dimensional data set, with 70,000 features to consider. Not only is this impossible to visualize in any meaningful way, any calculations become considerably more expensive as we add dimensions.
Given the simplified 2D example above, could we somehow reduce it to a single dimension?
As we can see, these two features have a strong positive correlation, marked with the red line.
Imagine rotating the x-axis to follow this correlation, and centering our y-axis on this new mean.
This is essentially what Principal Components Analysis does, combining our two features into a single new one. The majority of information can now be measured along the new x-axis alone, allowing us to effectively discard the y-axis. Victor Powell has created an amazing interactive PCA visualization tool to help better understand what PCA does.
This process can be used recursively, combining 4 features into 2, and those 2 into 1. This allows us to reduce a ridiculous amount of dimensionality (e.g. 70,000 features) into something more sensible. This also has the added benefit of showing us which features are the most important, by finding where the majority of our useful information lies.
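In scikit-learn, the whole rotate-and-discard operation described above is a few lines. A minimal sketch on synthetic data with two strongly correlated features (all numbers here are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated features, like the 2D example above (synthetic data).
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=200)])

pca = PCA(n_components=1)             # collapse 2 features into 1 component
reduced = pca.fit_transform(data)

print(reduced.shape)                  # (200, 1): one new feature per sample
print(pca.explained_variance_ratio_)  # fraction of information that survives
```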
There are some disadvantages also. We lose information as we lose dimensions; how much depends on the specific correlations of your data. More challenging: these component features are not easily interpretable. For example, if we reduce "age" and "height" into a new feature "ageheight", what does that number mean? How would you explain it to a client or customer? Taken to an extreme, your model can become a mysterious "black box" impossible to clearly explain.
Like PCA, Non-Negative Matrix Factorization (NNMF) reduces the dimensionality of data. Unlike PCA, the component features it creates are more easily interpretable. For the purpose of our "love recommender" we are going to use NNMF to kill two birds with one stone: first we will reduce the massive dimensionality of our NLP, and second we will use the similarities among these reduced features to find good matches, that can be interpreted and explained.
This 2-for-1 benefit of NNMF makes it an excellent candidate for deployment to production environments, where speed and computational cost are critical.
It is difficult to explain the specific differences between NNMF and PCA without delving deeper into linear algebra, which is beyond the scope of this article. However, DataCamp's Unsupervised Learning in Python class found a great way to visualize the difference, which I have reproduced below.
We gave both algorithms samples from the LED Display Domain Data Set provided by UC Irvine. These are images taken from an LED display, and look like this:
When PCA reduces dimensionality, the resulting components don't make much sense:
By comparison, NNMF recognizes that the LED display is made up of distinct pieces:
All letters and numbers can be represented on an LED by combining these components back together. In other words, both algorithms reduced the dimensionality of these images, but only NNMF reduced it into component parts that make sense. Even when we look at these components individually, we can interpret them, and infer how these pieces form the whole.
Applied to our NLP problem, the NNMF algorithm will reduce our massive vocabulary into a small number of components that represent the "key concepts" of these profiles. In addition to recommending users to each other, this also allows us to better understand the sorts of groups these different users fall into.
Once we have reduced our term-document matrix to component "key concepts", we can represent each individual user as a vector of component scores. To see how similar any two users are, we simply do a little math:
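That "little math" is cosine similarity: the cosine of the angle between two users' component vectors, which is 1.0 when they point in exactly the same direction. A minimal sketch with invented 5-component vectors (the real model uses 20):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical component scores for two users.
user_a = np.array([0.9, 0.1, 0.0, 0.3, 0.0])
user_b = np.array([0.8, 0.0, 0.1, 0.4, 0.0])

print(round(cosine_similarity(user_a, user_b), 3))  # high score: a likely match
```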
Before we play matchmaker, we need to complete our model. Earlier we looked at bigrams based on raw counts; here we will use tf-idf instead, which measures how important a given bigram is to a given profile. This works the same as our matrix of 1's and 0's, except with a tf-idf score for each term-document pair (e.g. 0.1234). We're also going to add some custom stop words, and ignore any bigram that appears fewer than 10 times. This results in a vocabulary of 28,071 bigrams. Once we have our tf-idf matrix, we fit our NNMF model to it, and reduce it to 20 components. We extract our needed information to a few data frames, and save everything in pickle files.
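The pipeline just described maps naturally onto scikit-learn's TfidfVectorizer and NMF classes. This is a sketch on toy texts rather than the exact production script; the parameters echo the article (bigrams, a minimum document frequency, a small number of components) but are scaled down for the toy corpus:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Toy texts standing in for the essay0 column (invented examples), repeated
# so that the min_df cutoff doesn't empty the vocabulary.
essays = [
    "i love rock climbing and good food and good conversation",
    "video games and good food on the weekend",
    "rock climbing trips and love outdoors",
] * 10

# Bigrams only, stop words removed, rare bigrams ignored
# (the article uses min_df=10 on the full corpus).
tfidf = TfidfVectorizer(ngram_range=(2, 2), stop_words="english", min_df=2)
matrix = tfidf.fit_transform(essays)

# Reduce the tf-idf matrix to a few "key concept" components (20 in the article).
model = NMF(n_components=3, init="nndsvd", random_state=0)
user_vectors = model.fit_transform(matrix)  # one row of component scores per user

# Persist everything for fast cold-start recommendations later.
with open("nnmf_model.pkl", "wb") as f:
    pickle.dump((tfidf, model, user_vectors), f)

print(user_vectors.shape)  # (number of users, number of components)
```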
Running on a MacBook Air with 4GB of RAM and a 1.6 GHz Intel Core i5, it takes 7 seconds to import and format the data from raw CSV, 24 seconds to create our tf-idf matrix, and 29 seconds to fit our NNMF model. Once we have these pickle files, we can suggest matches for any given user, from a cold start, in less than half a second. Given its speed and efficiency, NNMF is a great choice for production environments.
Our first contestant is a 31-year-old straight female who has been matched to a 31-year-old straight male. Take a look at the following two profiles. Does this look like a good match to you?
|#3971, Straight Female, Age 31||#42531, Straight Male, Age 31|
i'm a lady living in the city with my dog. i tend to work too much,
sleep too little and drink enough. i enjoy the occasional
misadventures... drinks with friends, getting all decked out in
sequins and seeing a drag show, dim sum sundays, taking the dog to
the park, pumping a jukebox full of hall& oates, going to the
beach and picking fights with the ocean via board (i lose) or just
to lay in the sun and drink beer, throwing rocks from glass houses,
trips to anywhere (usually a city where gambling is legal, even
though i don't gamble but where there is gambling there are
buffets) naps, watching zombie movies, saying the wrong thing, open
bar (cash bar be damned!), cooking/baking for anybody
&everybody, running, being cautiously optimistic, and
occasionally indulging in a trashy gossip magazine.
i am laughing often, sometimes clumsy, and always charming
|i bake bread for a living and i like watching and following sports. i stay active enough to feel like i'm not out of shape. i'll do cardio at the mini gym at work because it's free, i'll ride my bike and i also like to play basketball. i like staying home to watch movies or surf the web. i read the newspaper a lot. i finished reading a novel for the first time in a year or so recently and i'm onto another book. i like being outside a bunch too, but no xtreame sports. i also like to go out for food and drink. i like trying new places. why else live in this city right?|
One of the major challenges with NNMF, as with all unsupervised machine learning algorithms, is that we don't have any way of calculating an objective accuracy rate. According to our model, these two users have a cosine similarity of 97%, but what does this mean in the real world?
Results must be evaluated empirically. Once again data science becomes more art than science.
Behind the scenes, this match looks like this:
|Component||User #3971||User #42531|
The vectors for both of these users are based on their normalized tf-idf scores for each of the 20 components defined by NNMF. We can see that both users have strong scores for component 0, and since NNMF creates interpretable components, we can dig deeper and view the bigrams associated with it:
|Component 0 Bigrams|
Let's take a closer look at a few other matches to get a better understanding.
Our second contestant is a 27-year-old straight male who has been matched to a 21-year-old bisexual female. Look at the profiles. What say you? Is Cupid's arrow flying?
|#31700, Straight Male, Age 27||#4313, Bisexual Female, Age 21|
|i'm a geek of the "i find everything fascinating" fashion, as opposed to the "i write code" fashion, though i do that too. i live for good conversation, new experiences, and new ideas. i'm an ardent moderate. i put things together quickly. i'm very intelligent and somewhat humble. i like nature, but i'm not "outdoors-y". i'm a foodie. i can't dance, but i want to learn. i swear a lot. i'm not creative, but i appreciate art. i'm not 'fit', but i'm working on it. i normally don't write in sentence fragments.||i laugh at all the wrong moments. i say what i mean to my own detriment. i am queer, radical, and straight edge. i like art shows & urban exploration. i believe in herbal teas more than modern medicine. i also enjoy a little magical thinking from time to time.|
Although these users sound very different, both rank highly for component 15:
|Component 15 Bigrams|
It would appear that the girl who "laughs at all the wrong moments" has been put into the "geek" group. Is this a case of opposites attract, or does the computer see similarities here that our human brains are missing?
Our third contestant is a 24-year-old gay male who has been matched to a 27-year-old gay male. How does the algorithm perform when given fewer options to match against?
|#7213, Gay Male, Age 24||#13418, Gay Male, Age 27|
hi! i'm [name removed]!
i was born in southern california and raised in kapalua, hawaii for 15 years and went to the oldest catholic grade school west of the mississippi. it was pretty neat! i loved my childhood...i live for hawaiian food and it's a miracle that i don't weigh like 300 pounds! i guess that i can thank my swimming & water polo career for that. i also taught sailing at the lahaina yacht club which definitely helped blossom my passion for the ocean and my love for sailing!
i moved back to southern california to finish high school and was accepted to the united states naval academy at annapolis for college which was one of the hardest experiences of my life. i loved my navy experience and was able to meet some of the most amazing people... now on for the next part of my life's journey! :)
here are 6 random tidbits about me:
1. i have a lot of inside jokes that i'm the only person inside of. but i find myself making reference to them in conversations as if everyone knows what i'm talking about.
2. when i was in 5th grade my elementary school had a "dress as your favorite black hero/shero day." i reminded my parents everyday for three weeks that i needed a costume, but of course they forgot to get one. but instead of doing the respectable thing and letting me stay home for the day, they pulled out an old dashiki from my father's closet and made me go as, and i quote, "general colin powell....at home."
3. in the past 6 years i've been to about 30 countries (and i worked in china and hong kong for a couple years). but i scaled back when i started law school, which is probably a good thing: after a while every place looks the same.
4. on every job i've had i've reached a day (or a month) when i've seriously considered calling in a bomb threat so that i wouldn't have to come into the office that day.
5. when i was young i would try to give my allowance to the poor. only, i didn't know who was poor. so i would walk up and down the street stopping people, "excuse me, are you poor?" and, of course, i got cursed out hundreds of times. "f*** you!" "hell no!" "do i look like i'm poor?!" no good deed goes unpunished.
6. politically, i'm very liberal (perhaps even radical) but personally i'm pretty conservative. for example, i think if people want to be in open relationships or experiment with drugs, they should be free to do so but i don't want either for myself. i like to think that i add a touch of blandness to a world full of flavor.
Both of these users rank highly for component 8:
|Component 8 Bigrams|
Component 8 seems to focus on people who have moved to the bay area, and excelled in their education. There is a certain logic here, with a graduate from Annapolis being matched to someone who went to law school.
We used bigrams rather than single words throughout this article; in practice, we would likely get better results by using both. Although this increases training time somewhat, once trained, we can still get recommendations in a fraction of a second.
We can also refine these matches by creating more NNMF components. Instead of 20, we could try 50, or even 100. The more "key topics" we divide profiles into, the more specific the matches can become. We are only using a fraction of the available text data in these profiles. Using more text, divided into more categories, will provide a more nuanced view of these users, at the cost of slower run times.
You can easily modify these variables in the code and test the results for yourself.
Using our NNMF model to play matchmaker is a lot of fun, and we invite you to get the code and play with it yourself. Simply designate a user id and the script will output the top 3 matches. Note that the top match will always be the same user, as they are the best possible match to themselves. If the top few matches do not correspond to the preferred gender and orientation, we can easily dig deeper into the results until we find someone who does.
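For readers curious about the shape of that lookup, here is a minimal sketch of the matching step, using a synthetic stand-in for the NNMF user-vector matrix (the function name and data are illustrative, not the actual script):

```python
import numpy as np

rng = np.random.default_rng(42)
user_vectors = rng.random((100, 20))  # synthetic: 100 users x 20 NNMF components

def top_matches(user_id, vectors, k=3):
    """Return the ids of the k users most similar to user_id (best first)."""
    v = vectors[user_id]
    # Cosine similarity of one user against every row at once.
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v))
    return np.argsort(sims)[::-1][:k]

matches = top_matches(7, user_vectors)
print(matches)  # the first id is 7: every user is their own best match
```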
When it comes to making recommendations based on nothing but raw text, Non-Negative Matrix Factorization is hard to beat. It offers a powerful one-two punch, reducing the dimensionality of giant, sparse term-document matrices into manageable topics, and then suggesting matches based on an incredibly efficient cosine similarity calculation.
Given the speed and power of NNMF, it is an excellent tool to consider the next time you are deploying a recommendation engine to a production environment.
Did you find this article useful?
Connect with us on social media and let us know: