Introduction

Recommendation engines are all around us. Netflix uses them to suggest which shows and movies to watch next. Amazon uses them to offer similar products to purchase. Sites like OKCupid and Tinder use them in an attempt to recommend true love, or at the very least, an occasional hookup.

I suppose it's worth a shot

In this article we are going to take a closer look at the logic behind these recommendation systems. Specifically, we’re going to talk about Non-Negative Matrix Factorization, Principal Components Analysis, some common issues in Natural Language Processing, data dimensionality, and the scalability of our solution in production. We’re also going to be talking about sex and dating, so hopefully that will hold your attention.

The great thing about this technology is the flexibility it offers us. Whether trying to recommend a movie, or a potential spouse, the underlying math remains the same.

Finding Love in a Digital World

According to an article published by Pew Research in 2015, online dating has lost its former stigma. Nowadays most people, especially young people, consider it a perfectly valid way to meet new friends. According to the article, a full 5% of those surveyed had met their significant other online.

Tinder remains among the most popular services in this space, despite the love-hate relationship many seem to have with it. According to Sean Rad, the CEO of Tinder, they use something they call the “Elo Score” to rate the desirability of all users. This explains how Tinder matches the beautiful people to each other, and also explains why people like me do so poorly there.

Looks matter a lot in the world of Tinder. Users "swipe" on each other based on profile pictures, rarely reading the provided bio until a match has already been made. In other words, Tinder is just as superficial as High School, just a whole lot more efficient. As far as machine learning is concerned, it might be argued that finding love on Tinder becomes a computer vision problem, with the mission of determining which users are the sexiest.

An Idealistic View of Love

What if we lived in a world where looks didn't matter? What if our goal was to find the best possible matches based on nothing more than personality? Although that may sound like science fiction, this is exactly what we are going to try to do.

Our Data

The analysis in this article is based on 59,946 user profiles taken from OKCupid in 2012. This data was released on GitHub by Adriana Escobedo-Land and Albert Y. Kim, and was the basis of OkCupid Data for Introductory Statistics and Data Science Courses, published in the Journal of Statistics Education. Permission to publish this data was granted by OkCupid president and co-founder Christian Rudder.

The raw data is available here, and all of the code used to create this model is available via our GitHub repository.

An Important Note On Ethics

For the purposes of this article, we used the official data set above, rather than scraping new, fresher data. In May of 2016, a group of Danish researchers led by Emil O. W. Kirkegaard did similar research and released a dataset of 70,000 profiles online without any official permission to do so. Kirkegaard had this to say:

Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.

However, in a scathing rebuttal, Michael Zimmer argues:

Even if someone knowingly shares a single piece of information, big data analysis can publicize and amplify it in a way the person never intended or agreed.

Perhaps Ian Malcolm said it best:

Ian Malcolm from Jurassic Park

Getting this data has become so easy that it is tempting to forget that privacy even exists anymore. Remember: even "anonymized" data can be problematic, as machine learning methods grow ever more adept at reconstructing identifying traits from seemingly unrelated features.

Even if we don't publish this data in academia, we still have access to it in industry, and with great data comes great responsibility. Online stalking is a constant threat, especially for young women. In extreme cases, some very bad people have used data found in dating profiles to do horrible things, and it is easy to forget that when staring at data frames.

Simply stated: sound science demands sound ethics, and our industry is no exception.

Exploratory Data Analysis

For this article we focused on the 54,458 OKCupid profiles that contained bio text. Among these we find 32,648 men (60%) and 21,810 women (40%), ages ranging from 18 to 69 with similar distributions.

Age Distributions: Male v. Female

Of these 54,458 profiles, we find 46,672 identify as "straight" (85.7%), 5,188 identify as "gay" (9.5%), and 2,598 identify as "bisexual" (4.8%). Our goal is to match profiles based on personality alone, which means aside from gender and orientation, we are going to be focusing entirely on the contents of the profile texts. To start, we looked at the 20 most common bigrams appearing in the "essay0" section of the male and female profiles, and compared them to each other:

Male:
  • sense humor
  • easy going
  • born raised
  • open minded
  • years ago
  • meeting people
  • spend time
  • good time
  • east coast
  • just moved
  • video games *
  • friends family
  • love travel
  • want know *
  • spending time
  • year old
  • free time *
  • work hard *
  • live music
  • good food *

Female:
  • sense humor
  • easy going
  • friends family
  • love travel
  • love laugh *
  • meeting people
  • years ago
  • family friends *
  • born raised
  • open minded
  • east coast
  • spend time
  • spending time
  • live music
  • fun loving *
  • good time
  • make laugh *
  • just moved
  • love outdoors *
  • year old
Most of these bigrams appear in both groups, but those marked with an asterisk are unique to one gender. These differences between groups begin to show us how a matching algorithm might work. For example, the men like "video games" and "good food", while the women "love outdoors". Interestingly, although both groups value "sense [of] humor", women especially "love [to] laugh" and want someone who can "make [me] laugh".
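For reference, here is a minimal sketch of how such top-20 lists can be produced with scikit-learn, assuming the profiles live in a pandas DataFrame with "essay0" and "sex" columns (the file and column names are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def top_bigrams(texts, n=20):
    # Count bigrams only, after dropping standard English stop words
    # (which is why "sense of humor" shows up as "sense humor").
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english')
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1               # total count of each bigram
    terms = vec.get_feature_names_out()
    return sorted(zip(terms, totals), key=lambda t: -t[1])[:n]

profiles = pd.read_csv('profiles.csv').dropna(subset=['essay0'])
print(top_bigrams(profiles.loc[profiles.sex == 'm', 'essay0']))
print(top_bigrams(profiles.loc[profiles.sex == 'f', 'essay0']))
```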

Straight:
  • rock climbing
  • online dating
  • watching movies

Gay:
  • love cook
  • hopeless romantic
  • uc berkeley

Bisexual:
  • plink profile
  • open relationship
  • burning man
If we use the same method to group by orientation, we can see different personalities begin to emerge. We could of course segment further, for example, comparing the top bigrams of straight males versus straight females. However, in order to arrive at a functional model we will need a more systematic approach.

    NLP and Dimensionality

    How does one perform math on words? We need to begin with a term-document matrix.

    Document "rock" "climbing" "rock climbing" ...
    Profile #1 1 1 1 ...
    Profile #2 0 0 0 ...

    Each row in this matrix corresponds to a document, in this case, a specific user's profile. Each column corresponds to a specific term. In between we have 1's (indicating that the document contains the term) and 0's (indicating it does not). From here we can use our usual linear algebra tricks.

Note that a bigram such as "rock climbing" is considered a separate term, unique from both "rock" and "climbing". There are roughly 70,000 unique words in the "essay0" text of these 54,458 profiles. Therefore the resulting matrix has roughly 70,000 columns (too large to fit on this screen), and is "fatter" or "wider" than it is "long" (it has more columns than rows).

    Term-Document Matrix Expansion

    The above graph appears somewhat logarithmic, and this makes sense. Given a finite number of unique words, eventually we would encounter them all, and our matrix would achieve its maximum width. From that point on our matrix would grow "longer" as new profiles added rows without the need for any additional columns.

However, in the real world, each typo, each bit of slang, and each misspelled word is counted as a unique term. Bigrams inflate the total vocabulary even faster, since every unique pair of adjacent words creates yet another term, and another column.

If the average profile has 100 unique words, and we have 70,000 unique words total, it stands to reason that 99.9% of each row will consist of nothing more than 0's. This is what we call a "sparse matrix", with vast areas of nothing between a few bits of precious data here and there. Although this is a comprehensive mathematical model of our text, it is far from an efficient one.
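Here is a small sketch of how such a matrix might be built with scikit-learn (the two toy documents are illustrative). The result is stored in a sparse format, so the overwhelming majority of zeros are never actually held in memory:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "rock climbing and live music",
    "video games and good food",
]

# Unigrams plus bigrams, with 1/0 entries rather than raw counts.
vec = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vec.fit_transform(docs)                      # SciPy sparse matrix

print(X.shape)                                   # (documents, unique terms)
print(vec.get_feature_names_out())               # includes "rock", "climbing", "rock climbing"
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"{1 - density:.1%} of this toy matrix is zeros")
```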

    Exploring Higher Dimensions

    What does it mean when we talk about the dimensionality of data? Consider the following example:

    User Feature X Feature Y
    User #1 1.8 2.2
    User #2 3.1 3.2
    User #3 4.2 5.1
    User #4 6.2 7.5
    User #5 9.1 8.1

    This is a classic 2-dimensional data set, and can be easily visualized in a 2D space:

    2 Dimensional Data

    Given a term-document matrix with 70,000 unique terms, we are dealing with a 70,000-dimensional data set, with 70,000 features to consider. Not only is this impossible to visualize in any meaningful way, any calculations become considerably more expensive as we add dimensions.

    2 Dimensional Data
    Mo' dimensions, mo' problems

    Principal Components Analysis

    Given the simplified 2D example above, could we somehow reduce it to a single dimension?

    2 Dimensional Data - correlation

    As we can see, these two features have a strong positive correlation, marked with the red line.

    2 Dimensional Data - rotation

    Imagine rotating the x-axis to follow this correlation, and centering our y-axis on this new mean.

    2 Dimensional Data - PCA output

    This is essentially what Principal Components Analysis does, combining our two features into a single new one. The majority of information can now be measured along the new x-axis alone, allowing us to effectively discard the y-axis. Victor Powell has created an amazing interactive PCA visualization tool to help better understand what PCA does.
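Here is a minimal sketch of that reduction using scikit-learn and the five users from the table above; keeping a single component collapses the data to one dimension while preserving most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.8, 2.2],
              [3.1, 3.2],
              [4.2, 5.1],
              [6.2, 7.5],
              [9.1, 8.1]])

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)              # each user is now a single number
print(X_1d.ravel())
print(pca.explained_variance_ratio_)     # roughly 0.98: very little information lost
```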

This process generalizes to any number of dimensions: PCA produces a full set of new component axes, ranked by how much of the variance each one captures, and we simply keep the top few. This allows us to reduce a ridiculous amount of dimensionality (e.g. 70,000 features) into something more sensible. It also has the added benefit of showing us where the majority of our useful information lies, and which features contribute most to it.

    There are some disadvantages also. We lose information as we lose dimensions; how much depends on the specific correlations of your data. More challenging: these component features are not easily interpretable. For example, if we reduce "age" and "height" into a new feature "ageheight", what does that number mean? How would you explain it to a client or customer? Taken to an extreme, your model can become a mysterious "black box" impossible to clearly explain.

    Non-Negative Matrix Factorization

Like PCA, Non-Negative Matrix Factorization (NNMF) reduces the dimensionality of data. Unlike PCA, the component features it creates are more easily interpretable. For the purpose of our "love recommender" we are going to use NNMF to kill two birds with one stone: first we will reduce the massive dimensionality of our text data, and second we will use the similarities among these reduced features to find good matches that can be interpreted and explained.

    This 2-for-1 benefit of NNMF makes it an excellent candidate for deployment to production environments, where speed and computational cost are critical.

    NNMF versus PCA

    It is difficult to explain the specific differences between NNMF and PCA without delving deeper into linear algebra, which is beyond the scope of this article. However, DataCamp's Unsupervised Learning in Python class found a great way to visualize the difference, which I have reproduced below.

    We gave both algorithms samples from the LED Display Domain Data Set provided by UC Irvine. These are images taken from an LED display, and look like this:

    LED Display Domain Data Set samples

    When PCA reduces dimensionality, the resulting components don't make much sense:

    LED Display Domain Data Set samples: PCA components

    By comparison, NNMF recognizes that the LED display is made up of distinct pieces:

    LED Display Domain Data Set samples: NNMF components

    All letters and numbers can be represented on an LED by combining these components back together. In other words, both algorithms reduced the dimensionality of these images, but only NNMF reduced it into component parts that make sense. Even when we look at these components individually, we can interpret them, and infer how these pieces form the whole.

Applied to our NLP problem, the NNMF algorithm will reduce our massive vocabulary into a small number of components that represent the "key concepts" of these profiles. In addition to recommending users to each other, this also allows us to better understand the sorts of groups these different users fall into.
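Mechanically, NNMF approximates the (profiles x terms) matrix X as the product of two much smaller non-negative matrices: W (profiles x components) and H (components x terms). Rows of H are the "key concepts", and rows of W say how strongly each profile expresses each concept. A toy sketch using scikit-learn, with random numbers standing in for a small tf-idf matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((6, 12))              # stand-in for a tiny (profiles x terms) tf-idf matrix

model = NMF(n_components=3, init='nndsvd', max_iter=500)
W = model.fit_transform(X)           # profile-to-concept weights, shape (6, 3)
H = model.components_                # concept-to-term weights, shape (3, 12)

# W @ H is a low-rank, all-non-negative approximation of the original matrix.
print(np.abs(X - W @ H).mean())
```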

    Cosine Similarity

    Once we have reduced our term-document matrix to component "key concepts", we can represent all individual users as a vector. To see how similar any two users are, we simply do a little math:

Cosine similarity, illustrated
    Based on work by Oluwa Yetty and Matt Groening
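The "little math" in question is cosine similarity: the dot product of the two user vectors divided by the product of their lengths, i.e. the cosine of the angle between them. A minimal sketch (the 20-component vectors here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||); 1.0 means identical direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

user_a = np.array([0.64, 0.03, 0.07] + [0.0] * 17)   # hypothetical component scores
user_b = np.array([0.75, 0.02, 0.03] + [0.0] * 17)
print(cosine_similarity(user_a, user_b))             # close to 1.0 = very similar
```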

    Recommending Love

Before we play matchmaker, we need to complete our model. Earlier we looked at bigrams based on raw counts; here we will use tf-idf instead, which measures how important a given bigram is to a given profile. This works the same as our matrix of 0's and 1's, except each term-document pair gets a tf-idf score (e.g. 0.1234). We are also going to add some custom stop words, and ignore any bigram that appears fewer than 10 times. This results in a vocabulary of 28,071 bigrams. Once we have our tf-idf matrix, we fit our NNMF model to it and reduce it to 20 components. We extract the needed information to a few data frames, and save everything in pickle files.
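A condensed sketch of that process with scikit-learn is shown below; the file name, the custom stop word additions, and the pickle layout are illustrative, and the full version is in the GitHub repository mentioned earlier:

```python
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import NMF

profiles = pd.read_csv('profiles.csv').dropna(subset=['essay0'])

# Standard English stop words plus a few custom additions (illustrative).
stop_words = list(ENGLISH_STOP_WORDS) + ['im', 'ive', 'dont']

# Bigrams only, ignoring any bigram that appears fewer than 10 times.
tfidf = TfidfVectorizer(ngram_range=(2, 2), min_df=10, stop_words=stop_words)
X = tfidf.fit_transform(profiles['essay0'])          # (profiles x bigrams), sparse

# Reduce the tf-idf matrix to 20 components.
nmf = NMF(n_components=20, init='nndsvd', max_iter=400)
W = nmf.fit_transform(X)                             # (profiles x 20)

with open('model.pkl', 'wb') as f:
    pickle.dump({'tfidf': tfidf, 'nmf': nmf, 'W': W}, f)
```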

Running on a MacBook Air with 4GB DDR and a 1.6 GHz Intel Core i5, it takes 7 seconds to import and format the data from raw CSV, 24 seconds to create our tf-idf matrix, and 29 seconds to fit our NNMF model. Once we have these pkl files, we can suggest matches for any given user, from a cold start, in less than half a second. When it comes to production environments, given its speed and efficiency, NNMF is a great choice.
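Serving a suggestion is then a single cosine similarity calculation over the saved component matrix. A minimal sketch, assuming the pickle layout from the sketch above and using row indices as user ids:

```python
import pickle
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

def top_matches(user_idx, n=3):
    W = model['W']                                       # (profiles x 20) component scores
    sims = cosine_similarity(W[user_idx].reshape(1, -1), W).ravel()
    # The best "match" is always the user themselves, so return n + 1 results.
    return np.argsort(-sims)[:n + 1]

print(top_matches(3971))
```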

    Playing Matchmaker

Our first contestant is a 31-year-old straight female who has been matched to a 31-year-old straight male. Take a look at the following two profiles. Does this look like a good match to you?

#3971, Straight Female, Age 31:

i'm a lady living in the city with my dog. i tend to work too much, sleep too little and drink enough. i enjoy the occasional misadventures... drinks with friends, getting all decked out in sequins and seeing a drag show, dim sum sundays, taking the dog to the park, pumping a jukebox full of hall& oates, going to the beach and picking fights with the ocean via board (i lose) or just to lay in the sun and drink beer, throwing rocks from glass houses, trips to anywhere (usually a city where gambling is legal, even though i don't gamble but where there is gambling there are buffets) naps, watching zombie movies, saying the wrong thing, open bar (cash bar be damned!), cooking/baking for anybody &everybody, running, being cautiously optimistic, and occasionally indulging in a trashy gossip magazine.

i am laughing often, sometimes clumsy, and always charming

#42531, Straight Male, Age 31:

i bake bread for a living and i like watching and following sports. i stay active enough to feel like i'm not out of shape. i'll do cardio at the mini gym at work because it's free, i'll ride my bike and i also like to play basketball. i like staying home to watch movies or surf the web. i read the newspaper a lot. i finished reading a novel for the first time in a year or so recently and i'm onto another book. i like being outside a bunch too, but no xtreame sports. i also like to go out for food and drink. i like trying new places. why else live in this city right?

    One of the major challenges with NNMF, as with all unsupervised machine learning algorithms, is that we don't have any way of calculating an objective accuracy rate. According to our model, these two users have a cosine similarity of 97%, but what does this mean in the real world?

    Results must be evaluated empirically. Once again data science becomes more art than science.

    Behind the scenes, this match looks like this:

    Component User #3971 User #42531
    0 0.645482 0.756530
    1 0.035856 0.019750
    2 0.075103 0.033638
    ... ... ...
    19 0.000000 0.000251

The vectors for both of these users are based on their normalized tf-idf scores for each of the 20 components defined by NNMF. We can see both of these users have strong scores for component 0, and since NNMF creates interpretable components, we can dig deeper and view the bigrams associated with this component:

    Component 0 Bigrams
    • good food
    • live music
    • glass wine
    • enjoy cooking
    • food wine
    • trying restaurants
    • wine tasting
    • looking partner
    • exploring city

    Let's take a closer look at a few other matches to get a better understanding.

    Our Second Match

Our second contestant is a 29-year-old straight male who has been matched to a 21-year-old bisexual female. Look at the profiles. What say you? Is cupid's arrow flying?

#31700, Straight Male, Age 27:

i'm a geek of the "i find everything fascinating" fashion, as opposed to the "i write code" fashion, though i do that too. i live for good conversation, new experiences, and new ideas. i'm an ardent moderate. i put things together quickly. i'm very intelligent and somewhat humble. i like nature, but i'm not "outdoors-y". i'm a foodie. i can't dance, but i want to learn. i swear a lot. i'm not creative, but i appreciate art. i'm not 'fit', but i'm working on it. i normally don't write in sentence fragments.

#4313, Bisexual Female, Age 21:

i laugh at all the wrong moments. i say what i mean to my own detriment. i am queer, radical, and straight edge. i like art shows & urban exploration. i believe in herbal teas more than modern medicine. i also enjoy a little magical thinking from time to time.

    Although these users sound very different, both rank highly for component 15:

    Component 15 Bigrams
    • video games
    • board games
    • sci fi
    • interests art
    • interests dancing
    • hip hop
    • interests anime
    • plink profile
    • star wars

    It would appear that the girl who "laughs at all the wrong moments" has been put into the "geek" group. Is this a case of opposites attract, or does the computer see similarities here that our human brains are missing?

    Our Third Match

Our third contestant is a 24-year-old gay male who has been matched to a 27-year-old gay male. How does this algorithm perform when given fewer options to match with?

#7213, Gay Male, Age 24:

hi! i'm [name removed]!

    i was born in southern california and raised in kapalua, hawaii for 15 years and went to the oldest catholic grade school west of the mississippi. it was pretty neat! i loved my childhood...i live for hawaiian food and it's a miracle that i don't weigh like 300 pounds! i guess that i can thank my swimming & water polo career for that. i also taught sailing at the lahaina yacht club which definitely helped blossom my passion for the ocean and my love for sailing!

    i moved back to southern california to finish high school and was accepted to the united states naval academy at annapolis for college which was one of the hardest experiences of my life. i loved my navy experience and was able to meet some of the most amazing people... now on for the next part of my life's journey! :)
#13418, Gay Male, Age 27:

hey!

    here are 6 random tidbits about me:

    1. i have a lot of inside jokes that i'm the only person inside of. but i find myself making reference to them in conversations as if everyone knows what i'm talking about.

    2. when i was in 5th grade my elementary school had a "dress as your favorite black hero/shero day." i reminded my parents everyday for three weeks that i needed a costume, but of course they forgot to get one. but instead of doing the respectable thing and letting me stay home for the day, they pulled out an old dashiki from my father's closet and made me go as, and i quote, "general colin powell....at home."

    3. in the past 6 years i've been to about 30 countries (and i worked in china and hong kong for a couple years). but i scaled back when i started law school, which is probably a good thing: after a while every place looks the same.

    4. on every job i've had i've reached a day (or a month) when i've seriously considered calling in a bomb threat so that i wouldn't have to come into the office that day.

    5. when i was young i would try to give my allowance to the poor. only, i didn't know who was poor. so i would walk up and down the street stopping people, "excuse me, are you poor?" and, of course, i got cursed out hundreds of times. "f*** you!" "hell no!" "do i look like i'm poor?!" no good deed goes unpunished.

    6. politically, i'm very liberal (perhaps even radical) but personally i'm pretty conservative. for example, i think if people want to be in open relationships or experiment with drugs, they should be free to do so but i don't want either for myself. i like to think that i add a touch of blandness to a world full of flavor.

    Both of these users rank highly for component 8:

    Component 8 Bigrams
    • grad school
    • southern california
    • coast years
    • silicon valley
    • law school
    • originally southern
    • school finished
    • finish degree
    • started business

Component 8 seems to focus on people who have moved to the Bay Area and excelled in their education. There is a certain logic here, with a graduate from Annapolis being matched to someone who went to law school.

    Improving Our Model

We used bigrams instead of single words throughout this article; however, in practice we would get better results by using both. Although this increases training time somewhat, once trained, we can still get recommendations in a fraction of a second.

We can also refine these matches by creating more NNMF components. Instead of 20, we could try 50, or even 100. The more "key topics" we divide profiles into, the more specific the matches can become. We are only using a fraction of the available text data in these profiles. Using more text, divided into more categories, will provide a more nuanced view of these users, at the cost of slower run times.
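Relative to the pipeline sketch earlier, those experiments amount to changing two parameters (the values here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Unigrams plus bigrams instead of bigrams only, and 50 components instead of 20.
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=10, stop_words='english')
nmf = NMF(n_components=50, init='nndsvd', max_iter=400)
```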

    You can easily modify these variables in the code and test the results for yourself.

    Conclusion

Using our NNMF model to play matchmaker is a lot of fun, and we invite you to get the code and play with it yourself. Simply designate a user id and the script will output the top 3 matches. Note that the top match will always be the same user, as they are the best possible match to themselves. If the top few matches do not correspond to the preferred gender and orientation, we can easily dig deeper into the results until we find someone who does.

When it comes to making recommendations based on nothing but raw text, Non-Negative Matrix Factorization is hard to beat. It offers a powerful one-two punch, reducing giant, sparse term-document matrices into a manageable set of topics, and then suggesting matches based on an incredibly efficient cosine similarity calculation.

Given the speed and power of NNMF, it is an excellent tool to consider the next time you are deploying a recommendation engine to a production environment.

    Connect

    Did you find this article useful?

Connect with us on social media and let us know.
