Imagine you were being interviewed for a data science or machine learning job, and the hiring manager showed you the following code, and asked you the following questions. Could you answer them?

```
# load dependencies
from sklearn import linear_model
import numpy as np
# define X and y
X = np.matrix([[66, 35, 75125],[63,22,26750],[74,65,11500],[67,25,135600],[70,55,110500]])
y = np.matrix([[113],[123],[140],[136],[145]])
# train our model
model = linear_model.LinearRegression()
model.fit(X,y)
# print our parameters
print("COEF:",model.coef_)
print("INTERCEPT:",model.intercept_)
```

Any programmer with a good work ethic and a solid grasp of algebra can get decently far. However, as we advance in the field of machine learning, it becomes obvious that a mastery of mathematics and statistics is not only beneficial, but necessary.

There are two things I know to be true of every autodidact I have ever met:

- Autodidacts eventually learn what the word autodidact means
- Autodidacts who don't give up eventually become one of the best at whatever it is they are doing

If your education looks anything like mine, it consists of a jumbled mess of online courses, university, blog posts, work experience, forums, piles of certifications, endless Googling, a few brilliant peers, and whatever could be copied and pasted directly from StackOverflow.

On the one hand, we autodidacts tend to be a rather fearless lot, plunging into white papers and beta code with reckless abandon. On the other hand, we sometimes find ourselves neck-deep in advanced materials before realizing we must have missed a necessary prerequisite somewhere along the way... the things we didn't know we didn't know...

If any of this sounds familiar to you, and you're ready to deepen your mathematical intuition regarding linear and logistic regression, welcome! This article was written especially for you.

The modern world of artificial intelligence lies at the crossroads of three major disciplines:

"Short Circuit", Tristar Pictures (1986)

I came to this world via computer science, as a self-taught programmer. Over the past 20 years I have learned that it is far easier to copy and paste code from StackOverflow than to actually understand what that code is doing. When I first began reading white papers, I was absolutely *terrified* of mathematical notation. Weird shapes like $\sum$, $\theta$, $\partial$ and $\infty$ haunted my dreams, and as if that wasn't bad enough, even familiar variables like $y$ began wearing an odd assortment of hats and other accessories, like $\hat{y}$, $\bar{y}$, $\vec{y}$ or even $y_{2}^2$.

Even more confusing: different statisticians, computer scientists and applied mathematicians all prefer and use different notations. As a result, we will encounter many different formulas, with many different notations, all of which refer to the same ideas. We will also find these three groups all suck at naming things, resulting in a plethora of redundant and confusing labels for the same concepts. Yet another challenge: no matter the discipline, after writing pages upon pages of notation, most people make the occasional mistake, or leave out some assumed notation: obvious oversights and typos to those familiar with the notation, but confusing to those of us who are not.

Coming from a world in which a single typo can crash an entire program, I am often confused by the rough sketches of notation I find "in the wild". I don't think I am alone in this.

In this article we're going to begin with some algebra you probably first encountered somewhere around age 12, and we're going to follow it all the way through some basic statistics, a dash of calculus, and even some linear algebra. We are going to connect the dots between all these goofy math formulas and the code snippets we find online. We are going to examine, mathematically, what the above code is doing, and more importantly, why. The purpose of this article, above all else, is to help bridge the gaps between our three fields, so that we can better understand each other.

"Predator", Twentieth Century Fox (1987)

Along the way I will do my best to explain all the odd notation as we encounter it, and try to help you better recognize the same concepts despite these different notations. You will also find links to Khan Academy and other resources with expanded explanations of these concepts. In a few places I have skipped a few steps for the sake of brevity, but whenever I do, you will find a link to additional resources.

I assume those of you good at math have already scrolled down to the formulas by now. But for those of you still here, those still struggling to overcome their mathphobia, as I once did, I have one more thing to say to you:

In July 1998 I was a high school dropout.

In July 2008 I was a college dropout.

As of July 2018 I am currently in grad school.

I do not tell you this to impress you, but rather, because I stand as *living proof* that *anyone* can learn this stuff, so long as they are willing to put in the work. No matter which body you find yourself in, no matter where on Earth that body calls home, these tools are available to **you**. No one on this planet can stop you, except for yourself.

Two final things to remember before we dig in:

- Education is not a destination, it is a lifelong journey
- Anything worth doing is worth doing right

All are welcome to compete in the marketplace of ideas. May the best ideas win, everything else be damned.

You got this.

Now let's get to work...

Let's start with some junior high math; the formula for a line:

$$ y = mx + b $$

The $y$ value on the left axis is calculated by taking our $x$ input, multiplying it by our slope $m$, and adding our y-intercept $b$. To offer a concrete example, below we have visualized $ y = \frac{1}{2}x + 2 $:

Some examples:

If $ x = 2 $ then $ y = \frac{1}{2}2 + 2 = 3 $.

If $ x = 3 $ then $ y = \frac{1}{2}3 + 2 = 3.5 $.

If $ x = 0 $ then $ y = \frac{1}{2}0 + 2 = 2 $.
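The three examples above can be checked with a few lines of Python, using the slope $\frac{1}{2}$ and y-intercept $2$ from our visualized line:

```
# the line y = (1/2)x + 2
m = 0.5  # slope
b = 2.0  # y-intercept

def line(x):
    return m * x + b

print(line(2))  # 3.0
print(line(3))  # 3.5
print(line(0))  # 2.0
```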

Consider the following, where our $y$ axis represents weight in pounds and our $x$ axis represents height in inches:

If I told you a given person was 70 inches tall, how much would you guess they weigh?

Did you say a bit more than 135 pounds?

If so, congratulations! You just used a line to make a prediction, and this is what linear regression is all about.

$$ \hat{y} = predict(x) $$

Our goal is to reduce the above line formula into a *predict function*. Given any height (in inches) as input ($x$), this function will return $ \hat{y} $ (y *hat*), our predicted weight in pounds.

Once we define our slope $m$ and y-intercept $b$, we can easily calculate $ \hat{y} $ for any value of $x$.

Until now, $ x $ and $ y $ have each represented a single number. However, the above graph is based on several observations, not just one. To accommodate this fact, let us define $ \vec{x} $ as a *vector* of heights in inches, and $ \vec{y} $ as a *vector* of corresponding weights in pounds.

Programmers familiar with lists, arrays or matrices will recognize the above syntax, and the concept is the same. A single variable $ x $ now represents a *series* of numbers. And in math, as in programming, we can reference a single element according to its *index*. For example, here is the height and weight of the first observation:

Remember: order matters. The first $ x $ should correspond to the first $ y $, and so on.
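In Python terms, here are the height and weight vectors used throughout this article, as parallel lists (note that math notation typically indexes from 1, while Python indexes from 0):

```
# heights (inches) and corresponding weights (pounds)
x = [66, 63, 74, 67, 70]
y = [113, 123, 140, 136, 145]

# the "first observation" (x_1, y_1) lives at index 0
print(x[0], y[0])  # 66 113
```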

In order to calculate our regression line, and more importantly, understand how well it models the data, we must understand the residual, which is simply the difference between the actual value and our predicted value.

$$ residual_i = y_i - \hat{y_i} $$

We can replace $ \hat{y} $ with the line formula from above:

$$ residual_i = y_i - (mx_i + b) $$

The $ i $ subscript refers to the $i$th element of each vector. For example, our 3rd residual:

$$ residual_3 = y_3 - (mx_3 + b) $$

How well does this line fit this data? It may be tempting to simply sum these residuals. However, that can cause a problem when residuals cancel each other out. Consider the following:

$$ residual_2 = 123 - 120 = 3 $$

$$ residual_3 = 140 - 145 = -5 $$

These two residuals sum to -2. That would suggest a "better than perfect" fit, which doesn't make any sense.

The sum of squared error resolves this issue by multiplying each residual by itself, which turns all negatives into positives.

$$ SSE = \sum_{i=1}^n (y_i - (mx_i + b))^2 $$

This formula is the same as the above residual formula, squared. But what about this $ \sum $ doohicky? That's called a *sigma*. Allow me to translate this math formula directly into python:

```
sse = 0
for i in range(len(x)):
    sse += (y[i] - ((m * x[i]) + b)) ** 2
```

All this $ \sum $ means is to loop over all items in both of our equal length vectors ($\vec{x}$ and $\vec{y}$), and sum the total result.
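If numpy is handy, the same sum can be written without an explicit loop. A small sketch using the height/weight data, with the (rounded) slope and intercept this article derives later:

```
import numpy as np

x = np.array([66, 63, 74, 67, 70])       # heights in inches
y = np.array([113, 123, 140, 136, 145])  # weights in pounds
m, b = 2.19, -17  # rounded parameters derived later in the article

# loop version, translated directly from the sigma
sse_loop = 0
for i in range(len(x)):
    sse_loop += (y[i] - ((m * x[i]) + b)) ** 2

# vectorized version: the same sum, no explicit loop
sse_vec = np.sum((y - (m * x + b)) ** 2)

print(sse_loop, sse_vec)
```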

Let's visualize what's going on here:

Each of these blue dots represents one of our ($ x_i, y_i $) pairs. The black line is our regression line, based on our slope $ m $ and y-intercept $ b $. The green lines show positive residuals, the red lines show negative residuals. The sum of squared error is simply the sum of all of these red and green lines, after we square them to remove the negatives.

But how do we define $ m $ and $ b $ in the first place?

We want the values of $ m $ and $ b $ that minimize our sum of squared error:

$$ SSE = \sum_{i=1}^n (y_i - (mx_i + b))^2 $$

To find this, we begin by expanding this polynomial. Khan Academy offers an excellent step-by-step breakdown of the process for those who need it.

$$ SSE = \sum_{i=1}^n(y_i^2) - 2m(\sum_{i=1}^n(x_i*y_i)) - 2b(\sum_{i=1}^n y_i) + m^2(\sum_{i=1}^n x_i^2) + 2mb(\sum_{i=1}^n x_i) + nb^2 $$

Given all these *sigmas*, this formula is starting to look pretty awful. Let's consider the first term, $ \sum_{i=1}^n(y_i^2) $.

Is there another way to derive this same number? Yes.

$$ \sum_{i=1}^n(y_i^2) = n(\bar{y^2})$$

The sum of all $ y_i^2 $ is equal to $ n $ times the mean (average) value of $ y^2 $. Consider this in different notation:

$$ (y_1^2+y_2^2+...+y_n^2) = n(\frac{y_1^2+y_2^2+...+y_n^2}{n})$$

Replacing these sums with $ n $ times their equivalent means allows us to go from this:

$$ SSE = \sum_{i=1}^n(y_i^2) - 2m(\sum_{i=1}^n(x_i*y_i)) - 2b(\sum_{i=1}^n y_i) + m^2(\sum_{i=1}^n x_i^2) + 2mb(\sum_{i=1}^n x_i) + nb^2 $$

To this:

$$ SSE = n(\bar{y^2}) - 2mn(\bar{xy}) - 2bn(\bar{y}) + m^2n(\bar{x^2}) + 2mbn(\bar{x}) + nb^2 $$

From here we need to delve into a bit of calculus. Specifically, we need to find the *partial derivatives* ($ \frac{\partial SSE}{\partial m} $ and $ \frac{\partial SSE}{\partial b} $) and set them to 0. Khan Academy offers complete video instructions for those who need a refresher, or are new to multivariate calculus.

The *partial derivative* ($ \frac{\partial SSE}{\partial m} $) uses the same rules of a regular derivative ($\frac{dSSE}{dm} $), except everything aside from our target variable is treated as a constant. For example, when calculating the *partial derivative* of $ SSE $ with respect to $ m $ we treat $ b $, $ n $, $ x $ & $ y $ as constants. The derivative of a constant is always 0, which has been included below to make these equations more readable:

If the above calculus confused you, good news! From here we just need some more algebra.

"Futurama", FOX Broadcasting

We divide the *partial derivatives* by $ 2n $ and drop the 0's:

If we rearrange terms, this starts to resemble the line formula again. Let's start with the second *partial derivative*:

We can do the same with the other *partial derivative*:

We can now solve for $ m $ by treating these as a system of equations, and subtracting the bottom from the top:

$$ \begin{alignat*}{5} & & \bar{y} & = & m(\bar{x}) & + b \\ & - & \frac{\bar{xy}}{\bar{x}} & = & m(\frac{\bar{x^2}}{\bar{x}}) & + b \\ \hline & & \bar{y}-\frac{\bar{xy}}{\bar{x}} & = & m(\bar{x}-\frac{\bar{x^2}}{\bar{x}}) & \\ \end{alignat*} $$

Now we solve for $ m $ by dividing both sides by $ (\bar{x}-\frac{\bar{x^2}}{\bar{x}}) $:

$$ m = \frac{\bar{y}-\frac{\bar{xy}}{\bar{x}}}{\bar{x}-\frac{\bar{x^2}}{\bar{x}}} $$

Clean it up by multiplying the numerator and denominator by $ \bar{x} $ (multiplying by 1 in the form of $ \frac{\bar{x}}{\bar{x}} $):

$$ m = \frac{(\bar{x})(\bar{y})-(\bar{xy})}{(\bar{x})^2-(\bar{x^2})} $$

Once we have $ m $, finding $ b $ is considerably easier:

$$ \bar{y} = m(\bar{x}) + b $$

Subtract $ m(\bar{x}) $ from both sides:

$$ \bar{y} - m(\bar{x}) = b $$

Rearrange terms:

$$ b = \bar{y} - m(\bar{x}) $$

Now that we have the formulas to find $ m $ and $ b $, we can do regression with some more junior high math:

$$ m = \frac{(\bar{x})(\bar{y})-(\bar{xy})}{(\bar{x})^2-(\bar{x^2})} $$

Terms with a bar over top are just *means* (averages).

What about this $(\bar{xy})$ doohicky?

$$ \vec{x} = [66,63,74,67,70] $$ $$ \vec{y} = [113,123,140,136,145] $$ $$ (\bar{xy}) = \frac{((66*113)+(63*123)+(74*140)+(67*136)+(70*145))}{5} $$

Note that the bottom two terms are different:

$$ (\bar{x})^2 = (\bar{x}*\bar{x}) = (\frac{66+63+74+67+70}{5})^2 $$ $$ (\bar{x^2}) = \frac{(66^2+63^2+74^2+67^2+70^2)}{5} $$

Once we find all of our means we can just pop them into the formula:

$$ m = \frac{(\bar{x})(\bar{y})-(\bar{xy})}{(\bar{x})^2-(\bar{x^2})} \approx 2.19 $$

Note the squiggly equal sign ($\approx$). This means an *approximate value* (since I am rounding to 2.19 here).

We find $ b $ with more substitution:

$$ b = \bar{y} - m(\bar{x}) $$ $$ b = \bar{y} - 2.19(\bar{x}) $$ $$ b \approx -17 $$

And now we have our prediction function:

$$ \hat{y} = mx + b $$ $$ \hat{y} = 2.19x - 17 $$

Given an input of 70 inches, it returns a prediction of about 136 pounds:

$$ 136 \approx 2.19(70) - 17 $$

Let's pause for a moment and connect the above mathematics to some key terms used in the field of machine learning.
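Before we do, here is the entire hand calculation reproduced in a few lines of Python, using the same height/weight data:

```
x = [66, 63, 74, 67, 70]       # heights in inches
y = [113, 123, 140, 136, 145]  # weights in pounds

def mean(v):
    return sum(v) / len(v)

x_bar = mean(x)
y_bar = mean(y)
xy_bar = mean([xi * yi for xi, yi in zip(x, y)])
x2_bar = mean([xi ** 2 for xi in x])

# m = (x̄·ȳ − x̄y) / (x̄² − x̄²-bar), then b = ȳ − m·x̄
m = (x_bar * y_bar - xy_bar) / (x_bar ** 2 - x2_bar)
b = y_bar - m * x_bar

print(round(m, 2), round(b, 2))  # roughly 2.19 and -17
print(round(m * 70 + b))         # predicted weight at 70 inches: 136
```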

**The Model**

*aka univariate linear regression*

This **model** accepts a single **feature** ($x$) and uses the **parameters** ($m$ and $b$) to calculate our predicted value ($\hat{y}$).

**The Cost Function**

*aka the error function, squared error, sum of squared error, sum of squares*

This **cost function** sums the squared difference between the actual value ($y$) and predicted value ($\hat{y}$) across all single-variable observations in the **vector** ($ \vec{x} $), for any given **parameters** ($m$ and $b$).

**The Method for Estimating Parameters**

*aka fitting the model, training the model, optimizing our parameters*

We found the $m$ and $b$ **parameters** with the smallest total cost (least error) by taking the **partial derivative** of our **cost function** with respect to $b$, and with respect to $m$, and setting both equal to 0. In machine learning it is common (but not necessary) to use **iterative** and **stochastic** methods instead (e.g. **gradient descent**).
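As a contrast with our deterministic method, here is a rough, hypothetical sketch of gradient descent on the same single-variable data. The learning rate and iteration count are arbitrary choices of mine, not values from this article, and with this tiny learning rate the intercept converges very slowly; this illustrates the mechanics, not a tuned optimizer:

```
import numpy as np

x = np.array([66, 63, 74, 67, 70], dtype=float)       # heights
y = np.array([113, 123, 140, 136, 145], dtype=float)  # weights

def sse(m, b):
    return np.sum((y - (m * x + b)) ** 2)

m, b = 0.0, 0.0   # arbitrary starting parameters
lr = 1e-5         # arbitrary (small) learning rate

start = sse(m, b)
for _ in range(5000):
    residuals = y - (m * x + b)
    # partial derivatives of SSE with respect to m and b
    grad_m = -2 * np.sum(x * residuals)
    grad_b = -2 * np.sum(residuals)
    # step downhill against each gradient
    m -= lr * grad_m
    b -= lr * grad_b

print(m, b, sse(m, b) < start)  # cost decreases as we iterate
```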

All of the above math breaks into three distinct components.

"The Lego Movie", Warner Bros. (2014)

For example, we could replace our linear regression **model** with an SVM or kNN. And instead of using our deterministic method to **estimate our parameters**, we could use gradient descent instead. And no matter which method or algorithm we use to **optimize parameters**, it will need a **cost function** to determine how well any given **parameters** fit any given data once entered into our selected **model**.

In the previous example we used height (our independent, $x$ variable) to predict weight (our dependent, $y$ variable). In this example, in addition to height, we will add the corresponding ages and incomes for each observation. This same math can extend to any number of features, which is why we call it *multivariate*.

Recall our single-variable model:

$$ \hat{y} = mx + b $$

We can imagine the multi-variable model like so:

$$ \hat{y} = m_1x_1 + m_2x_2 + m_3x_3 + b $$

Each of our $ x $ features is multiplied by a corresponding slope $ m $, and these terms are summed together with $ b $.

Although the concept is the same, the python sklearn documentation has it as:

$$ \hat{y}(\vec{w},\vec{x}) = w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p$$

In this notation, $ \vec{w} $ and $ \vec{x} $ are both vectors, and unlike our single-variable notation, each $ \vec{x} $ vector represents all features from a single observation. Consider this in table form:

Observation | $ x_1 $ (height) | $ x_2 $ (age) | $ x_3 $ (income)
---|---|---|---
1 | 66 | 35 | 75125
2 | 63 | 22 | 26750
... | ... | ... | ...
5 | 70 | 55 | 110500

To calculate $ \hat{y} $ (our predicted value) for our first observation (row 1) we provide an $ \vec{x} $ vector like so:

$$ \vec{x} = [66,35,75125] $$

We can plug this into the above formula:

$$ \hat{y} = w_0 + w_1(66) + w_2(35) + w_3(75125) $$

What about our $ \vec{w} $ vector? It begins with our intercept ($ b $ in the single-variable formula is the same as $w_0$ here). The rest of our $ \vec{w} $ vector is comprised of the coefficients (slopes) for each of our three $ x $ features. In the single-variable example above our mission was to define $ m $ and $ b $; here our mission is to define our $ \vec{w} $ vector instead.

As with the single-variable example above, we will use a deterministic method to optimize our parameters. Specifically, we will use the **least squares** method, which is the same default approach used by the LinearRegression class from the python sklearn package and the lm function from the R stats package.

In order to calculate least squares, we use linear algebra, which means changing up our notation yet again:

$$ \vec{y} = X\vec{b} + \vec{\epsilon} $$

This still resembles the basic line formula, except $\vec{y}$, $\vec{b}$, and $\vec{\epsilon}$ are vectors, and capital $X$ is a matrix of all features. In this notation, $\vec{b}$ holds our coefficients ($\vec{w}$ above) and $\vec{\epsilon}$ (epsilon) holds our errors.

If we expand this formula it looks like this:

$$ \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \\ y_{5} \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} \\ x_{2,1} & x_{2,2} & x_{2,3} \\ x_{3,1} & x_{3,2} & x_{3,3} \\ x_{4,1} & x_{4,2} & x_{4,3} \\ x_{5,1} & x_{5,2} & x_{5,3} \end{bmatrix} * \begin{bmatrix} b_{1} \\ b_{2} \\ b_{3} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix} $$

You will notice that the items in the $X$ matrix each have two indexes; the first refers to the row number, the second to the feature number. In this example: $x_{1,1}$ is the height of the 1st observation, $x_{3,2}$ is the age of the 3rd observation, and $x_{5,3}$ is the income of the 5th observation. Recall the above table:

Observation | $ x_1 $ (height) | $ x_2 $ (age) | $ x_3 $ (income)
---|---|---|---
1 | 66 | 35 | 75125
... | ... | ... | ...
5 | 70 | 55 | 110500

As before, our $\vec{y}$ vector contains the dependent variable (the thing we are trying to predict), in this example, weight in pounds. The first column of the $X$ matrix is the same as the $\vec{x}$ vector of the single-variable example; here it is joined by the other two features in parallel columns. To make this crystal clear, here is the same expanded formula with known values filled in:

$$ \begin{bmatrix} 113 \\ 123 \\ 140 \\ 136 \\ 145 \end{bmatrix} = \begin{bmatrix} 66 & 35 & 75125 \\ 63 & 22 & 26750 \\ 74 & 65 & 11500 \\ 67 & 25 & 135600 \\ 70 & 55 & 110500 \end{bmatrix} * \begin{bmatrix} b_{1} \\ b_{2} \\ b_{3} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix} $$

Key point: we can accommodate more observations by adding more rows to this formula, and accommodate more features by adding more columns to our $X$ matrix. This can be expanded to any size we wish; this example uses 5 observations and 3 features. And just as with regular algebra, and showboating globetrotter algebra, linear algebra allows us to manipulate equations based on the relationships between unknown variables to extract new insights.

As with the single-variable example, we will use the sum of squared error here too. Given that our linear algebra formula contains a specific error term ($\epsilon$), we just need to square and sum it:

$$ SSE = \sum_{i=1}^n \epsilon_i^2 $$

However, due to the rules of matrix multiplication, we cannot multiply the $\epsilon$ vector by itself until we *transpose* it. For those who need it, Khan Academy offers an excellent lesson on matrix multiplication. I also find this matrix calculator handy for checking my work. Key point: in linear algebra, the shapes of matrices and vectors matter a lot.

Recall that in linear algebra notation our $\epsilon$ vector looks like this:

$$ \vec{\epsilon} = \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix} $$

We denote *transpose* with a superscript capital T ($\vec{\epsilon}^T$), or sometimes with a little dash like so ($\vec{\epsilon}'$):

Ergo:

$$ \vec{\epsilon}^2 = \vec{\epsilon}^T\vec{\epsilon} = \begin{bmatrix} \epsilon_{1} & \epsilon_{2} & \epsilon_{3} & \epsilon_{4} & \epsilon_{5} \\ \end{bmatrix} * \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix} $$

Or to put it another way:

$$ SSE = \sum_{i=1}^n \epsilon_i^2 = \vec{\epsilon}^T\vec{\epsilon} = \begin{bmatrix} \epsilon_{1} & \epsilon_{2} & \epsilon_{3} & \epsilon_{4} & \epsilon_{5} \\ \end{bmatrix} * \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix} $$

Different notation, different rules of multiplication, but the same concept.
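A quick numpy sanity check of this identity, using a made-up error vector (these are illustrative numbers, not the residuals from our data):

```
import numpy as np

# a made-up 5x1 column vector of errors, just to illustrate the identity
eps = np.array([[1.0], [-2.0], [3.0], [-4.0], [5.0]])

sse_sum = np.sum(eps ** 2)       # sum of squared errors, the sigma way
sse_dot = (eps.T @ eps).item()   # transpose, then matrix-multiply (1x5 @ 5x1)

print(sse_sum, sse_dot)  # both 55.0
```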

Nothing to be afraid of.

"The Simpsons" via Numberphile / YouTube / FOX Broadcasting

Aside from the details of linear algebra, the logic is essentially the same here as with our single-variable example.

Start with the formula:

$$ \vec{y} = X\vec{b} + \vec{\epsilon} $$

Solve for $\vec{\epsilon}$:

$$ \vec{\epsilon} = \vec{y} - X\vec{b}$$

Square it (*transpose*):

$$ SSE = \vec{\epsilon}^T\vec{\epsilon} = (\vec{y} - X\vec{b})^T(\vec{y} - X\vec{b}) $$

We distribute the *transpose* inside the parentheses:

$$ SSE = (\vec{y}^T - \vec{b}^TX^T)(\vec{y} - X\vec{b}) $$

From here we can expand like any other polynomial:

$$ SSE = \vec{y}^T\vec{y} - \vec{b}^TX^T\vec{y} -\vec{y}^TX\vec{b} + \vec{b}^TX^TX\vec{b} $$

I am going to skip a few steps for brevity. See this post from Economic Theory Blog for details or page 7 of this paper by Andrew and Constable for a full proof. Key point: we use a few linear algebra tricks to simplify the equation further.

$$ SSE = \vec{y}^T\vec{y} - 2\vec{b}^TX^T\vec{y} + \vec{b}^TX^TX\vec{b} $$

Again our goal is to minimize our cost function, so again we need to find the partial derivative and set it to 0, this time with respect to $b$.

$$ \frac{\partial SSE}{\partial b} = -2X^T\vec{y} + 2X^TX\vec{b} = 0 $$

And from here we simplify further and solve for $b$:

$$ \vec{b} = (X^TX)^{-1}X^T\vec{y} $$

Now that we have a formula for $\vec{b}$, we can get our coefficients (slopes) and our intercept ($w_0$).

$$ \vec{b} = (X^TX)^{-1}X^T\vec{y} $$

Recall our $\vec{y}$ vector of weight (in pounds):

$$ \vec{y} = \begin{bmatrix} 113 \\ 123 \\ 140 \\ 136 \\ 145 \end{bmatrix} $$

Recall our $X$ matrix of three features: height (in inches), age (in years), and income (in dollars):

$$ X = \begin{bmatrix} 66 & 35 & 75125 \\ 63 & 22 & 26750 \\ 74 & 65 & 11500 \\ 67 & 25 & 135600 \\ 70 & 55 & 110500 \end{bmatrix}$$

In order to calculate our intercept ($w_0$) we will add a column of 1's to our $X$ matrix:

$$ X = \begin{bmatrix} 1 & 66 & 35 & 75125 \\ 1 & 63 & 22 & 26750 \\ 1 & 74 & 65 & 11500 \\ 1 & 67 & 25 & 135600 \\ 1 & 70 & 55 & 110500 \end{bmatrix}$$

In order to express the first term of our formula, we must *transpose* $X$:

$$ X^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 66 & 63 & 74 & 67 & 70 \\ 35 & 22 & 65 & 25 & 55 \\ 75125 & 26750 & 11500 & 135600 & 110500 \end{bmatrix} $$

Now we can multiply:

$$ X^TX = \begin{bmatrix} 5 & 340 & 202 & 359475 \\ 340 & 23190 & 14031 & 24314700 \\ 202 & 14031 & 9584 & 13432875 \\ 359475 & 24314700 & 13432875 & 37089188125 \end{bmatrix} $$

Now what about this ($^{-1}$) doohicky? This means we need the *inverse* of this new combined matrix ($X^TX$). Working out the *inverse* of larger matrices can get tricky. Khan Academy has a complete video series on the subject. Numbers are rounded here for cleaner notation:

Our second term:

$$ X^T\vec{y} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 66 & 63 & 74 & 67 & 70 \\ 35 & 22 & 65 & 25 & 55 \\ 75125 & 26750 & 11500 & 135600 & 110500 \end{bmatrix} * \begin{bmatrix} 113 \\ 123 \\ 140 \\ 136 \\ 145 \end{bmatrix} $$

More matrix multiplication:

$$ X^T\vec{y} = \begin{bmatrix} 657 \\ 44829 \\ 27136 \\ 47853475 \end{bmatrix} $$

Now we multiply terms together:

$$ (X^TX)^{-1}X^T\vec{y} = \begin{bmatrix} 434.7838406064 & -7.36983664257 & 1.5438829428 & 0.0000583096 \\ -7.3698364247 & 0.1256055225 & -0.0269254232 & -0.0000011621 \\ 1.5438829428 & -0.0269254232 & 0.0065308471 & 0.0000003227 \\ 0.0000583096 & -0.0000011621 & 0.0000003227 & 0.0000000001 \end{bmatrix} * \begin{bmatrix} 657 \\ 44829 \\ 27136 \\ 47853475 \end{bmatrix} $$

And voilà! Here is our $\vec{b}$ vector (rounded):

$$ \vec{b} = (X^TX)^{-1}X^T\vec{y} = \begin{bmatrix} -44.29148 \\ 2.52675 \\ -0.04592 \\ 0.00008 \end{bmatrix}$$

The first item in our $\vec{b}$ vector is the intercept ($w_0$). The rest of the $\vec{b}$ vector contains our coefficients ($w_1, w_2, w_3 $).
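We can check this entire hand calculation with numpy. A small sketch; I use `np.linalg.solve` on the system $(X^TX)\vec{b} = X^T\vec{y}$ rather than explicitly inverting, which is numerically safer but gives the same answer here:

```
import numpy as np

# X matrix with a leading column of 1's for the intercept
X = np.array([[1, 66, 35, 75125],
              [1, 63, 22, 26750],
              [1, 74, 65, 11500],
              [1, 67, 25, 135600],
              [1, 70, 55, 110500]], dtype=float)
y = np.array([113, 123, 140, 136, 145], dtype=float)

# b = (X^T X)^-1 X^T y, solved as a linear system
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b.round(5))  # intercept first, then the three coefficients

# predicted weight for the first observation (row 1)
pred = X[0] @ b
print(round(pred, 2))  # about 126.85
```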

Recall the above model formula:

$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + w_3x_3 $$

Substitute our intercept ($w_0$) and coefficients:

$$ \hat{y} = −44.29148 + 2.52675(x_1) −0.04592(x_2) + 0.00008(x_3) $$

Let us predict the weight (in pounds) of a person with a height of 66 inches, age of 35 years, and income of $75,125:

$$ \hat{y} = −44.29148 + 2.52675(66) −0.04592(35) + 0.00008(75125) $$

This new multivariate linear regression model returns a prediction of roughly 127 pounds.

$$ 127 \approx −44.29148 + 2.52675(66) −0.04592(35) + 0.00008(75125) $$

Just to make everything crystal clear:

**The Model**

*aka multivariate linear regression*

This **model** accepts a **features vector** ($\vec{x}$) and uses a **parameter vector** ($\vec{w}$) to calculate our predicted value ($\hat{y}$) for any single observation (row of data).

**The Cost Function**

*aka the error function, squared error, sum of squared error, sum of squares*

This **cost function** sums our squared error ($\vec{\epsilon}^2$) across all observations in the **feature matrix** ($ X $), for any given **parameter vector** ($ \vec{b} $) using linear algebra.

**The Method for Estimating Parameters**

*aka fitting the model, training the model, optimizing our parameters, least squares method, ordinary least squares*

We used the **least squares method** to minimize our **cost function** and find the **parameter vector** with the smallest total cost (least error) by taking the **partial derivative** of our **cost function** with respect to $b$ and setting it equal to 0. Once again we used a **deterministic** method as opposed to an **iterative** and **stochastic** method like **gradient descent**.

Now that we've paid our dues, let's make python do this math for us:

```
# load dependencies
from sklearn import linear_model
import numpy as np
# define X and y and look at them
X = np.matrix([[66, 35, 75125],[63,22,26750],[74,65,11500],[67,25,135600],[70,55,110500]])
y = np.matrix([[113],[123],[140],[136],[145]])
print(X)
print(y)
# train our model
model = linear_model.LinearRegression()
model.fit(X,y)
# we get the same parameters
print("COEF:",model.coef_)
print("INTERCEPT:",model.intercept_)
# we get the same prediction
X_input = np.matrix([[66,35,75125]])
prediction = model.predict(X_input)
print(prediction)
```

We get the same parameters:

COEF: [[ 2.52674720e+00 -4.59219855e-02 7.96706668e-05]]

INTERCEPT: [-44.29148412]

We get the same prediction:

[[126.85182057]]

And now we can do it with pencil and paper.

Assuming we lose power during the inevitable robot apocalypse, that may come in handy.

We've covered a lot of math so far, but what about $\infty$, $e$, and the rules of negative exponents?

First, let's take a step back and consider why we should even want to use *logistic* regression in the first place.

In the *linear* regression examples above, our prediction ($\hat{y}$) is a *continuous* value (weight in pounds), as opposed to a *discrete* (aka *categorical*) value. In *logistic* regression, our goal is to predict a *discrete* value: which *group* an observation belongs to. To better understand the difference, imagine if we divided our 5 observations into two groups: those who love broccoli (marked with 1), and those who don't (marked with 0). Consider in table form:

Observation | $ x_1 $ (height) | $ x_2 $ (age) | $ x_3 $ (income) | $ y $ (love broccoli)
---|---|---|---|---
1 | 66 | 35 | 75125 | 1
2 | 63 | 22 | 26750 | 1
... | ... | ... | ... | ...
5 | 70 | 55 | 110500 | 0

According to this table data, observation 1 and 2 love broccoli, and observation 5 does not.

Medical News Today

Before we had a vector of weights in pounds:

$$ \vec{y} = [113,123,140,136,145] $$

And we predicted a *continuous* weight in pounds:

Now we have a boolean vector (0's and 1's) representing love of broccoli:

$$ \vec{y} = [1,1,0,1,0] $$

And we want to predict which of these two groups (0 or 1) an observation belongs to.

$$ \hat{y} = 0 \hspace{1cm}\text{ or }\hspace{1cm} \hat{y} = 1 $$

In mathematics, we can define these *discrete* values like so:

$$ y \in \{0, 1\} $$

Good news! If you've been paying attention so far, this model should look familiar:

$$ \hat{y}(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p)$$

This is the same multivariate model from before, with two key differences:

1. Everything is wrapped in a $g()$ function.

2. The output ($\hat{y}$) as defined here is not yet in proper 0 or 1 format.

$$ \hat{y} \neq y \in \{0, 1\} $$

Let's look at these issues one at a time.

That $g()$ function above? That's the *logistic function*, also known as the *sigmoid function*, which is our *activation function* for this model. As you might have guessed, this is where *logistic* regression gets its name. Whatever you call it, this function looks like this:

$$ g(z) = \frac{1}{1+e^{-z}} $$

This formula contains ($e^{-z}$), which is a tricky term if you've never seen something like this before. This is *Euler's number* raised to a *negative power*. And right about now you might be asking: *what the hell is a negative power*? Don't panic.

When $z$ is negative, $-z$ switches back to positive. Recall how positive powers work:

$$ 2^2=4 \hspace{1cm} 2^3 = 8 \hspace{1cm} 3^3 = 27 $$

Negative powers work like so:

$$ 2^{-2}=\frac{1}{4} \hspace{1cm} 2^{-3}=\frac{1}{8} \hspace{1cm} 3^{-3}=\frac{1}{27} $$

Easy enough. Now, what about this $e$ doohicky?

Have you ever seen little kids arguing over which is the *biggest number*? Usually one clever kid will exclaim: "infinity!", only to be outwitted by the first to shout "infinity plus one!". Well, I've got some bad news, kids.

"Infinity plus one" doesn't mean anything, mathematically speaking, because $\infty$ is a *limit*, not a *number*.

Who cares? What does that have to do with $e$?

$$ e = \mathop {\lim }\limits_{n \to \infty } \left( {1 + \frac{1}{n}} \right)^n $$

If you are unfamiliar with calculus, the above may cause a panic attack. That's normal. Take a deep breath. All this means is that $e$ is equal to the *limit* of this function as $n$ approaches $\infty$. What does this mean?

Let $n = 5$ for the sake of example:

$$ ({1 + \frac{1}{5}})^5 \approx 2.48 $$

Let $n = 50$:

$$ ({1 + \frac{1}{50}})^{50} \approx 2.69 $$

Let $n = 500$:

$$ ({1 + \frac{1}{500}})^{500} \approx 2.71 $$

Let $n = 5000$:

$$ ({1 + \frac{1}{5000}})^{5000} \approx 2.72 $$

Now imagine if we continued increasing $n$ towards $\infty$. Of course, we will never actually *reach* $\infty$, because $\infty$ isn't a number. But in some cases, like this one, we will *converge* towards a specific number.

As we continue to increase $n$ to 5000 and beyond, we will continue to gain precision to a greater number of decimal points. In reality, $e$ just keeps on going and going, just like $\pi$. At a certain point we just use a given decimal value ($\pi \approx 3.14$) because it's "good enough" for a solid approximation ($e \approx 2.72$).
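We can watch this convergence happen in a few lines of Python, and compare it against the `math` module's built-in value of $e$:

```
import math

# (1 + 1/n)^n creeps towards e as n grows
for n in [5, 50, 500, 5000, 5_000_000]:
    approx = (1 + 1 / n) ** n
    print(n, round(approx, 5))

print(round(math.e, 5))  # the limit these values converge towards
```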

Now that we understand $e$ and $\infty$, we can take another look at our model:

$$ \hat{y}(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p)$$

Specifically, let's consider this part:

$$w_0 + w_1x_1 + w_2x_2 + w_3x_3$$

Given that $x_1$, $x_2$ and $x_3$ might be *any number*, in theory this calculation could return anything from $-\infty$ to $\infty$. And *whatever* it returns will go into our logistic function as $z$.

Substitute $e$ with our decimal approximation:

$$ g(z) \approx \frac{1}{1+2.72^{-z}} $$

Let $z = 5$ for the sake of example:

$$ g(5) \approx \frac{1}{1+2.72^{-5}} $$

Don't let the negative exponent stop you:

$$ g(5) \approx \frac{1}{1+\frac{1}{2.72^5}} $$

Find $2.72$ to the $5th$ power:

$$ g(5) \approx \frac{1}{1+\frac{1}{148.882797363}} $$

Convert the 1 in the denominator to another fraction:

$$ g(5) \approx \frac{1}{\frac{148.882797363}{148.882797363}+\frac{1}{148.882797363}} $$

Add the fractions in the denominator:

$$ g(5) \approx \frac{1}{\frac{149.882797363}{148.882797363}} $$

Evaluate the fraction in the denominator via division to convert to a decimal:

$$ g(5) \approx \frac{1}{1.00671669271} $$

Divide our numerator by our denominator:

$$ g(5) \approx 0.993... $$

Long story short, the logistic function *converges* to 1 as it approaches $\infty$ *from the left*:

It also *converges* to 0 as it approaches $-\infty$ *from the right*:

Let's visualize what that means:

Note the distinct "S" shape, which is why this is sometimes called the *sigmoid* function. We can see it takes *any* input and "squishes" it down to an output between 0 and 1. And herein lies the magic of *logistic* regression.
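To make the "squishing" concrete, here is the logistic function as two lines of python (a sketch using the exact $e$ from the math module, rather than our 2.72 approximation):

```
import math

def g(z):
    # the logistic (sigmoid) function: squashes any real number into (0, 1)
    return 1 / (1 + math.exp(-z))

for z in [-10, -5, 0, 5, 10]:
    print(z, round(g(z), 4))
```

Every output lands strictly between 0 and 1, with $g(0)$ sitting exactly at 0.5.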

Let's return to python for a moment:

```
# load dependencies
from sklearn import linear_model
import numpy as np
# define X and y
X = np.matrix([[66, 35, 75125],[63,22,26750],[74,65,11500],[67,25,135600],[70,55,110500]])
y = [1,1,0,1,0]
# train our model
model = linear_model.LogisticRegression()
model.fit(X,y)
# print our parameters
print("COEF:",model.coef_)
print("INTERCEPT:",model.intercept_)
```

According to the documentation, the above code uses the liblinear solver to estimate our parameters (our needed $w$ vector). By default, liblinear relies on coordinate descent, which is an *iterative* algorithm. (Note that newer versions of scikit-learn default to a different solver, lbfgs, which is also iterative.) This means it must work through a calculation again and again to get to the right answer. Explaining this math in the same level of detail deserves an article of its own. In the meantime I will direct you to Andrew Ng's lecture on the subject from his Coursera class on machine learning.
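Coordinate descent itself is beyond our scope, but if you want a taste of what *iterative* means, here is a bare-bones sketch of one of its cousins, stochastic gradient descent, fit to our broccoli data. To be clear, this is *not* what liblinear does internally; it's a simplified stand-in, and I've standardized the features so the steps behave:

```
import math

# our features (age, sugar intake, income) and labels from above
X = [[66, 35, 75125], [63, 22, 26750], [74, 65, 11500],
     [67, 25, 135600], [70, 55, 110500]]
y = [1, 1, 0, 1, 0]

# standardize each feature column (mean 0, std 1) so the steps are well-behaved
cols = list(zip(*X))
means = [sum(c) / len(c) for c in cols]
stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 for c, m in zip(cols, means)]
Xs = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def g(z):
    # the logistic (sigmoid) function
    return 1 / (1 + math.exp(-z))

w = [0.0, 0.0, 0.0, 0.0]  # w[0] is the intercept
lr = 0.5                  # learning rate: how big each nudge is

# again and again: nudge every parameter against its prediction error
for _ in range(2000):
    for row, target in zip(Xs, y):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
        error = g(z) - target
        w[0] -= lr * error
        for j in range(3):
            w[j + 1] -= lr * error * row[j]

preds = [round(g(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))) for row in Xs]
print("PREDS:", preds)
print("TRUE :", y)
```

After a couple of thousand passes, the predictions match the training labels. That repeated nudging is the essence of every iterative solver.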

For now, let's focus on the model itself, and let python optimize the parameters ($w_0, w_1, w_2, w_3$) for us:

COEF: [[ 2.12455983e-01 -3.25565862e-01 2.69547674e-07]]

INTERCEPT: [0.00471522]

Notice how small these coefficients are (rounded):

$$ \vec{w} = \begin{bmatrix} 0.00471522 \\ 0.21245598 \\ -0.3255659 \\ 0.00000027 \end{bmatrix} $$

If we take another peek at the sigmoid graph this makes sense:

If we follow the $z$ input on the above graph, we can see that any value below $-5$, or above $5$, is overkill for the logistic function. These small coefficients take our raw feature input $\vec{x}$ and calculate something more reasonable given this $-5$ to $5$ range.

Let's go back to python and make a prediction based on our first observation:

```
# make a prediction
X_input = np.matrix([[66,35,75125]])
prediction = model.predict(X_input)
print("PRED:",prediction)
# what is our probability?
prob = model.predict_proba(X_input)
print("PROB:",prob)
```

What does our model think?

PRED: [1] PROB: [[0.06585018 0.93414982]]

Our model says this person loves broccoli. But where did this probability come from?

Recall our model formula:

$$ \hat{y}(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p)$$

Substitute the parameters vector $\vec{w}$ with intercept and coefficients:

$$ \hat{y}(\vec{x}) = g(0.00471522 + 0.21245598(x_1) -0.3255659(x_2) + 0.00000027(x_3))$$

Substitute the features vector $\vec{x}$ with all input from our first observation:

$$ \hat{y} = g(0.00471522 + 0.21245598(66) -0.3255659(35) + 0.00000027(75125))$$

Evaluate:

$$ \hat{y} = g(2.6522871499999967)$$

Express the logistic function:

$$ \hat{y} = \frac{1}{1+e^{-2.6522871499999967}} $$

Evaluate using the same steps from above:

$$ \hat{y} = 0.9341... $$

What we have here is the probability of this person loving broccoli. And because total probability always sums to $1$, we know that the probability of NOT loving broccoli is just $1-0.9341...=0.0659...$. More formally we can say:

$$ \hat{y}_1 = p(y=1|\vec{w},\vec{x}) = 0.93414982 $$

In words: "y-hat one equals the probability that y equals one given this parameters vector w and features vector x".

And knowing total probabilities always sum to 1, we can deduce:

$$ \hat{y}_0 = 1-p(y=1|\vec{w},\vec{x}) = 0.06585018 $$

Compare to what python gave us:

PROB: [[0.06585018 0.93414982]]

Python returns a list of all probabilities in order of class. In other words, class 0 (NOT "love broccoli") has a 0.06585018 probability. Class 1 ("love broccoli") has a 0.93414982 probability. And based on these probabilities, using whichever is the highest, we can return a prediction in our desired form:

$$ y \in \{0, 1\} $$

And thus we turn our original linear model into a model capable of classification, and even better, a model capable of providing a probability estimate for each of those classifications.
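Before we move on, we can replay that whole arithmetic chain in python, using the intercept and coefficients printed earlier (values rounded as displayed, so the last decimals may wiggle):

```
import math

# intercept and coefficients from our fitted model, as printed above
w0, w1, w2, w3 = 0.00471522, 0.21245598, -0.3255659, 0.00000027
x1, x2, x3 = 66, 35, 75125  # our first observation

z = w0 + w1 * x1 + w2 * x2 + w3 * x3
y_hat = 1 / (1 + math.exp(-z))

print("z      :", z)          # ≈ 2.65228715
print("p(y=1) :", y_hat)      # ≈ 0.9341
print("p(y=0) :", 1 - y_hat)  # ≈ 0.0659
print("PRED   :", 1 if y_hat > 0.5 else 0)
```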

In the above example, we measured a person's opinion of broccoli with a boolean variable. In the real world, most opinions are far more nuanced than this. Does logistic regression force us to reduce everything into such simple terms? Nope. Multiclass logistic regression allows us to divide observations into any number of classes (groups) we want.

Imagine we wanted to group our observations into those who love broccoli (1), hate broccoli (0), or are indifferent to broccoli (2).

$$ y \in \{0, 1, 2\} $$

Our $\vec{y}$ vector might look like this:

$$ \vec{y} = [1,2,2,1,0] $$

Those familiar with Futurama may notice a problem here:

Sadly, in our world, there is such a thing as 2, to say nothing of 3, 4, and 5. What if we defined groups like this?

$$ y \in \{0, 1, 2, 3, 4, 5, 6, 7\} $$

Won't these larger numbers mess up all our math from earlier? Let's find out.

Let's return to python for a moment:

```
# load dependencies
from sklearn import linear_model
import numpy as np
# define X and y
X = np.matrix([[66, 35, 75125],[63,22,26750],[74,65,11500],[67,25,135600],[70,55,110500]])
y = [1,2,2,1,0]
# train our model
model = linear_model.LogisticRegression()
model.fit(X,y)
# print our parameters
print("COEF:",model.coef_)
print("INTERCEPT:",model.intercept_)
```

This code is **identical** to the ($0, 1$) example, except we have changed $\vec{y}$ to include 2's here.

What do we get when we run it?

COEF: [[-3.14850877e-01  2.89620012e-01  7.96234349e-05]
 [ 8.87203204e-02 -7.16339822e-01  2.78939204e-04]
 [ 1.03631441e-01  4.03295133e-02 -1.64381858e-04]]

INTERCEPT: [-0.00533213  0.00357773  0.00162194]

Now we have 3 sets of coefficients, and 3 intercepts. What is going on here?

Let's predict for our first observation (still labeled as a 1, "love broccoli"):

```
# make a prediction
X_input = np.matrix([[66,35,75125]])
prediction = model.predict(X_input)
print("PRED:",prediction)
# what is our probability?
prob = model.predict_proba(X_input)
print("PROB:",prob)
```

This output looks more familiar:

PRED: [1] PROB: [[0.01062205 0.97071551 0.01866244]]

We predict this observation has a 0.01062205 probability for class 0 ("hate broccoli"), a 0.97071551 probability for class 1 ("love broccoli"), and a 0.01866244 probability for class 2 ("indifferent to broccoli").

Our final prediction is that they belong to group 1 ("love broccoli").
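Mechanically, that final prediction is nothing fancier than picking the class with the highest probability:

```
# predict_proba output for our first observation
probs = [0.01062205, 0.97071551, 0.01866244]

# the predicted class is the index of the largest probability
pred = max(range(len(probs)), key=lambda k: probs[k])
print("PRED:", pred)  # class 1, "love broccoli"
```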

But where are all these numbers coming from?

As above, we will leave estimating the parameters to python (for the same reasons), and focus on the model.

The key to multiclass logistic regression lies in *one v. rest*, sometimes called *one v. all*. The basic idea is this: instead of training one model, we train three. Each uses the same formula as the boolean ($0, 1$) version.

Originally we were asking:

$$ \text{Does this belong to group 0, 1 or 2?} $$

We have altered this into three yes or no (boolean) questions instead:

$$ \text{Does this belong to group 0?} $$ $$ \text{Does this belong to group 1?} $$ $$ \text{Does this belong to group 2?} $$

By rephrasing our question into something boolean, we can reuse our prior work.
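The relabeling itself is purely mechanical. Here is how those three boolean target vectors fall out of our multiclass $\vec{y}$:

```
# our multiclass labels
y = [1, 2, 2, 1, 0]

# one boolean target vector per group: 1 means "belongs", 0 means "does not"
y_is_0 = [1 if label == 0 else 0 for label in y]
y_is_1 = [1 if label == 1 else 0 for label in y]
y_is_2 = [1 if label == 2 else 0 for label in y]

print(y_is_0)  # [0, 0, 0, 0, 1]
print(y_is_1)  # [1, 0, 0, 1, 0]
print(y_is_2)  # [0, 1, 1, 0, 0]
```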

If you would prefer a more visual explanation, I will refer you to Andrew Ng's lecture on the subject.

Here are our three logistic regression models. Model indexes ($\hat{y}_0, \hat{y}_1, \hat{y}_2$) are matched to group for clarity:

$$ \hat{y}_0(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3)$$ $$ \hat{y}_1(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3)$$ $$ \hat{y}_2(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3)$$

Each model gets its *own* parameters vector $\vec{w}$ with intercept and coefficients, which comes from the python output above:

COEF: [[-3.14850877e-01  2.89620012e-01  7.96234349e-05]
 [ 8.87203204e-02 -7.16339822e-01  2.78939204e-04]
 [ 1.03631441e-01  4.03295133e-02 -1.64381858e-04]]

INTERCEPT: [-0.00533213  0.00357773  0.00162194]

Substitute into the corresponding three models:

$$ \hat{y}_0(\vec{x}) = g(-0.00533213 -0.314850877(x_1) + 0.28962001(x_2) + 0.00007962(x_3))$$ $$ \hat{y}_1(\vec{x}) = g(0.00357773 + 0.088720320(x_1) - 0.71633982(x_2) + 0.00027894(x_3))$$ $$ \hat{y}_2(\vec{x}) = g(0.00162194 + 0.103631441(x_1) + 0.04032951(x_2) - 0.00016438(x_3))$$

Substitute the features vector $\vec{x}$ with all input from our first observation. Note that all three models get the same $\vec{x}$ input.

$$ \hat{y}_0 = g(-0.00533213 -0.314850877(66) + 0.28962001(35) + 0.00007962(75125))$$ $$ \hat{y}_1 = g(0.00357773 + 0.088720320(66) - 0.71633982(35) + 0.00027894(75125))$$ $$ \hat{y}_2 = g(0.00162194 + 0.103631441(66) + 0.04032951(35) - 0.00016438(75125))$$

Evaluate:

$$ \hat{y}_0 = g(-4.667337162)$$ $$ \hat{y}_1 = g(1.742592650)$$ $$ \hat{y}_2 = g(-4.096217604)$$

Express the logistic function:

$$ \hat{y}_0 = \frac{1}{1+e^{--4.667337162}} $$ $$ \hat{y}_1 = \frac{1}{1+e^{-1.742592650}} $$ $$ \hat{y}_2 = \frac{1}{1+e^{--4.096217604}} $$

Double negatives become positive:

$$ \hat{y}_0 = \frac{1}{1+e^{4.667337162}} $$ $$ \hat{y}_1 = \frac{1}{1+e^{-1.742592650}} $$ $$ \hat{y}_2 = \frac{1}{1+e^{4.096217604}} $$

Evaluate using the same steps from above:

$$ \hat{y}_0 = 0.00931... $$ $$ \hat{y}_1 = 0.85102... $$ $$ \hat{y}_2 = 0.01636... $$

But wait! These numbers do not match the python output we got:

PRED: [1] PROB: [[0.01062205 0.97071551 0.01866244]]

What is going on here?

We have two different $\hat{y}_1$'s. This is confusing, so let's use $\vec{\hat{a}}$ to hold the probabilities we got from the math we just did ourselves, and $\vec{\hat{b}}$ to hold the probabilities we got from python:

$$ \hat{a}_0 = 0.00931... \hspace{1cm} \hat{b}_0 = 0.01062...$$ $$ \hat{a}_1 = 0.85102... \hspace{1cm} \hat{b}_1 = 0.97072...$$ $$ \hat{a}_2 = 0.01636... \hspace{1cm} \hat{b}_2 = 0.01866...$$

We can see that the class 1 values are the biggest in both sets, but $\hat{a}_1$ and $\hat{b}_1$ are answering slightly different questions.

$$ \hat{a}_0 = \text{Does this belong to group 0?} $$ $$ \hat{a}_1 = \text{Does this belong to group 1?} $$ $$ \hat{a}_2 = \text{Does this belong to group 2?} $$ $$ \vec{\hat{b}} = \text{Does this belong to group 0, 1 or 2?} $$

In other words, when we look at the $\hat{y}_1(\vec{w},\vec{x})$ model by itself we are asking "$\text{Does this belong to group 1?}$" and find there is a 0.85102 (85%) probability that the answer is "yes". This is the number $\hat{a}_1$.

When we look at all three $\hat{y}(\vec{w},\vec{x})$ models together we are asking "$\text{Does this belong to group 0, 1 or 2?}$" and find there is a 0.97072 (97%) probability that the answer is "group 1". This is the number $\hat{b}_1$.
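A quick python check makes the tension obvious: our hand-computed one-v-rest outputs do not even sum to 1, while python's do (values rounded):

```
a = [0.00931, 0.85102, 0.01636]  # one-v-rest probabilities we computed by hand
b = [0.01062, 0.97072, 0.01866]  # what python's predict_proba returned

print(sum(a))  # roughly 0.877 -- not a proper probability distribution
print(sum(b))  # roughly 1.0   -- a proper probability distribution
```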

This raises the question: how do we get from $\vec{\hat{a}}$ to $\vec{\hat{b}}$? We normalize:

$$ \hat{b}_1 = \frac{\hat{a}_1}{\sum\vec{\hat{a}}} $$

Or if you prefer:

$$ \hat{b}_1 = \frac{\hat{a}_1}{\sum_{i=0}^2 \hat{a}_i} $$

In order to get the bottom term of this fraction, we just sum all elements of our $\vec{\hat{a}}$ vector together:

$$ \hat{b}_1 = \frac{0.85102}{(0.00931+0.85102+0.01636)} $$

By now $\sum$ should feel like a familiar friend.

$$ \hat{b}_1 = \frac{0.85102}{0.87669} $$

Easy peasy.

$$ \hat{b}_1 \approx 0.97 $$

Assuming you haven't fallen into a math-induced coma, I hope you have found this article both useful and informative. If nothing else, I'd like to think this article helped a few fellow mathphobics overcome the fear of notation.

Of course we've only just scratched the surface. Both the LinearRegression() and LogisticRegression() classes can use many different cost functions and solvers to estimate their parameters. We've yet to go into detail on any iterative methods, such as coordinate descent or stochastic gradient descent. Given enough time, I could easily write another 10,000 pages on this. If there is significant interest in what I've done here, maybe I will.

In the meantime, the next time you find yourself baffled by an intimidating white paper, just remember this: no matter how scary an algorithm may appear at first, underneath is just our old pal math. You've already gone from algebra through calculus in this article alone. Why stop here?

And never be afraid to ask a stupid question.

It is far more shameful to remain stupid than to admit you have more to learn.
