## Introduction

Imagine you were being interviewed for a data science or machine learning job, and the hiring manager showed you the following code, and asked you the following questions. Could you answer them?

```python
# load dependencies
from sklearn import linear_model
import numpy as np

# define X and y
X = np.matrix([[66, 35, 75125],[63,22,26750],[74,65,11500],[67,25,135600],[70,55,110500]])
y = np.matrix([[113],[123],[140],[136],[145]])

# train our model
model = linear_model.LinearRegression()
model.fit(X,y)

# print our parameters
print("COEF:",model.coef_)
print("INTERCEPT:",model.intercept_)
```

Any programmer with a good work ethic and a solid grasp of algebra can get decently far. However, as we advance in the field of machine learning, it becomes obvious that a mastery of mathematics and statistics is not only beneficial, but necessary.

### The Pros and Cons of Being an Autodidact

There are two things I know to be true of every autodidact I have ever met:

1. Autodidacts eventually learn what the word autodidact means
2. Autodidacts who don't give up eventually become one of the best at whatever it is they are doing

If your education looks anything like mine, it consists of a jumbled mess of online courses, university, blog posts, work experience, forums, piles of certifications, endless Googling, a few brilliant peers, and the rest was copy and pasted directly from StackOverflow.

On the one hand, we autodidacts tend to be a rather fearless lot, plunging into white papers and beta code with reckless abandon. On the other hand, we sometimes find ourselves neck-deep in advanced materials before realizing we must have missed a necessary prerequisite somewhere along the way... the things we didn't know we didn't know...

If any of this sounds familiar to you, and you're ready to deepen your mathematical intuition regarding linear and logistic regression, welcome! This article was written especially for you.

### Bridging the Gaps

The modern world of artificial intelligence lies at the crossroads of three major disciplines:

The Holy Trinity of Machine Learning: Computer Science, Statistics, and Applied Mathematics
"Short Circuit", Tristar Pictures (1986)

I came to this world via computer science, as a self-taught programmer. Over the past 20 years I have learned that it is far easier to copy and paste code from StackOverflow than to actually understand what this code is doing. When I first began reading white papers, I was absolutely terrified of mathematical notation. Weird shapes like $\sum$, $\theta$, $\partial$ and $\infty$ haunted my dreams, and as if that wasn't bad enough, even familiar variables like $y$ began wearing an odd assortment of hats and other accessories, like $\hat{y}$, $\bar{y}$, $\vec{y}$ or even $y_{2}^2$.

Even more confusing: different statisticians, computer scientists and applied mathematicians all prefer and use different notations. As a result, we will encounter many different formulas, with many different notations, all of which refer to the same ideas. We will also find these three groups all suck at naming things, resulting in a plethora of redundant and confusing labels for the same concepts. Yet another challenge: no matter the discipline, after writing pages upon pages of notation, most people make the occasional mistake, or leave out some assumed notation: obvious oversights and typos to those familiar with the notation, but confusing to those of us who are not.

Coming from a world in which a single typo can crash an entire program, I am often confused by the rough sketches of notation I find "in the wild". I don't think I am alone in this.

In this article we're going to begin with some algebra you probably first encountered somewhere around age 12, and we're going to follow it all the way through some basic statistics, a dash of calculus, and even some linear algebra. We are going to connect the dots between all these goofy math formulas and the code snippets we find online. We are going to examine, mathematically, what the above code is doing, and more importantly, why. The purpose of this article, above all else, is to help bridge the gaps between our three fields, so that we can better understand each other.

I got you, fam
"Predator", Twentieth Century Fox (1987)

Along the way I will do my best to explain all the odd notation as we encounter it, and try to help you better recognize the same concepts despite these different notations. You will also find links to Khan Academy and other resources with expanded explanations of these concepts. In a few places I have skipped a few steps for the sake of brevity, but whenever I do, you will find a link to additional resources.

I assume those of you good at math have already scrolled down to the formulas by now. But for those of you still here, those still struggling to overcome their mathphobia, as I once did, I have one more thing to say to you:

### A Word of Encouragement

In July 1998 I was a high school dropout.

In July 2008 I was a college dropout.

As of July 2018 I am currently in grad school.

I do not tell you this to impress you, but rather, because I stand as living proof that anyone can learn this stuff, so long as they are willing to put in the work. No matter which body you find yourself in, no matter where on Earth that body calls home, these tools are available to you. No one on this planet can stop you, except for yourself.

Two final things to remember before we dig in:

1. Education is not a destination, it is a lifelong journey
2. Anything worth doing is worth doing right

All are welcome to compete in the marketplace of ideas. May the best ideas win, everything else be damned.

You got this.

Now let's get to work...

## Linear Regression

Let's start with some junior high math; the formula for a line:

$$y = mx + b$$

The $y$ value on the left axis is calculated by taking our $x$ input, multiplying it by our slope $m$, and adding our y-intercept $b$. To offer a concrete example, below we have visualized $y = \frac{1}{2}x + 2$:

Some examples:

If $x = 2$ then $y = \frac{1}{2}2 + 2 = 3$.

If $x = 3$ then $y = \frac{1}{2}3 + 2 = 3.5$.

If $x = 0$ then $y = \frac{1}{2}0 + 2 = 2$.
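This translates directly into code. A minimal sketch (the function name `line` is just for illustration):

```python
# y = (1/2)x + 2
def line(x, m=0.5, b=2):
    return m * x + b

print(line(2))  # 3.0
print(line(3))  # 3.5
print(line(0))  # 2.0
```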

### Using Lines to Make Predictions

Consider the following, where our $y$ axis represents weight in pounds and our $x$ axis represents height in inches:

If I told you a given person was 70 inches tall, how much would you guess they weigh?

Did you say a bit more than 135 pounds?

If so, congratulations! You just used a line to make a prediction, and this is what linear regression is all about.

$$\hat{y} = predict(x)$$

Our goal is to reduce the above line formula into a predict function. Given any height (in inches) as input ($x$), this function will return $\hat{y}$ (y hat), our predicted weight in pounds.

$$\hat{y} = mx + b$$

Once we define our slope $m$ and y-intercept $b$, we can easily calculate $\hat{y}$ for any value of $x$.

### From Numbers to Vectors

Until now, $x$ and $y$ have each represented a single number. However, the above graph is based on several observations, not just one. To accommodate this fact, let us define $\vec{x}$ as a vector of heights in inches, and $\vec{y}$ as a vector of corresponding weights in pounds.

$$\vec{x} = [66,63,74,67,70]$$ $$\vec{y} = [113,123,140,136,145]$$

Programmers familiar with lists, arrays or matrices will recognize the above syntax, and the concept is the same. A single variable $x$ now represents a series of numbers. And in math, as in programming, we can reference a single element according to its index. For example, here is the height and weight of the first observation:

$$x_1 = 66$$ $$y_1 = 113$$
Warning: In most programming languages, indexes start at 0. However, in R, MATLAB and traditional mathematics, indexes start at 1. Deal with it.

Remember: order matters. The first $x$ should correspond to the first $y$, and so on.
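In python, which indexes from 0, the first observation ($x_1$, $y_1$) lives at index 0. A minimal sketch:

```python
x = [66, 63, 74, 67, 70]       # heights in inches
y = [113, 123, 140, 136, 145]  # weights in pounds

# math's x_1 and y_1 are python's x[0] and y[0]
print(x[0])  # 66
print(y[0])  # 113
```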

### The Residual

In order to calculate our regression line, and more importantly, understand how well it models the data, we must understand the residual, which is simply the difference between the actual value and our predicted value.

$$residual_i = y_i - \hat{y_i}$$

We can replace $\hat{y}$ with the line formula from above:

$$residual_i = y_i - (mx_i + b)$$

The $i$ subscript refers to the $ith$ element of each vector. For example, our 3rd residual:

$$residual_3 = y_3 - (mx_3 + b)$$

How well does this line fit this data? It may be tempting to simply sum these residuals. However, that can cause a problem when these residuals cancel each other out. Consider the following:

$$residual_2 = 123 - 120 = 3$$ $$residual_3 = 140 - 145 = -5$$

These two residuals sum to -2. That would suggest a "better than perfect" fit, which doesn't make any sense.
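We can reproduce this cancellation in a couple of lines of python (the predicted values 120 and 145 come from the example above):

```python
# actual value minus predicted value
residual_2 = 123 - 120  # 3
residual_3 = 140 - 145  # -5

# the raw residuals cancel each other out, hiding the size of the errors...
print(residual_2 + residual_3)  # -2

# ...but squaring first keeps every error positive
print(residual_2**2 + residual_3**2)  # 34
```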

### The Sum of Squared Error

The sum of squared error resolves this issue by multiplying all residuals by themselves, which turns all negatives into positives.

$$SSE = \sum_{i=1}^n (y_i - (mx_i + b))^2$$

This formula is the same as the above residual formula, squared and summed. But what about this $\sum$ doohicky? That's called a sigma, the summation symbol. Allow me to translate this math formula directly into python:

```python
sse = 0
for i in range(len(x)):
    sse += (y[i] - ((m * x[i]) + b)) ** 2
```

All this $\sum$ means is to loop over all items in both of our equal length vectors ($\vec{x}$ and $\vec{y}$), and sum the total result.

Let's visualize what's going on here:

Each of these blue dots represents one of our ($x_i, y_i$) pairs. The black line is our regression line, based on our slope $m$ and y-intercept $b$. The green lines show positive residuals, the red lines show negative residuals. The sum of squared error is simply the sum of all of these red and green lines, after we square them to remove the negatives.

But how do we define $m$ and $b$ in the first place?

### Minimize the Error

We want the values of $m$ and $b$ that minimize our sum of squared error:

$$SSE = \sum_{i=1}^n (y_i - (mx_i + b))^2$$

To find this, we begin by expanding this polynomial. Khan Academy offers an excellent step-by-step breakdown of the process for those who need it.

$$SSE = \sum_{i=1}^n(y_i^2) - 2m(\sum_{i=1}^n(x_i*y_i)) - 2b(\sum_{i=1}^n y_i) + m^2(\sum_{i=1}^n x_i^2) + 2mb(\sum_{i=1}^n x_i) + nb^2$$

Given all these sigmas, this formula is starting to look pretty awful. Let's consider the first term:

$$\sum_{i=1}^n(y_i^2)$$

Is there another way to derive this same number? Yes.

$$\sum_{i=1}^n(y_i^2) = n(\bar{y^2})$$

The sum of all $y_i^2$ is equal to $n$ times the mean (average) value of $y^2$. Consider this in different notation:

$$(y_1^2+y_2^2+...+y_n^2) = n(\frac{y_1^2+y_2^2+...+y_n^2}{n})$$

Replacing these sums with $n$ times their equivalent means allows us to go from this:

$$SSE = \sum_{i=1}^n(y_i^2) - 2m(\sum_{i=1}^n(x_i*y_i)) - 2b(\sum_{i=1}^n y_i) + m^2(\sum_{i=1}^n x_i^2) + 2mb(\sum_{i=1}^n x_i) + nb^2$$

To this:

$$SSE = n(\bar{y^2}) - 2mn(\bar{xy}) - 2bn(\bar{y}) + m^2n(\bar{x^2}) + 2mbn(\bar{x}) + nb^2$$
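We can verify the two forms agree by evaluating both on our height and weight data for an arbitrary $m$ and $b$ (a quick sketch; the parameter values are picked at random):

```python
x = [66, 63, 74, 67, 70]       # heights in inches
y = [113, 123, 140, 136, 145]  # weights in pounds
n = len(x)
m, b = 2.0, -10.0  # arbitrary parameters, just for the check

def mean(v):
    return sum(v) / len(v)

# the original sum form
sse_sum = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

# the equivalent mean form
sse_mean = (n * mean([yi ** 2 for yi in y])
            - 2 * m * n * mean([xi * yi for xi, yi in zip(x, y)])
            - 2 * b * n * mean(y)
            + m ** 2 * n * mean([xi ** 2 for xi in x])
            + 2 * m * b * n * mean(x)
            + n * b ** 2)

print(sse_sum, abs(sse_sum - sse_mean) < 1e-6)  # 503.0 True
```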

### A Dash of Calculus

From here we need to delve into a bit of calculus. Specifically, we need to find the partial derivatives ($\frac{\partial SSE}{\partial m}$ and $\frac{\partial SSE}{\partial b}$) and set them to 0. Khan Academy offers complete video instructions for those who need a refresher, or are new to multivariate calculus.

The partial derivative ($\frac{\partial SSE}{\partial m}$) uses the same rules of a regular derivative ($\frac{dSSE}{dm}$), except everything aside from our target variable is treated as a constant. For example, when calculating the partial derivative of $SSE$ with respect to $m$ we treat $b$, $n$, $x$ & $y$ as constants. The derivative of a constant is always 0, which has been included below to make these equations more readable:

$$SSE = n(\bar{y^2}) - 2mn(\bar{xy}) - 2bn(\bar{y}) + m^2n(\bar{x^2}) + 2mbn(\bar{x}) + nb^2$$ $$\frac{\partial SSE}{\partial m} = 0 - 2n(\bar{xy}) - 0 + 2mn(\bar{x^2}) + 2bn(\bar{x}) + 0$$ $$\frac{\partial SSE}{\partial b} = 0 - 0 - 2n(\bar{y}) + 0 + 2mn(\bar{x}) + 2bn$$

### Showboating Globetrotter Algebra

If the above calculus confused you, good news! From here we just need some more algebra.

The greatest minds in the universe

We divide the partial derivatives by $2n$ and drop the 0's:

$$\frac{\partial SSE}{\partial m} = -(\bar{xy}) + m(\bar{x^2}) + b(\bar{x})$$ $$\frac{\partial SSE}{\partial b} = -(\bar{y}) + m(\bar{x}) + b$$

If we rearrange terms, this starts to resemble the line formula again. Let's start with the second partial derivative:

$$\frac{\partial SSE}{\partial b} = -(\bar{y}) + m(\bar{x}) + b = 0$$ $$\bar{y} = m(\bar{x}) + b$$

We can do the same with the other partial derivative:

$$\frac{\partial SSE}{\partial m} = -(\bar{xy}) + m(\bar{x^2}) + b(\bar{x}) = 0$$ $$\bar{xy} = m(\bar{x^2}) + b(\bar{x})$$ $$\frac{\bar{xy}}{\bar{x}} = m(\frac{\bar{x^2}}{\bar{x}}) + b$$

We can now solve for $m$ by treating these as a system of equations, and subtracting the bottom from the top:

\begin{alignat*}{5} & & \bar{y} & = & m(\bar{x}) & + b \\ & - & \frac{\bar{xy}}{\bar{x}} & = & m(\frac{\bar{x^2}}{\bar{x}}) & + b \\ \hline & & \bar{y}-\frac{\bar{xy}}{\bar{x}} & = & m(\bar{x}-\frac{\bar{x^2}}{\bar{x}}) & \\ \end{alignat*}

Now we solve for $m$ by dividing both sides by $(\bar{x}-\frac{\bar{x^2}}{\bar{x}})$:

$$m = \frac{\bar{y}-\frac{\bar{xy}}{\bar{x}}}{\bar{x}-\frac{\bar{x^2}}{\bar{x}}}$$

Clean it up by multiplying the numerator and denominator by $\bar{x}$, i.e., multiplying by 1 in the form of $\frac{\bar{x}}{\bar{x}}$:

$$m = \frac{(\bar{x})(\bar{y})-(\bar{xy})}{(\bar{x})^2-(\bar{x^2})}$$

Once we have $m$, finding $b$ is considerably easier:

$$\bar{y} = m(\bar{x}) + b$$

Subtract $m(\bar{x})$ from both sides:

$$\bar{y} - m(\bar{x}) = b$$

Rearrange terms:

$$b = \bar{y} - m(\bar{x})$$

### Using the Formulas

Now that we have the formulas to find $m$ and $b$, we can do regression with some more junior high math:

$$m = \frac{(\bar{x})(\bar{y})-(\bar{xy})}{(\bar{x})^2-(\bar{x^2})}$$

Terms with a bar over top are just means (averages).

$$\vec{x} = [66,63,74,67,70]$$ $$(\bar{x}) = \frac{(66+63+74+67+70)}{5}$$

What about this $(\bar{xy})$ doohicky?

$$\vec{y} = [113,123,140,136,145]$$ $$(\bar{xy}) = \frac{((66*113)+(63*123)+(74*140)+(67*136)+(70*145))}{5}$$

Note that the bottom two terms are different:

$$(\bar{x})^2 = (\bar{x}*\bar{x}) = (\frac{66+63+74+67+70}{5})^2$$ $$(\bar{x^2}) = \frac{(66^2+63^2+74^2+67^2+70^2)}{5}$$

Once we find all of our means we can just pop them into the formula:

$$m = \frac{(\bar{x})(\bar{y})-(\bar{xy})}{(\bar{x})^2-(\bar{x^2})} \approx 2.19$$

Note the squiggly equal sign ($\approx$). This means an approximate value (since I am rounding to 2.19 here).

We find $b$ with more substitution:

$$b = \bar{y} - m(\bar{x})$$ $$b = \bar{y} - 2.19(\bar{x})$$ $$b \approx -17$$

And now we have our prediction function:

$$\hat{y} = mx + b$$ $$\hat{y} = 2.19x - 17$$

Given an input of 70 inches, it returns a prediction of about 136 pounds:

$$136 \approx 2.19(70) - 17$$
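Here is the whole calculation in python; it reproduces the rounded values above (a sketch of the closed-form formulas, using plain lists rather than numpy):

```python
x = [66, 63, 74, 67, 70]       # heights in inches
y = [113, 123, 140, 136, 145]  # weights in pounds

def mean(v):
    return sum(v) / len(v)

x_bar = mean(x)                                   # 68.0
y_bar = mean(y)                                   # 131.4
xy_bar = mean([xi * yi for xi, yi in zip(x, y)])  # 8965.8
x2_bar = mean([xi ** 2 for xi in x])              # 4638.0

# m = (x_bar * y_bar - xy_bar) / (x_bar^2 - x2_bar)
m = (x_bar * y_bar - xy_bar) / (x_bar ** 2 - x2_bar)

# b = y_bar - m * x_bar
b = y_bar - m * x_bar

print(round(m, 2))        # 2.19
print(round(b))           # -17
print(round(m * 70 + b))  # 136 (the prediction for 70 inches)
```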

### Terminology Review

Let's pause for a moment and connect the above mathematics to some key terms used in the field of machine learning.

The Model
aka univariate linear regression

This model accepts a single feature ($x$) and uses the parameters ($m$ and $b$) to calculate our predicted value ($\hat{y}$).

$$\hat{y} = mx + b$$

The Cost Function
aka the error function, squared error, sum of squared error, sum of squares

This cost function sums the squared difference between the actual value ($y$) and predicted value ($\hat{y}$) across all single-variable observations in the vector ($\vec{x}$), for any given parameters ($m$ and $b$).

$$SSE = \sum_{i=1}^n (y_i - (mx_i + b))^2$$

The Method for Estimating Parameters
aka fitting the model, training the model, optimizing our parameters

We found the $m$ and $b$ parameters with the smallest total cost (least error) by taking the partial derivative of our cost function with respect to $b$, and with respect to $m$, and setting both equal to 0. In machine learning it is common (but not necessary) to use iterative and stochastic methods instead (e.g. gradient descent).

$$\frac{\partial SSE}{\partial m} = \frac{\partial SSE}{\partial b} = 0$$

All of the above math breaks into three distinct components.

Remember playing with LEGO? This is kinda like that...
"The Lego Movie", Warner Bros. (2014)

For example, we could replace our linear regression model with an SVM or kNN. And instead of using our deterministic method to estimate our parameters, we could use gradient descent instead. And no matter which method or algorithm we use to optimize parameters, it will need a cost function to determine how well any given parameters fit any given data once entered into our selected model.
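As a taste of what that alternative looks like, here is a minimal gradient descent sketch for our single-variable model. The learning rate and iteration count are made-up values for illustration; note that without feature scaling, $b$ converges very slowly here, so this sketch only drives the error down rather than fully matching the closed-form answer:

```python
x = [66, 63, 74, 67, 70]       # heights in inches
y = [113, 123, 140, 136, 145]  # weights in pounds

def sse(m, b):
    """Sum of squared error for the given parameters."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

m, b = 0.0, 0.0
learning_rate = 0.00001  # hypothetical step size

for _ in range(10000):
    # partial derivatives of SSE with respect to m and b
    grad_m = sum(-2 * xi * (yi - (m * xi + b)) for xi, yi in zip(x, y))
    grad_b = sum(-2 * (yi - (m * xi + b)) for xi, yi in zip(x, y))
    # step downhill, against the gradient
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(sse(0.0, 0.0), sse(m, b))  # the error drops dramatically
```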

## Multivariate Linear Regression

In the previous example we used height (our independent, $x$ variable) to predict weight (our dependent, $y$ variable). In this example, in addition to height, we will add the corresponding ages and incomes for each observation. This same math can extend to any number of features, which is why we call it multivariate.

Recall our single-variable model:

$$\hat{y} = mx + b$$

We can imagine the multi-variable model like so:

$$\hat{y} = m_1x_1 + m_2x_2 + m_3x_3 + b$$

Each of our $x$ features is multiplied by a corresponding slope $m$, and these terms are summed together with $b$.

Although the concept is the same, the python sklearn documentation has it as:

$$\hat{y}(\vec{w},\vec{x}) = w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p$$

In this notation, $\vec{w}$ and $\vec{x}$ are both vectors, and unlike our single-variable notation, each $\vec{x}$ vector represents all features from a single observation. Consider this in table form:

| Observation | $x_1$ (height) | $x_2$ (age) | $x_3$ (income) |
|---|---|---|---|
| 1 | 66 | 35 | 75125 |
| 2 | 63 | 22 | 26750 |
| ... | ... | ... | ... |
| 5 | 70 | 55 | 110500 |

To calculate $\hat{y}$ (our predicted value) for our first observation (row 1) we provide an $\vec{x}$ vector like so:

$$\vec{x} = [66,35,75125]$$

We can plug this into the above formula:

$$\hat{y} = w_0 + w_1(66) + w_2(35) + w_3(75125)$$

What about our $\vec{w}$ vector? It begins with our intercept ($b$ in the single-variable formula is the same as $w_0$ here). The rest of our $\vec{w}$ vector is comprised of the coefficients (slopes) for each of our three $x$ features. In the single-variable example above our mission was to define $m$ and $b$; here our mission is to define our $\vec{w}$ vector instead.
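In code, this model is just a weighted sum. A minimal sketch (the numbers in `w` are made-up placeholders, purely to show the mechanics; we haven't fit anything yet):

```python
# hypothetical placeholder weights: w[0] is the intercept (w_0),
# the rest are coefficients for height, age and income
w = [-44.0, 2.5, -0.05, 0.0001]

# the features of observation 1: height, age, income
x = [66, 35, 75125]

# y_hat = w_0 + w_1*x_1 + w_2*x_2 + w_3*x_3
y_hat = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
print(y_hat)
```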

### Enter the Matrix

As with the single-variable example above, we will use a deterministic method to optimize our parameters. Specifically, we will use the least squares method, which is the same default approach used by the LinearRegression class from the python sklearn package and lm function from the R stats package.

In order to calculate least squares, we use linear algebra, which means changing up our notation yet again:

$$\vec{y} = X\vec{b} + \vec{\epsilon}$$

This still resembles the basic line formula, except $\vec{y}$, $\vec{b}$, and $\vec{\epsilon}$ are vectors, and capital $X$ is a matrix of all features. In this notation, $\vec{b}$ holds our coefficients ($\vec{w}$ above) and $\vec{\epsilon}$ (epsilon) holds our errors.

If we expand this formula it looks like this:

$$\begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \\ y_{5} \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} \\ x_{2,1} & x_{2,2} & x_{2,3} \\ x_{3,1} & x_{3,2} & x_{3,3} \\ x_{4,1} & x_{4,2} & x_{4,3} \\ x_{5,1} & x_{5,2} & x_{5,3} \end{bmatrix} * \begin{bmatrix} b_{1} \\ b_{2} \\ b_{3} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix}$$

Note that $\vec{b}$ has one entry per feature (three here), while $\vec{y}$ and $\vec{\epsilon}$ have one entry per observation (five here).

You will notice that the items in the $X$ matrix each have two indexes; the first refers to the row number, the second to the feature number. In this example: $x_{1,1}$ is the height of the 1st observation, $x_{3,2}$ is the age of the 3rd observation, and $x_{5,3}$ is the income of the 5th observation. Recall the above table:

| Observation | $x_1$ (height) | $x_2$ (age) | $x_3$ (income) |
|---|---|---|---|
| 1 | 66 | 35 | 75125 |
| ... | ... | ... | ... |
| 5 | 70 | 55 | 110500 |

As before, our $\vec{y}$ vector contains the dependent variable (the thing we are trying to predict), in this example, weight in pounds. The first column of the $X$ matrix is the same as the $\vec{x}$ vector of the single-variable example; here it is joined by the other two features in parallel columns. To make this crystal clear, here is the same expanded formula with known values filled in:

$$\begin{bmatrix} 113 \\ 123 \\ 140 \\ 136 \\ 145 \end{bmatrix} = \begin{bmatrix} 66 & 35 & 75125 \\ 63 & 22 & 26750 \\ 74 & 65 & 11500 \\ 67 & 25 & 135600 \\ 70 & 55 & 110500 \end{bmatrix} * \begin{bmatrix} b_{1} \\ b_{2} \\ b_{3} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix}$$

Key point: we can accommodate more observations by adding more rows to this formula, and accommodate more features by adding more columns to our $X$ matrix. This can be expanded to any size we wish; this example uses 5 observations and 3 features. And just as with regular algebra, and showboating globetrotter algebra, linear algebra allows us to manipulate equations based on the relationships between unknown variables to extract new insights.

### The Cost Function

As with the single-variable example, we will use the sum of squared error here too. Given that our linear algebra formula contains a specific error term ($\epsilon$), we just need to square and sum it:

$$SSE = \sum_{i=1}^n \epsilon_i^2$$

However, due to the rules of matrix multiplication, we cannot multiply the $\epsilon$ vector by itself until we transpose it. For those who need it, Khan Academy offers an excellent lesson on matrix multiplication. I also find this matrix calculator handy for checking my work. Key point: in linear algebra, the shapes of matrices and vectors matter a lot.

Recall that in linear algebra notation our $\epsilon$ vector looks like this:

$$\vec{\epsilon} = \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix}$$

We denote transpose with a superscript capital T ($\vec{\epsilon}^T$), or sometimes with a little dash like so ($\vec{\epsilon}'$):

$$\vec{\epsilon}^T = \vec{\epsilon}' = \begin{bmatrix} \epsilon_{1} & \epsilon_{2} & \epsilon_{3} & \epsilon_{4} & \epsilon_{5} \\ \end{bmatrix}$$

Ergo:

$$\vec{\epsilon}^2 = \vec{\epsilon}^T\vec{\epsilon} = \begin{bmatrix} \epsilon_{1} & \epsilon_{2} & \epsilon_{3} & \epsilon_{4} & \epsilon_{5} \\ \end{bmatrix} * \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix}$$

Or to put it another way:

$$SSE = \sum_{i=1}^n \epsilon_i^2 = \vec{\epsilon}^T\vec{\epsilon} = \begin{bmatrix} \epsilon_{1} & \epsilon_{2} & \epsilon_{3} & \epsilon_{4} & \epsilon_{5} \\ \end{bmatrix} * \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \end{bmatrix}$$

Different notation, different rules of multiplication, but the same concept.
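We can check this equivalence with numpy (a quick sketch using made-up residual values):

```python
import numpy as np

# a hypothetical vector of residuals
eps = np.array([[3.0], [-5.0], [2.0], [-1.0], [4.0]])

# sum of squared error, two ways
sse_loop = sum(e**2 for e in eps.flatten())  # 9+25+4+1+16 = 55
sse_linalg = (eps.T @ eps).item()            # same number via the transpose

print(sse_loop, sse_linalg)  # 55.0 55.0
```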

Nothing to be afraid of.

If Homer Simpson can figure this out, so can you.

### Optimize the Parameters

Aside from the details of linear algebra, the logic is essentially the same here as with our single-variable example.

$$\vec{y} = X\vec{b} + \vec{\epsilon}$$

Solve for $\vec{\epsilon}$:

$$\vec{\epsilon} = \vec{y} - X\vec{b}$$

Square it (transpose):

$$SSE = \vec{\epsilon}^T\vec{\epsilon} = (\vec{y} - X\vec{b})^T(\vec{y} - X\vec{b})$$

We distribute the transpose inside the parentheses (note that the order reverses: $(X\vec{b})^T = \vec{b}^TX^T$):

$$SSE = (\vec{y}^T - \vec{b}^TX^T)(\vec{y} - X\vec{b})$$

From here we can expand like any other polynomial:

$$SSE = \vec{y}^T\vec{y} - \vec{b}^TX^T\vec{y} -\vec{y}^TX\vec{b} + \vec{b}^TX^TX\vec{b}$$

I am going to skip a few steps for brevity. See this post from Economic Theory Blog for details or page 7 of this paper by Andrew and Constable for a full proof. Key point: we use a few linear algebra tricks to simplify the equation further.

$$SSE = \vec{y}^T\vec{y} - 2\vec{b}^TX^T\vec{y} + \vec{b}^TX^TX\vec{b}$$

Again our goal is to minimize our cost function, so again we need to find the partial derivative and set it to 0, this time with respect to $b$.

$$\frac{\partial SSE}{\partial b} = -2X^T\vec{y} + 2X^TX\vec{b} = 0$$

And from here we simplify further and solve for $b$:

$$\vec{b} = (X^TX)^{-1}X^T\vec{y}$$

### Using the Formula

Now that we have a formula for $\vec{b}$, we can get our coefficients (slopes) and our intercept ($w_0$).

$$\vec{b} = (X^TX)^{-1}X^T\vec{y}$$

Recall our $\vec{y}$ vector of weight (in pounds):

$$\vec{y} = \begin{bmatrix} 113 \\ 123 \\ 140 \\ 136 \\ 145 \end{bmatrix}$$

Recall our $X$ matrix of three features: height (in inches), age (in years), and income (in dollars):

$$X = \begin{bmatrix} 66 & 35 & 75125 \\ 63 & 22 & 26750 \\ 74 & 65 & 11500 \\ 67 & 25 & 135600 \\ 70 & 55 & 110500 \end{bmatrix}$$

In order to calculate our intercept ($w_0$) we will add a column of 1's to our $X$ matrix:

$$X = \begin{bmatrix} 1 & 66 & 35 & 75125 \\ 1 & 63 & 22 & 26750 \\ 1 & 74 & 65 & 11500 \\ 1 & 67 & 25 & 135600 \\ 1 & 70 & 55 & 110500 \end{bmatrix}$$

In order to express the first term of our formula, we must transpose $X$:

$$X^TX = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 66 & 63 & 74 & 67 & 70 \\ 35 & 22 & 65 & 25 & 55 \\ 75125 & 26750 & 11500 & 135600 & 110500 \end{bmatrix} * \begin{bmatrix} 1 & 66 & 35 & 75125 \\ 1 & 63 & 22 & 26750 \\ 1 & 74 & 65 & 11500 \\ 1 & 67 & 25 & 135600 \\ 1 & 70 & 55 & 110500 \end{bmatrix}$$

Now we can multiply:

$$X^TX = \begin{bmatrix} 5 & 340 & 202 & 359475 \\ 340 & 23190 & 14031 & 24314700 \\ 202 & 14031 & 9584 & 13432875 \\ 359475 & 24314700 & 13432875 & 37089188125 \end{bmatrix}$$

Now what about this ($^{-1}$) doohicky? This means we need the inverse of this new combined matrix ($X^TX$). Working out the inverse of larger matrices can get tricky. Khan Academy has a complete video series on the subject. Numbers are rounded here for cleaner notation:

$$(X^TX)^{-1} = \begin{bmatrix} 434.7838406064 & -7.36983664257 & 1.5438829428 & 0.0000583096 \\ -7.3698364247 & 0.1256055225 & -0.0269254232 & -0.0000011621 \\ 1.5438829428 & -0.0269254232 & 0.0065308471 & 0.0000003227 \\ 0.0000583096 & -0.0000011621 & 0.0000003227 & 0.0000000001 \end{bmatrix}$$

Our second term:

$$X^T\vec{y} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 66 & 63 & 74 & 67 & 70 \\ 35 & 22 & 65 & 25 & 55 \\ 75125 & 26750 & 11500 & 135600 & 110500 \end{bmatrix} * \begin{bmatrix} 113 \\ 123 \\ 140 \\ 136 \\ 145 \end{bmatrix}$$

More matrix multiplication:

$$X^T\vec{y} = \begin{bmatrix} 657 \\ 44829 \\ 27136 \\ 47853475 \end{bmatrix}$$

Now we multiply terms together:

$$(X^TX)^{-1}X^T\vec{y} = \begin{bmatrix} 434.7838406064 & -7.36983664257 & 1.5438829428 & 0.0000583096 \\ -7.3698364247 & 0.1256055225 & -0.0269254232 & -0.0000011621 \\ 1.5438829428 & -0.0269254232 & 0.0065308471 & 0.0000003227 \\ 0.0000583096 & -0.0000011621 & 0.0000003227 & 0.0000000001 \end{bmatrix} * \begin{bmatrix} 657 \\ 44829 \\ 27136 \\ 47853475 \end{bmatrix}$$

And voilà! Here is our $\vec{b}$ vector (rounded):

$$\vec{b} = (X^TX)^{-1}X^T\vec{y} = \begin{bmatrix} -44.29148 \\ 2.52675 \\ -0.04592 \\ 0.00008 \end{bmatrix}$$

The first item in our $\vec{b}$ vector is the intercept ($w_0$). The rest of the $\vec{b}$ vector contains our coefficients ($w_1, w_2, w_3$).
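We can replicate this entire pencil-and-paper calculation with a few lines of numpy (a sketch; `np.linalg.solve` computes the same $\vec{b}$ as the explicit inverse, just more stably):

```python
import numpy as np

# the X matrix with a leading column of 1's for the intercept
X = np.array([[1, 66, 35, 75125],
              [1, 63, 22, 26750],
              [1, 74, 65, 11500],
              [1, 67, 25, 135600],
              [1, 70, 55, 110500]], dtype=float)

# the y vector of weights in pounds
y = np.array([113, 123, 140, 136, 145], dtype=float)

# b = (X^T X)^(-1) X^T y, computed by solving the linear system
# (X^T X) b = X^T y rather than forming the inverse explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)

print(b)  # intercept first, then the three coefficients
```

The first entry matches our intercept (about -44.29148) and the remaining entries match our coefficients, within rounding.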

Recall the above model formula:

$$\hat{y} = w_0 + w_1x_1 + w_2x_2 + w_3x_3$$

Substitute our intercept ($w_0$) and coefficients:

$$\hat{y} = -44.29148 + 2.52675(x_1) - 0.04592(x_2) + 0.00008(x_3)$$

Let us predict the weight (in pounds) of a person with a height of 66 inches, age of 35 years, and income of $75,125:

$$\hat{y} = -44.29148 + 2.52675(66) - 0.04592(35) + 0.00008(75125)$$

This new multivariate linear regression model returns a prediction of roughly 127 pounds.

$$127 \approx -44.29148 + 2.52675(66) - 0.04592(35) + 0.00008(75125)$$

### Terminology Review

Just to make everything crystal clear:

The Model
aka multivariate linear regression

This model accepts a feature vector ($\vec{x}$) and uses a parameter vector ($\vec{w}$) to calculate our predicted value ($\hat{y}$) for any single observation (row of data).

$$\hat{y} = w_0 + w_1x_1 + w_2x_2 + w_3x_3$$

The Cost Function
aka the error function, squared error, sum of squared error, sum of squares

This cost function sums our squared error ($\vec{\epsilon}^2$) across all observations in the feature matrix ($X$), for any given parameter vector ($\vec{b}$), using linear algebra.

$$SSE = (\vec{y} - X\vec{b})^T(\vec{y} - X\vec{b})$$

The Method for Estimating Parameters
aka fitting the model, training the model, optimizing our parameters, least squares method, ordinary least squares

We used the least squares method to minimize our cost function and find the parameter vector with the smallest total cost (least error) by taking the partial derivative of our cost function with respect to $b$ and setting it equal to 0. Once again we used a deterministic method, as opposed to an iterative and stochastic method like gradient descent.
$$\frac{\partial SSE}{\partial b} = 0$$

### Bridging the Gap to Python

Now that we've paid our dues, let's make python do this math for us:

```python
# load dependencies
from sklearn import linear_model
import numpy as np

# define X and y and look at them
X = np.matrix([[66, 35, 75125],[63,22,26750],[74,65,11500],[67,25,135600],[70,55,110500]])
y = np.matrix([[113],[123],[140],[136],[145]])
print(X)
print(y)

# train our model
model = linear_model.LinearRegression()
model.fit(X,y)

# we get the same parameters
print("COEF:",model.coef_)
print("INTERCEPT:",model.intercept_)

# we get the same prediction
X_input = np.matrix([[66,35,75125]])
prediction = model.predict(X_input)
print(prediction)
```

We get the same parameters:

```
COEF: [[ 2.52674720e+00 -4.59219855e-02  7.96706668e-05]]
INTERCEPT: [-44.29148412]
```

We get the same prediction:

```
[[126.85182057]]
```

And now we can do it with pencil and paper. Assuming we lose power during the inevitable robot apocalypse, that may come in handy.

## Logistic Regression

We've covered a lot of math so far, but what about $\infty$, $e$, and the rules of negative exponents?

First, let's take a step back and consider why we should even want to use logistic regression in the first place.

In the linear regression examples above, our prediction ($\hat{y}$) is a continuous value (weight in pounds), as opposed to a discrete (aka categorical) value. In logistic regression, our goal is to predict a discrete value: which group an observation belongs to.

To better understand the difference, imagine if we divided our 5 observations into two groups: those who love broccoli (marked with 1), and those who don't (marked with 0). Consider in table form:

| Observation | $x_1$ (height) | $x_2$ (age) | $x_3$ (income) | $y$ (love broccoli) |
|---|---|---|---|---|
| 1 | 66 | 35 | 75125 | 1 |
| 2 | 63 | 22 | 26750 | 1 |
| ... | ... | ... | ... | ... |
| 5 | 70 | 55 | 110500 | 0 |

According to this table, observations 1 and 2 love broccoli, and observation 5 does not.
I intentionally picked a silly example to remind us that observations can be grouped by literally anything
Medical News Today

Before we had a vector of weights in pounds:

$$\vec{y} = [113,123,140,136,145]$$

And we predicted a continuous weight in pounds:

$$\hat{y} = 126.8518...$$

Now we have a boolean vector (0's and 1's) representing love of broccoli:

$$\vec{y} = [1,1,0,1,0]$$

And we want to predict which of these two groups (0 or 1) an observation belongs to.

$$\hat{y} = 0 \hspace{1cm}\text{ or }\hspace{1cm} \hat{y} = 1$$

In mathematics, we can define these discrete values like so:

$$y \in \{0, 1\}$$

### The Logistic Regression Model

Good news! If you've been paying attention so far, this model should look familiar:

$$\hat{y}(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p)$$

This is almost identical to our multivariate linear regression model; however, there are two gotchas here:

1. Everything is wrapped in a $g()$ function.
2. The output ($\hat{y}$) as defined here is not yet in proper 0 or 1 format.

$$\hat{y} \neq y \in \{0, 1\}$$

Let's look at these issues one at a time.

### The Logistic Function

That $g()$ function above? That's the logistic function, also known as the sigmoid function, which is our activation function for this model. As you might have guessed, this is where logistic regression gets its name. Whatever you call it, this function looks like this:

$$g(z) = \frac{1}{1+e^{-z}}$$

This formula contains ($e^{-z}$), which is a tricky term if you've never seen something like this before. This is Euler's constant to a negative power. And right about now you might be asking: what the hell is a negative power? Don't panic. When $z$ is negative, $-z$ switches back to positive. Recall how positive powers work:

$$2^2=4 \hspace{1cm} 2^3 = 8 \hspace{1cm} 3^3 = 27$$

Negative powers work like so:

$$2^{-2}=\frac{1}{4} \hspace{1cm} 2^{-3}=\frac{1}{8} \hspace{1cm} 3^{-3}=\frac{1}{27}$$

Easy enough. Now, what about this $e$ doohicky?
### Defining Euler's Number

Have you ever seen little kids arguing over which is the biggest number? Usually one clever kid will exclaim: "infinity!", only to be outwitted by the first to shout "infinity plus one!". Well, I've got some bad news, kids.

$$\infty + 1 \neq \text{math}$$

"Infinity plus one" doesn't mean anything, mathematically speaking, because $\infty$ is a limit, not a number.

Who cares? What does that have to do with $e$?

$$e = \mathop {\lim }\limits_{n \to \infty } \left( {1 + \frac{1}{n}} \right)^n$$

If you are unfamiliar with calculus, the above may cause a panic attack. That's normal. Take a deep breath. All this means is that $e$ is equal to the limit of this expression as $n$ approaches $\infty$.

What does this mean? Let $n = 5$ for the sake of example:

$$\left({1 + \frac{1}{5}}\right)^5 \approx 2.48$$

Let $n = 50$:

$$\left({1 + \frac{1}{50}}\right)^{50} \approx 2.69$$

Let $n = 500$:

$$\left({1 + \frac{1}{500}}\right)^{500} \approx 2.71$$

Let $n = 5000$:

$$\left({1 + \frac{1}{5000}}\right)^{5000} \approx 2.72$$

Now imagine we continued increasing $n$ towards $\infty$. Of course, we will never actually reach $\infty$, because $\infty$ isn't a number. But in some cases, like this one, we converge towards a specific number:

$$e = \mathop {\lim }\limits_{n \to \infty } \left( {1 + \frac{1}{n}} \right)^n \approx 2.718281828...$$

As we continue to increase $n$ past 5000, we gain precision to a greater number of decimal places. In reality, the digits of $e$ just keep on going and going, just like $\pi$. At a certain point we simply use a given decimal value because it's "good enough" for a solid approximation: $\pi \approx 3.14$, and $e \approx 2.72$.

### Back to the Logistic Function

Now that we understand $e$ and $\infty$, we can take another look at our model:

$$\hat{y}(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p)$$

Specifically, let's consider this part:

$$w_0 + w_1x_1 + w_2x_2 + w_3x_3$$

Given that $x_1$, $x_2$ and $x_3$ might be any number, in theory this calculation could return anything from $-\infty$ to $\infty$. And whatever it returns will go into our logistic function as $z$:

$$g(z) = \frac{1}{1+e^{-z}}$$

Substitute $e$ with our decimal approximation:

$$g(z) \approx \frac{1}{1+2.72^{-z}}$$

Bad Notation Warning: normally we wouldn't mix fractions and decimals like this (since they are different forms of the same concept). I did this to make these steps easier to follow.

Let $z = 5$ for the sake of example:

$$g(5) \approx \frac{1}{1+2.72^{-5}}$$

Don't let the negative exponent stop you:

$$g(5) \approx \frac{1}{1+\frac{1}{2.72^5}}$$

Find $2.72$ to the $5th$ power:

$$g(5) \approx \frac{1}{1+\frac{1}{148.882797363}}$$

Convert the 1 in the denominator to another fraction:

$$g(5) \approx \frac{1}{\frac{148.882797363}{148.882797363}+\frac{1}{148.882797363}}$$

Add the fractions in the denominator:

$$g(5) \approx \frac{1}{\frac{149.882797363}{148.882797363}}$$

Evaluate the fraction in the denominator via division to convert it to a decimal:

$$g(5) \approx \frac{1}{1.00671669271}$$

Divide our numerator by our denominator:

$$g(5) \approx 0.993...$$

Long story short, the logistic function converges to 1 as $z$ approaches $\infty$:

$$\mathop {\lim }\limits_{z \to \infty } \left(\frac{1}{1+e^{-z}} \right) = 1$$

It also converges to 0 as $z$ approaches $-\infty$:

$$\mathop {\lim }\limits_{z \to -\infty } \left(\frac{1}{1+e^{-z}} \right) = 0$$

Let's visualize what that means:

Note the distinct "S" shape, hence why this is sometimes called the sigmoid function. We can see this takes any input and "squishes" it down to an output between 0 and 1. And herein lies the magic of logistic regression.
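Both ideas above — approximating $e$ by cranking up $n$, and squashing any $z$ into the interval between 0 and 1 — are easy to verify for yourself. Here's a minimal sketch using only python's standard `math` module:

```python
import math

# e is the limit of (1 + 1/n)^n as n grows towards infinity
for n in [5, 50, 500, 5000]:
    print(n, (1 + 1 / n) ** n)
print("math.e:", math.e)  # 2.718281828459045

def g(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1 / (1 + math.e ** -z)

print(g(5))    # ≈ 0.9933, matching our hand calculation
print(g(-5))   # ≈ 0.0067
print(g(0))    # exactly 0.5
print(g(50))   # ≈ 1 as z heads towards infinity
print(g(-50))  # ≈ 0 as z heads towards -infinity
```

Note that `g(5)` here comes out a hair off our pencil-and-paper 0.993 because python uses the full-precision `math.e` rather than our 2.72 approximation.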
### Bridging the Gap from Python

Let's return to python for a moment:

```python
# load dependencies
from sklearn import linear_model
import numpy as np

# define X and y
X = np.matrix([[66, 35, 75125],[63,22,26750],[74,65,11500],[67,25,135600],[70,55,110500]])
y = [1,1,0,1,0]

# train our model
model = linear_model.LogisticRegression()
model.fit(X,y)

# print our parameters
print("COEF:",model.coef_)
print("INTERCEPT:",model.intercept_)
```

According to the documentation, the above code uses the liblinear package to estimate our parameters (our needed $w$ vector). By default, this relies on coordinate descent, which is an iterative algorithm. This means it must work through a calculation again and again to get to the right answer. Explaining this math in the same level of detail deserves an article of its own. In the meantime I will direct you to Andrew Ng's lecture on the subject from his Coursera class on machine learning.

For now, let's focus on the model itself, and let python optimize the parameters for us.

$$w_1, w_2, w_3$$

```
COEF: [[ 2.12455983e-01 -3.25565862e-01  2.69547674e-07]]
```

$$w_0$$

```
INTERCEPT: [0.00471522]
```

Notice how small these coefficients are (rounded):

$$\vec{w} = \begin{bmatrix} 0.00471522 \\ 0.21245598 \\ -0.3255659 \\ 0.00000027 \end{bmatrix}$$

If we take another peek at the sigmoid graph this makes sense:

If we follow the $z$ input on the above graph, we can see that any value below $-5$, or above $5$, is overkill for the logistic function. These small coefficients take our raw feature input $\vec{x}$ and calculate something more reasonable given this $-5$ to $5$ range.

Let's go back to python and make a prediction based on our first observation:

```python
# make a prediction
X_input = np.matrix([[66,35,75125]])
prediction = model.predict(X_input)
print("PRED:",prediction)

# what is our probability?
prob = model.predict_proba(X_input)
print("PROB:",prob)
```

What does our model think?

```
PRED: [1]
PROB: [[0.06585018 0.93414982]]
```

Our model says this person loves broccoli.
But where did this probability come from?

### Checking Our Work

Recall our model formula:

$$\hat{y}(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p)$$

Substitute the parameters vector $\vec{w}$ with our intercept and coefficients:

$$\hat{y}(\vec{x}) = g(0.00471522 + 0.21245598(x_1) - 0.3255659(x_2) + 0.00000027(x_3))$$

Substitute the features vector $\vec{x}$ with all input from our first observation:

$$\hat{y} = g(0.00471522 + 0.21245598(66) - 0.3255659(35) + 0.00000027(75125))$$

Evaluate:

$$\hat{y} = g(2.65228715)$$

Express the logistic function:

$$\hat{y} = \frac{1}{1+e^{-2.65228715}}$$

Evaluate using the same steps from above:

$$\hat{y} = 0.9341...$$

What we have here is the probability of this person loving broccoli. And because total probability always sums to $1$, we know the probability of NOT loving broccoli is just $1-0.9341... = 0.0659...$.

More formally, we can say:

$$\hat{y}_1 = p(y=1|\vec{w},\vec{x}) = 0.93414982$$

In words: "y-hat one equals the probability that y equals one, given this parameters vector w and features vector x".

And knowing total probabilities always sum to 1, we can deduce:

$$\hat{y}_0 = 1-p(y=1|\vec{w},\vec{x}) = 0.06585018$$

Compare to what python gave us:

```
PROB: [[0.06585018 0.93414982]]
```

Python returns a list of all probabilities in order of class. In other words, class 0 (NOT "love broccoli") has a 0.06585018 probability, and class 1 ("love broccoli") has a 0.93414982 probability. Based on whichever of these probabilities is highest, we can return a prediction in our desired form:

$$y \in \{0, 1\}$$

And thus we turn our original linear model into a model capable of classification, and even better, a model capable of providing a probability estimate for each of those classifications.

## Multiclass Logistic Regression

In the above example, we measured a person's opinion of broccoli with a boolean variable. In the real world, most opinions are far more nuanced than this.
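The hand calculation above is easy to reproduce in numpy. This sketch starts from the rounded parameters sklearn printed for us, so expect it to agree with `predict_proba` only to a few decimal places:

```python
import numpy as np

# parameters printed by sklearn above (rounded)
intercept = 0.00471522
coef = np.array([0.21245598, -0.3255659, 0.00000027])

# our first observation: height, age, income
x = np.array([66, 35, 75125])

# linear part: w0 + w1*x1 + w2*x2 + w3*x3
z = intercept + coef @ x
print(z)  # ≈ 2.65228715

# squash through the logistic function
y_hat = 1 / (1 + np.exp(-z))
print(y_hat)      # ≈ 0.9341, probability of loving broccoli
print(1 - y_hat)  # ≈ 0.0659, probability of NOT loving broccoli
```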
Does logistic regression force us to reduce everything into such simple terms? Nope. Multiclass logistic regression allows us to divide observations into any number of classes (groups) we want.

Imagine we wanted to group our observations into those who love broccoli (1), hate broccoli (0), or are indifferent to broccoli (2).

$$y \in \{0, 1, 2\}$$

Our $\vec{y}$ vector might look like this:

$$\vec{y} = [1,2,2,1,0]$$

Those familiar with Futurama may notice a problem here:

Sadly, in our world, there is such a thing as 2, to say nothing of 3, 4, and 5. What if we defined groups like this?

$$y \in \{0, 1, 2, 3, 4, 5, 6, 7\}$$

Won't these larger numbers mess up all our math from earlier? Let's find out.

### Bridging the Gap from Python

Let's return to python for a moment:

```python
# load dependencies
from sklearn import linear_model
import numpy as np

# define X and y
X = np.matrix([[66, 35, 75125],[63,22,26750],[74,65,11500],[67,25,135600],[70,55,110500]])
y = [1,2,2,1,0]

# train our model
model = linear_model.LogisticRegression()
model.fit(X,y)

# print our parameters
print("COEF:",model.coef_)
print("INTERCEPT:",model.intercept_)
```

This code is identical to the ($0, 1$) example, except we have changed $\vec{y}$ to include 2's. What do we get when we run it?

```
COEF: [[-3.14850877e-01  2.89620012e-01  7.96234349e-05]
 [ 8.87203204e-02 -7.16339822e-01  2.78939204e-04]
 [ 1.03631441e-01  4.03295133e-02 -1.64381858e-04]]
INTERCEPT: [-0.00533213  0.00357773  0.00162194]
```

Now we have 3 sets of coefficients, and 3 intercepts. What is going on here?

Let's predict for our first observation (still labeled as a 1, "love broccoli"):

```python
# make a prediction
X_input = np.matrix([[66,35,75125]])
prediction = model.predict(X_input)
print("PRED:",prediction)

# what is our probability?
prob = model.predict_proba(X_input)
print("PROB:",prob)
```

This output looks more familiar:

```
PRED: [1]
PROB: [[0.01062205 0.97071551 0.01866244]]
```

We predict this observation has a 0.01062205 probability for class 0 ("hate broccoli"), a 0.97071551 probability for class 1 ("love broccoli"), and a 0.01866244 probability for class 2 ("indifferent to broccoli"). Our final prediction is that they belong to group 1 ("love broccoli").

But where are all these numbers coming from? As above, we will leave estimating the parameters to python (for the same reasons), and focus on the model.

### One v. Rest

The key to multiclass logistic regression lies in one v. rest, sometimes called one v. all. The basic idea is this: instead of training one model, we just trained three. Each uses the same formula as the boolean ($0, 1$) version:

$$\hat{y}(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_px_p)$$

Originally we were asking:

$$\text{Does this belong to group 0, 1 or 2?}$$

We have altered this into three yes or no (boolean) questions instead:

$$\text{Does this belong to group 0?}$$

$$\text{Does this belong to group 1?}$$

$$\text{Does this belong to group 2?}$$

By rephrasing our question into something boolean, we can reuse our prior work. If you would prefer a more visual explanation, I will refer you to Andrew Ng's lecture on the subject.

### Checking Our Work

Here are our three logistic regression models.
Model indexes ($\hat{y}_0, \hat{y}_1, \hat{y}_2$) are matched to group for clarity:

$$\hat{y}_0(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3)$$

$$\hat{y}_1(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3)$$

$$\hat{y}_2(\vec{w},\vec{x}) = g(w_0 + w_1x_1 + w_2x_2 + w_3x_3)$$

The parameters vector $\vec{w}$ with intercept and coefficients comes from the python output above:

```
COEF: [[-3.14850877e-01  2.89620012e-01  7.96234349e-05]
 [ 8.87203204e-02 -7.16339822e-01  2.78939204e-04]
 [ 1.03631441e-01  4.03295133e-02 -1.64381858e-04]]
INTERCEPT: [-0.00533213  0.00357773  0.00162194]
```

Substitute into the corresponding three models:

$$\hat{y}_0(\vec{x}) = g(-0.00533213 - 0.314850877(x_1) + 0.28962001(x_2) + 0.00007962(x_3))$$

$$\hat{y}_1(\vec{x}) = g(0.00357773 + 0.088720320(x_1) - 0.71633982(x_2) + 0.00027894(x_3))$$

$$\hat{y}_2(\vec{x}) = g(0.00162194 + 0.103631441(x_1) + 0.04032951(x_2) - 0.00016438(x_3))$$

Substitute the features vector $\vec{x}$ with all input from our first observation. Note that all three models get the same $\vec{x}$ input.

$$\hat{y}_0 = g(-0.00533213 - 0.314850877(66) + 0.28962001(35) + 0.00007962(75125))$$

$$\hat{y}_1 = g(0.00357773 + 0.088720320(66) - 0.71633982(35) + 0.00027894(75125))$$

$$\hat{y}_2 = g(0.00162194 + 0.103631441(66) + 0.04032951(35) - 0.00016438(75125))$$

Evaluate:

$$\hat{y}_0 = g(-4.66733716)$$

$$\hat{y}_1 = g(1.74259265)$$

$$\hat{y}_2 = g(-4.09621760)$$

Express the logistic function:

$$\hat{y}_0 = \frac{1}{1+e^{-(-4.66733716)}}$$

$$\hat{y}_1 = \frac{1}{1+e^{-1.74259265}}$$

$$\hat{y}_2 = \frac{1}{1+e^{-(-4.09621760)}}$$

Double negatives become positive:

$$\hat{y}_0 = \frac{1}{1+e^{4.66733716}}$$

$$\hat{y}_1 = \frac{1}{1+e^{-1.74259265}}$$

$$\hat{y}_2 = \frac{1}{1+e^{4.09621760}}$$

Evaluate using the same steps from above:

$$\hat{y}_0 = 0.00931...$$

$$\hat{y}_1 = 0.85102...$$

$$\hat{y}_2 = 0.01636...$$

But wait!
These numbers do not match the python output we got:

```
PRED: [1]
PROB: [[0.01062205 0.97071551 0.01866244]]
```

What is going on here? We have two different $\hat{y}_1$'s. This is confusing, so let's use $\vec{\hat{a}}$ to hold the probabilities we got from the math we just did ourselves, and $\vec{\hat{b}}$ to hold the probabilities we got from python:

$$\hat{a}_0 = 0.00931... \hspace{1cm} \hat{b}_0 = 0.01062...$$

$$\hat{a}_1 = 0.85102... \hspace{1cm} \hat{b}_1 = 0.97072...$$

$$\hat{a}_2 = 0.01636... \hspace{1cm} \hat{b}_2 = 0.01866...$$

We can see that the class 1 values are the biggest in both sets, but $\hat{a}_1$ and $\hat{b}_1$ are answering slightly different questions.

$$\hat{a}_0 = \text{Does this belong to group 0?}$$

$$\hat{a}_1 = \text{Does this belong to group 1?}$$

$$\hat{a}_2 = \text{Does this belong to group 2?}$$

$$\vec{\hat{b}} = \text{Does this belong to group 0, 1 or 2?}$$

In other words, when we look at the $\hat{y}_1(\vec{w},\vec{x})$ model by itself, we are asking "$\text{Does this belong to group 1?}$" and find there is a 0.85102 (85%) probability that the answer is "yes". This is the number $\hat{a}_1$.

When we look at all three $\hat{y}(\vec{w},\vec{x})$ models together, we are asking "$\text{Does this belong to group 0, 1 or 2?}$" and find there is a 0.97072 (97%) probability that the answer is "group 1". This is the number $\hat{b}_1$.

This raises the question: how do we get from $\vec{\hat{a}}$ to $\vec{\hat{b}}$?

$$\hat{b}_1 = \frac{\hat{a}_1}{\sum\vec{\hat{a}}}$$

Or if you prefer:

$$\hat{b}_1 = \frac{\hat{a}_1}{\sum_{i=0}^2 \hat{a}_i}$$

In order to get the bottom term of this fraction, we just sum all the elements of our $\vec{\hat{a}}$ vector:

$$\hat{b}_1 = \frac{0.85102}{(0.00931+0.85102+0.01636)}$$

By now $\sum$ should feel like a familiar friend.

$$\hat{b}_1 = \frac{0.85102}{0.87669}$$

Easy peasy.

$$\hat{b}_1 \approx 0.97$$
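The whole one v. rest check — three linear scores, three logistic squashes, one normalization — fits in a few lines of numpy. This sketch starts from the rounded coefficients sklearn printed above, so the final probabilities agree with `predict_proba` only up to that rounding:

```python
import numpy as np

# parameters printed by sklearn above: one row / intercept per class
coef = np.array([[-3.14850877e-01,  2.89620012e-01,  7.96234349e-05],
                 [ 8.87203204e-02, -7.16339822e-01,  2.78939204e-04],
                 [ 1.03631441e-01,  4.03295133e-02, -1.64381858e-04]])
intercept = np.array([-0.00533213, 0.00357773, 0.00162194])

# our first observation: height, age, income
x = np.array([66, 35, 75125])

# one v. rest: evaluate all three linear models at once
z = coef @ x + intercept

# squash each score through the logistic function (the a-hat vector)
a = 1 / (1 + np.exp(-z))

# normalize so the three class probabilities sum to 1 (the b-hat vector)
b = a / a.sum()

print(b)          # close to sklearn's [0.0106 0.9707 0.0187]
print(b.argmax()) # predicted class: 1, "love broccoli"
```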

## Conclusion

Assuming you haven't fallen into a math-induced coma, I hope you have found this article both useful and informative. If nothing else, I'd like to think this article helped a few fellow mathphobics overcome the fear of notation.

Of course, we've only just scratched the surface. Both LinearRegression() and LogisticRegression() can be configured with many different cost functions and parameter-estimation strategies. We've yet to go into detail on any iterative methods, such as coordinate descent or stochastic gradient descent. Given enough time, I could easily write another 10,000 pages on this. If there is significant interest in what I've done here, maybe I will.

In the meantime, the next time you find yourself baffled by an intimidating white paper, just remember this: no matter how scary an algorithm may appear at first, underneath is just our old pal math. You've already gone from algebra through calculus in this article alone. Why stop here?

And never be afraid to ask a stupid question.

It is far more shameful to remain stupid than to admit you have more to learn.