Linear Regression

Discussion

introduction

Given a pile of data …

$x_1, x_2, x_3, \ldots, x_n$ and $y_1, y_2, y_3, \ldots, y_n$

Find the line of best fit.

$$y = mx + b$$

We are shooting for a minimal amount of error as measured by the sum of the squares of the residuals. This method is called a least squares fit and is probably the most common way to determine a line of best fit.

$$R^2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} \left[(mx_i + b) - y_i\right]^2 = \text{minimum}$$
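As a sanity check, here is a minimal Python sketch that evaluates this sum of squared residuals for a candidate slope and intercept. The function name and sample data are my own invention, not part of the original discussion.

```python
def sum_sq_residuals(x, y, m, b):
    """Sum of the squares of the residuals, R^2, for the line y = m*x + b."""
    return sum(((m * xi + b) - yi) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, roughly on the line y = 2x, so a slope of 2 and an
# intercept of 0 should give a small (but nonzero) R^2.
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
print(sum_sq_residuals(x, y, 2.0, 0.0))  # 0.1
print(sum_sq_residuals(x, y, 1.0, 0.0))  # 29.5, much worse guess
```

The least squares fit is whichever pair (m, b) makes this number as small as possible.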

This minimum occurs where the partial derivatives of R² with respect to m and b are both zero. (The limits on the summations will be omitted out of laziness from now on.)

$$\frac{\partial}{\partial m}\,R^2 = 2\sum \left[(mx_i + b) - y_i\right] x_i = 0$$

and

$$\frac{\partial}{\partial b}\,R^2 = 2\sum \left[(mx_i + b) - y_i\right] = 0$$
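Dropping the factor of 2 and regrouping the sums turns these two conditions into a pair of simultaneous linear equations in m and b, the so-called normal equations (a standard intermediate step worth making explicit):

$$m\sum x_i^2 + b\sum x_i = \sum x_i y_i$$

and

$$m\sum x_i + bn = \sum y_i$$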

After a bit of algebra, you get these equations …

$$m = \frac{n\sum (x_i y_i) - \sum x_i \sum y_i}{n\sum (x_i^2) - \left(\sum x_i\right)^2}$$

and

$$b = \frac{\sum (x_i^2) \sum y_i - \sum x_i \sum (x_i y_i)}{n\sum (x_i^2) - \left(\sum x_i\right)^2}$$
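Translated directly into Python, with hypothetical names and the same invented data as before, the fit looks something like this:

```python
def fit_line(x, y):
    """Least squares slope and intercept from the raw-sum formulas above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    d = n * sxx - sx * sx  # the denominator shared by m and b
    m = (n * sxy - sx * sy) / d
    b = (sxx * sy - sx * sxy) / d
    return m, b

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
print(fit_line(x, y))  # (1.94, 0.15): slope near 2, intercept near 0
```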

and after a bit more algebra, these equations, where x̄ and ȳ are the means of the x and y values …

$$m = \frac{\sum (x_i y_i) - n\bar{x}\bar{y}}{\sum (x_i^2) - n\bar{x}^2}$$

and

$$b = \frac{\bar{y}\sum (x_i^2) - \bar{x}\sum (x_i y_i)}{\sum (x_i^2) - n\bar{x}^2}$$
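These are the same fractions as before with n divided out of the numerator and denominator. A quick sketch (again with names and data of my choosing) confirms the two forms agree:

```python
def fit_line_means(x, y):
    """Least squares slope and intercept from the mean-based formulas."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    m = (sxy - n * xbar * ybar) / (sxx - n * xbar ** 2)
    b = (ybar * sxx - xbar * sxy) / (sxx - n * xbar ** 2)
    return m, b

# Same data as before: both forms return (1.94, 0.15).
print(fit_line_means([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))
```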

How good is the line of best fit? Are some bests better than others? Here's one way to decide. Swap the explanatory and response variables.

$$x = m'y + b'$$

The slope of this new linear equation is the same as the old one with all the x's replaced by y's and vice versa. (Note that the numerator hasn't really changed.)

$$m' = \frac{n\sum (y_i x_i) - \sum y_i \sum x_i}{n\sum (y_i^2) - \left(\sum y_i\right)^2}$$

Now, multiply this slope by the old slope. Don't ask why, just do it.

$$m\,m' = \left(\frac{n\sum (x_i y_i) - \sum x_i \sum y_i}{n\sum (x_i^2) - \left(\sum x_i\right)^2}\right) \left(\frac{n\sum (y_i x_i) - \sum y_i \sum x_i}{n\sum (y_i^2) - \left(\sum y_i\right)^2}\right)$$
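Numerically, the swapped fit is just the original fit with its arguments reversed. Reusing the hypothetical fit_line and data from the earlier sketch, the product comes out just under one:

```python
m, _ = fit_line(x, y)        # fit y on x
m_prime, _ = fit_line(y, x)  # fit x on y, the swapped regression
print(m * m_prime)           # about 0.996 for the sample data
```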

This product is known as the coefficient of determination

$$r^2 = \frac{\left(n\sum (x_i y_i) - \sum x_i \sum y_i\right)^2}{\left(n\sum (x_i^2) - \left(\sum x_i\right)^2\right)\left(n\sum (y_i^2) - \left(\sum y_i\right)^2\right)}$$

and its square root (which takes its sign from the slope) is called the coefficient of correlation.

$$r = \frac{n\sum (x_i y_i) - \sum x_i \sum y_i}{\sqrt{n\sum (x_i^2) - \left(\sum x_i\right)^2}\,\sqrt{n\sum (y_i^2) - \left(\sum y_i\right)^2}}$$
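Here is one last Python sketch of this formula (math is the standard library module; the function name and data are again mine). For the sample data, r² matches the product m·m' computed above:

```python
import math

def correlation(x, y):
    """Pearson's coefficient of correlation, r, from the formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    )

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
r = correlation(x, y)
print(r, r ** 2)  # about 0.998 and 0.996; r**2 equals m * m'
```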