Linear Regression
Discussion
the concepts
Keywords: linear relationship, linearly related, linear regression, line of best fit, best fit line, least squares fit, coefficient of determination, coefficient of correlation, … ?
When two quantities are directly proportional or directly related…
y ∝ x
…their ratio is a constant.
y / x = a constant
When two quantities are linearly related, they are not quite directly proportional. It's not their values that are proportional, but the changes in their values.
∆y ∝ ∆x
The ratio of these quantities is a constant that should be familiar to you. On a graph of a straight line this ratio is known as the slope.
∆y / ∆x = slope
The symbol for this ratio is the letter m, probably because m is the first letter in the word slope.
∆y / ∆x = m
A graph of a direct relationship is a straight line that runs through the origin. When x is zero, y is zero. A graph of a linear relationship is a straight line that may or may not run through the origin. When x is zero, y could be zero or it could be something else. A linear relationship is one that is partly direct and partly constant. That constant is known as the y intercept, which is indicated using the brilliantly chosen letter b. Altogether as an equation…
y = mx + b
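Here's a quick contrast of the two ideas as a Python sketch (not part of the original discussion; the function names and the numbers 2.5 and 4.0 are invented purely for illustration).

```python
# Direct vs. linear: a direct relationship passes through the origin,
# a linear relationship may not.

def direct(x, constant=2.5):
    """y is directly proportional to x, so y/x is always the same constant."""
    return constant * x

def linear(x, m=2.5, b=4.0):
    """y is linearly related to x, so the ratio of changes ∆y/∆x is constant."""
    return m * x + b

for x in (0, 1, 2, 3):
    print(x, direct(x), linear(x))   # at x = 0 the linear value is b, not 0
```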
the mathematics
y = mx + b
Given a pile of n points of two-dimensional data…
x1, x2, x3, … xn
y1, y2, y3, … yn
Find an equation for the line of best fit.
y = mx + b
We are shooting for a minimal amount of error in the residuals, the distances from the data points to the line measured in the y direction. If we minimize the sum of the squares of the residuals (identified here with the symbol R²), the method is called a least squares fit. It is probably the most common way to compute a best fit line.
R² = ∑ (∆yi)²

R² = ∑ [(mxi + b) − yi]²

where the sums run from i = 1 to n.
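Before minimizing anything, it may help to see R² as something you can actually compute. Here's a sketch in Python; the data points and the two trial lines are invented just to have numbers to try.

```python
# A sketch of the sum of squared residuals for a candidate line y = m·x + b.

def sum_squared_residuals(m, b, xs, ys):
    """R² = ∑[(m·xi + b) − yi]², the error measured in the y direction."""
    return sum(((m * xi + b) - yi) ** 2 for xi, yi in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]          # hypothetical data
ys = [2.2, 3.9, 6.1, 8.0]
print(sum_squared_residuals(2.0, 0.0, xs, ys))  # one trial line
print(sum_squared_residuals(1.9, 0.2, xs, ys))  # a slightly different trial line
```

The least squares fit is the (m, b) pair that makes this number as small as possible.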
This expression has its minimum where the partial derivatives with respect to m and b are both zero. (The limits on the summations will be omitted out of laziness from now on.)
∂R²/∂m = 2 ∑ {[(mxi + b) − yi] xi} = 0

∂R²/∂b = 2 ∑ [(mxi + b) − yi] = 0
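If you'd rather have a computer do the calculus, here's one way it could go with SymPy (a third-party symbolic math library that I'm assuming is installed). The tiny data set is made up; the point is just that setting both partial derivatives to zero and solving gives the best fit slope and intercept.

```python
# A sketch: minimize R² symbolically by setting both partial derivatives to zero.
import sympy as sp

m, b = sp.symbols('m b')
xs = [1, 2, 3]                      # hypothetical data
ys = [2.1, 3.9, 6.2]

R2 = sum(((m * x + b) - y) ** 2 for x, y in zip(xs, ys))   # sum of squared residuals
solution = sp.solve([sp.diff(R2, m), sp.diff(R2, b)], [m, b])
print(solution)                     # the (m, b) pair that minimizes R²
```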
After a bit of algebra, you get these equations…
m = [n ∑(xiyi) − ∑xi ∑yi] / [n ∑(xi²) − (∑xi)²]

b = [∑(xi²) ∑yi − ∑xi ∑(xiyi)] / [n ∑(xi²) − (∑xi)²]
Where…
n = number of data points
∑xi = sum of the x values
∑yi = sum of the y values
∑(xi²) = sum of the x² values
∑(xiyi) = sum of the xy products
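In code, those formulas might look something like this. It's just a sketch; the function and variable names are my own choices, not anything standard.

```python
# A sketch of the closed-form least squares formulas above.

def least_squares_fit(xs, ys):
    """Return the slope m and intercept b of the line of best fit."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator
    return m, b
```

Call it with two equal-length lists of numbers and it hands back (m, b).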
m and b sample calculation
Here's a small data set. Determine its line of best fit the hard way. Pretend you don't have a calculator or spreadsheet app that can figure this out in one step. Pretend you had to actually use the equations given in this section.
x | y |
---|---|
10 | 08.04 |
08 | 06.95 |
13 | 07.58 |
09 | 08.81 |
11 | 08.33 |
14 | 09.96 |
06 | 07.24 |
04 | 04.26 |
12 | 10.84 |
07 | 04.82 |
05 | 05.68 |
Add columns for x² and xy and compute all those values.
x | y | x² | xy |
---|---|---|---|
10 | 08.04 | 100 | 080.40 |
08 | 06.95 | 064 | 055.60 |
13 | 07.58 | 169 | 098.54 |
09 | 08.81 | 081 | 079.29 |
11 | 08.33 | 121 | 091.63 |
14 | 09.96 | 196 | 139.44 |
06 | 07.24 | 036 | 043.44 |
04 | 04.26 | 016 | 017.04 |
12 | 10.84 | 144 | 130.08 |
07 | 04.82 | 049 | 033.74 |
05 | 05.68 | 025 | 028.40 |
Total up each column.
x | y | x² | xy |
---|---|---|---|
10 | 08.04 | 100 | 080.40 |
08 | 06.95 | 064 | 055.60 |
13 | 07.58 | 169 | 098.54 |
09 | 08.81 | 081 | 079.29 |
11 | 08.33 | 121 | 091.63 |
14 | 09.96 | 196 | 139.44 |
06 | 07.24 | 036 | 043.44 |
04 | 04.26 | 016 | 017.04 |
12 | 10.84 | 144 | 130.08 |
07 | 04.82 | 049 | 033.74 |
05 | 05.68 | 025 | 028.40 |
∑x | ∑y | ∑x² | ∑xy |
99 | 82.51 | 1001 | 797.60 |
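If you'd rather trust a computer with the bookkeeping, the totals can be reproduced in a few lines of Python. The lists below are just the table transcribed.

```python
# Reproduce the column totals from the raw data.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# rounding hides floating point noise in the decimal totals
print(n, sum_x, round(sum_y, 2), sum_x2, round(sum_xy, 2))   # 11 99 82.51 1001 797.6
```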
Substitute numbers into equations and be done: first the slope m…
m = [n ∑xy − ∑x ∑y] / [n ∑x² − (∑x)²]
m = [(11)(797.60) − (99)(82.51)] / [(11)(1001) − (99)²]
m = (8773.60 − 8168.49) / (11011 − 9801)
m = 605.11 / 1210
m ≈ 0.500
and then the y intercept b.
b = [∑x² ∑y − ∑x ∑xy] / [n ∑x² − (∑x)²]
b = [(1001)(82.51) − (99)(797.60)] / [(11)(1001) − (99)²]
b = (82592.51 − 78962.40) / (11011 − 9801)
b = 3630.11 / 1210
b ≈ 3.000

So the line of best fit is approximately y = 0.500x + 3.000.
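The same plug-and-chug step, done in Python with the totals from the table (a sketch, nothing more).

```python
# Plug the column totals into the slope and intercept formulas.
n, sum_x, sum_y, sum_x2, sum_xy = 11, 99, 82.51, 1001, 797.60

denominator = n * sum_x2 - sum_x ** 2                    # 1210
m = (n * sum_xy - sum_x * sum_y) / denominator           # ≈ 0.500
b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator      # ≈ 3.000
print(round(m, 3), round(b, 3))                          # 0.5 3.0
```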
the coefficients of determination and correlation
How good is the line of best fit? Are some bests better than others? Here's one way to decide. Swap the explanatory and response variables.
x = m′y + b′
The slope of this new linear equation is just the old one with all the x's replaced by y's and vice versa. (Note that, because multiplication is commutative, the numerator hasn't really changed.)
m′ = [n ∑(yixi) − ∑yi ∑xi] / [n ∑(yi²) − (∑yi)²]
Now, multiply this new slope by the old slope. Don't ask why, just do it.
m m′ = {[n ∑(xiyi) − ∑xi ∑yi] / [n ∑(xi²) − (∑xi)²]} × {[n ∑(yixi) − ∑yi ∑xi] / [n ∑(yi²) − (∑yi)²]}
This product is known as the coefficient of determination
r² = [n ∑(xiyi) − ∑xi ∑yi]² / {[n ∑(xi²) − (∑xi)²] [n ∑(yi²) − (∑yi)²]}
and its square root is called the coefficient of correlation.
r = [n ∑(xiyi) − ∑xi ∑yi] / {√[n ∑(xi²) − (∑xi)²] √[n ∑(yi²) − (∑yi)²]}
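Here's the same correlation formula as one possible Python function, again just a sketch with names chosen by me.

```python
# A sketch of the coefficient of correlation formula above.
from math import sqrt

def correlation_coefficient(xs, ys):
    """Return r for the paired data in xs and ys."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator
```

Squaring the result gives the coefficient of determination r².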
r² and r sample calculation
Continue using the sample data set. Add a column for y² and determine its sum.
x | y | x² | xy | y² |
---|---|---|---|---|
10 | 08.04 | 100 | 080.40 | 064.6416 |
08 | 06.95 | 064 | 055.60 | 048.3025 |
13 | 07.58 | 169 | 098.54 | 057.4564 |
09 | 08.81 | 081 | 079.29 | 077.6161 |
11 | 08.33 | 121 | 091.63 | 069.3889 |
14 | 09.96 | 196 | 139.44 | 099.2016 |
06 | 07.24 | 036 | 043.44 | 052.4176 |
04 | 04.26 | 016 | 017.04 | 018.1476 |
12 | 10.84 | 144 | 130.08 | 117.5056 |
07 | 04.82 | 049 | 033.74 | 023.2324 |
05 | 05.68 | 025 | 028.40 | 032.2624 |
∑x | ∑y | ∑x² | ∑xy | ∑y² |
99 | 82.51 | 1001 | 797.60 | 660.1727 |
Substitute and calculate to get r², the coefficient of determination.
r² = [n ∑xy − ∑x ∑y]² / {[n ∑x² − (∑x)²][n ∑y² − (∑y)²]}
r² = [(11)(797.60) − (99)(82.51)]² / {[(11)(1001) − (99)²][(11)(660.1727) − (82.51)²]}
r² = (605.11)² / [(1210)(453.9996)]
r² = 366158.11 / 549339.52
r² ≈ 0.667
Take the root of this to get r, the coefficient of correlation. Use the positive root since the line of best fit has a positive slope.
r = +√r²
r = +√(0.667)
r = +0.816
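One last cross-check, using nothing but Python's standard library (this assumes Python 3.10 or later, where statistics.linear_regression and statistics.correlation were added).

```python
# Cross-check the hand calculation in one step.
from statistics import correlation, linear_regression

xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

slope, intercept = linear_regression(xs, ys)
r = correlation(xs, ys)
print(round(slope, 3), round(intercept, 3), round(r, 3))   # 0.5 3.0 0.816
```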