
Regression analysis

Badly Shaved Monkey

Can someone briefly explain to me the connection between "regression to the mean" in the sense of extreme excursions tending over generations to return to nearer the mean value and "regression analysis" in the sense of fitting lines through scatter plots? I believe the maths of the former led to the technique for the latter but I don't know the conceptual link between them.

Please correct any misapprehensions as you see fit.
 
Francis Galton measured the heights of both children and their parents. He then plotted the parents' height (the average of the two parents' heights) against the child's height as a scatter plot. He then divided the parent heights into non-overlapping ranges, and calculated the children's mean heights for each range. Then he drew a best fit line through the data. He discovered two things:

1. There was a relationship between the height of the parents and the height of child - tall parents tended to have tall children, and short parents tended to have short children.

2. Despite this relationship, children of very tall parents tended to be slightly shorter than their parents, and similarly, the children of small parents, while small, were not quite as small.

Galton called this second observation "regression to the mean".
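A minimal sketch of the procedure described above, using simulated data rather than Galton's measurements. The mean of 170 cm, the standard deviation of 7 cm and the parent-child correlation of 0.5 are illustrative assumptions, and the function names are hypothetical.

```python
import random
from statistics import mean

def simulate_families(n=20000, r=0.5, mu=170.0, sd=7.0, seed=1):
    """Draw (mid-parent height, child height) pairs with correlation r.
    All parameter values are assumptions chosen for illustration."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        zp = rng.gauss(0, 1)
        # The child shares a fraction r of the parents' deviation, plus independent noise.
        zc = r * zp + (1 - r * r) ** 0.5 * rng.gauss(0, 1)
        pairs.append((mu + sd * zp, mu + sd * zc))
    return pairs

def least_squares_slope(pairs):
    """Slope of the best-fit line of child height on parent height."""
    xs = [p for p, _ in pairs]
    ys = [c for _, c in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def binned_child_means(pairs, width=5.0):
    """Mean child height within each non-overlapping parent-height range."""
    bins = {}
    for p, c in pairs:
        bins.setdefault(width * (p // width), []).append(c)
    return {lo: mean(cs) for lo, cs in sorted(bins.items())}

pairs = simulate_families()
slope = least_squares_slope(pairs)
tall_parents = mean([p for p, _ in pairs if p > 180])
their_children = mean([c for p, c in pairs if p > 180])

print("slope of child on parent:", round(slope, 3))
print("mean height of tall parents:", round(tall_parents, 1))
print("mean height of their children:", round(their_children, 1))
for lo, child_mean in binned_child_means(pairs).items():
    print(f"parents {lo:.0f}-{lo + 5:.0f} cm -> children average {child_mean:.1f} cm")
```

The fitted slope comes out positive but below 1, which is both observations at once: tall parents have tall children (point 1), yet those children sit closer to the overall mean than their parents do (point 2).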
 
I'm no mathematician but isn't it basically just a case of what period you draw your average (or mean) over - for both examples?

Regression to the mean is implicit since your mean always includes all known data.
 
It's purely a statistical artefact - Galton thought he had discovered something to do with heredity, but if he had measured the relationship in the other direction, he would have seen the same effect.
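One way to see why the direction doesn't matter is the standard least-squares result written in standard units. The symbols here are my own notation, not anything from Galton: $z_x = (x - \bar{x})/s_x$ and $z_y = (y - \bar{y})/s_y$ are the two variables measured in standard deviations from their means, and $r$ is their correlation.

```latex
\hat{z}_y = r\, z_x
\qquad\text{and}\qquad
\hat{z}_x = r\, z_y ,
\qquad |r| \le 1 .
```

Whenever the correlation is less than perfect ($|r| < 1$), the predicted value is closer to its mean than the predictor is to its own, whichever variable you start from. The slope of the fitted line is the amount of "regression", which is one plausible reading of how the line-fitting technique inherited the name.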
 
There are lots of other cases of regression to the mean.

Perhaps you have a chronic illness (e.g. arthritis) that gets better and worse. It finally gets so bad that you go to a doctor or a chiropractor, or take some (fake or real) cure. In any case, it will probably get better, and the average person will credit the doctor, chiropractor or cure. In reality, the symptoms were at their worst and would have regressed to the mean (got better) regardless of what you did. The old joke is that you can take aspirin and get better in two weeks, or do nothing and get better in 14 days.

Perhaps you are an athlete who has just had your best performance ever. It could be that this was a lucky occurrence never to be repeated, and you will regress to the mean next race/game/season. It could also be that you have been practicing more and are getting better.

CBL
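The athlete example can be sketched as a small simulation. It assumes, purely for illustration, that each observed performance is a fixed skill plus independent luck of equal size; the variable names are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 10000

# Assumed model: observed performance = stable skill + independent luck.
skill = [rng.gauss(0, 1) for _ in range(n)]
season1 = [s + rng.gauss(0, 1) for s in skill]
season2 = [s + rng.gauss(0, 1) for s in skill]

# Pick out the athletes who were in the top 10% in season 1.
cutoff = sorted(season1)[int(0.9 * n)]
top = [i for i in range(n) if season1[i] >= cutoff]

top_s1 = mean([season1[i] for i in top])
top_s2 = mean([season2[i] for i in top])
overall_s2 = mean(season2)

print("top group, season 1:", round(top_s1, 2))
print("same group, season 2:", round(top_s2, 2))
print("everyone, season 2:", round(overall_s2, 2))
```

Nobody's skill changes between the two seasons here, yet the season-1 stars score lower on average in season 2 while still beating the field. Selecting on an extreme result selects partly on luck, and the luck does not repeat.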
 
Hence, the "Sports Illustrated Jinx", where athletes refuse to appear on the cover.
(Thanks to Thomas Gilovich, from How We Know What Isn't So)
 
I always thought "regression to the mean" was when trolls run out of arguments and start name-calling...
 
Look up stuff like the Stein shrinkage estimator. The basic idea is that you do better overall at predicting individual scores by weighting naive estimates towards the group mean. I just saw a low-level talk by Bradley Efron illustrating this (it was a general talk on empirical Bayes) with baseball hit averages (or whatever the hits/turns-at-bat ratio is called). For example, consider predicting the end-of-season hit average of a set of players. You can take each player's early-season average, shrink the scores towards the current group mean (there's a formula, obviously), and then when you compare at the end of the season, your predicted scores will be closer to the truth than if you had just used each player's early-season average. The paradox is that the group mean seems to affect individual performance, when that obviously isn't the case. In a way it's a compensator for having only a partial or small amount of data, and yes, it's mathematically justified...
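A rough sketch of the shrinkage idea on simulated players. This is not the data or the exact formula from Efron's talk: the 200 players, the 45 early-season at-bats, the spread of true averages and the positive-part James-Stein-style factor with a binomial variance approximation are all assumptions made for illustration. The comparison is against each simulated player's true average, standing in for the end-of-season figure.

```python
import random
from statistics import mean

rng = random.Random(7)
players, at_bats = 200, 45  # assumed sizes, for illustration only

# Assumed true long-run averages, clipped to a plausible range.
true_avg = [min(max(rng.gauss(0.265, 0.03), 0.15), 0.40) for _ in range(players)]

# Early-season average: hits out of a small number of at-bats.
early = [sum(rng.random() < p for _ in range(at_bats)) / at_bats for p in true_avg]

grand = mean(early)
# Approximate sampling variance of one early-season average (binomial).
sampling_var = grand * (1 - grand) / at_bats
spread = sum((y - grand) ** 2 for y in early)

# Positive-part James-Stein-style factor: the fraction of each player's
# deviation from the group mean that is kept.
keep = max(0.0, 1 - (players - 3) * sampling_var / spread)
shrunk = [grand + keep * (y - grand) for y in early]

def sq_error(estimates):
    """Total squared error against the simulated true averages."""
    return sum((e - t) ** 2 for e, t in zip(estimates, true_avg))

naive_err = sq_error(early)
shrunk_err = sq_error(shrunk)

print("fraction of deviation kept:", round(keep, 2))
print("squared error, early-season averages:", round(naive_err, 4))
print("squared error, shrunk estimates:", round(shrunk_err, 4))
```

With so few at-bats, most of the spread in early-season averages is noise, so the factor keeps only a small part of each player's deviation from the group mean. The group mean is not influencing any individual; it is just the best available guess at where the noise pushed each player from.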
 
So, is JamesM's answer really complete? It seems a bit back to front that the "regression line" was drawn then the concept of regression to the mean was noticed. What I mean is that the word "regression" is an odd one to choose: what is regressing? Regression to the mean describes regression of one variable over time. A regression line plots a slope of one variable against another in contemporaneously collected data sets.

It's a semantic question rather than a maths question. Why was the word regression chosen to describe the line-fitting exercise?
 
Can someone briefly explain to me the connection between "regression to the mean" in the sense of extreme excursions tending over generations to return to nearer the mean value and "regression analysis" in the sense of fitting lines through scatter plots? I believe the maths of the former led to the technique for the latter but I don't know the conceptual link between them.

There really isn't that much of a conceptual link. It's one of those things from history that I think is kind of confusing.
 
I hadn't thought to put Galton into Google with regression. That's been rewarding:

http://www.jmp.com/news/jmpercable/06_summer1998/regression.html

"He thought he had made a discovery when he found that the heights of the children tended to be more moderate than the heights of their parents. For example, if parents were very tall the children tended to be tall but shorter than their parents. If parents were very short the children tended to be short but taller than their parents were. This discovery he called "regression to the mean," with the word "regression" meaning to come back to.

However, Galton's original regression concept considered the variance of both variables, as does orthogonal regression, which is discussed later. Unfortunately, the word "regression" later became synonymous with the least squares method, which assumes the X values are fixed."

This at least makes it clear that it is poor terminology even though it is probably now unshiftably embedded in our jargon.
 
