But I Regress…

Do you know that thing that statisticians do called regression analysis? It’s when they look at two (or more) numbers to determine how closely correlated they are. To use a couple of examples I’ve seen recently, education is correlated with health and the presence of a Led Zeppelin bumper sticker is correlated with the likelihood of that vehicle containing a controlled substance like marijuana. I first learned regression analysis back in the days when you had to compute it by hand; now all you need is a computer with Excel. It’s a neat tool, perhaps a bit too easy to use for some.

But something’s always bugged me: why is it called regression analysis? Why isn’t it called correlation analysis? I mean, when you run a regression analysis, the main output is the correlation between the variables, right? So why is it called regression? Huh? Haven’t you wondered the same thing? Even once?

Okay, perhaps you’re not as geeky as I am. But you’ll be happy to know that I think I found the answer while reading a biography of the guy who invented regression analysis, Sir Francis Galton.

Galton was an amazing, quirky guy; one of those classic Victorian gentlemen with lots of time on their hands and lots of things to discover. He traveled the Nile and explored parts of Africa that hadn’t been seen by white men before. He published a book on survival in the wild, parts of which are still included in survival guides. He invented some silly things (one of my favorites: the gumption-reviver machine, which simply dripped water on you until you were thoroughly soaked) and some very important things (weather maps; the system for categorizing fingerprints still used today). Most of all, he counted things.

Galton was an obsessive counter. He determined a precise formula for preparing the perfect cup of tea. He counted beautiful women in different parts of England to deduce his own “beauty map.” And when his cousin, Charles Darwin, invented a little something called evolution, he threw himself into the task of counting hereditary traits.

He was convinced that things like criminal behavior, intelligence and genius were linked to heredity. His beliefs stood in contrast to those of many of his critics, who also cited environment. In fact, it was Galton who coined the phrase “nature/nurture” to describe the argument. Along the way, he decided the best thing to do would be to collect statistics on people and measure them. So he set up shop in a Public Health exhibition and asked people if they would like to be measured (height, armspan, breathing capacity, eyesight, etc.). After a year, he had collected measurements on over 10,000 people.

Statistics was still in its infancy, and Galton certainly didn’t have a computer back then. But he decided to analyze these numbers as best he could. He took the heights of 205 sets of adults and their children and (much to my delight) laid them out in a scatterplot graph. He saw that the points moved together: the taller the parents, the taller the children. However, the points didn’t line up perfectly.

So he drew a line that seemed to best fit the relationship between the points, and measured its slope. The result was two-thirds. As Galton thought it through, he realized that children were, on average, only two-thirds as “extreme” as their parents: parents three inches taller than average tended to have children about two inches taller than average. He called the remaining one-third “regression.” Actually, he called it “regression to mediocrity,” which we have modified to regression to the mean.
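To make the slope idea concrete, here is a minimal sketch that fits a least-squares line to made-up parent and child heights. The data are synthetic, not Galton's measurements; the average height, the spreads and the function name `least_squares_slope` are all assumptions chosen for illustration. Galton also fitted his line by eye rather than by least squares, so treat this as a modern stand-in for what he did.

```python
import random


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic data (assumption): children keep two-thirds of their parents'
# deviation from the average height, plus some random noise.
rng = random.Random(0)
MEAN_HEIGHT = 68.0  # inches; illustrative value only
TRUE_SLOPE = 2 / 3  # the slope Galton reported

parents = [rng.gauss(MEAN_HEIGHT, 2.0) for _ in range(5000)]
children = [
    MEAN_HEIGHT + TRUE_SLOPE * (p - MEAN_HEIGHT) + rng.gauss(0, 1.5)
    for p in parents
]

slope = least_squares_slope(parents, children)
print(f"fitted slope: {slope:.2f}")
```

The fitted slope should land close to two-thirds. A slope below one is regression to the mean in a single number: the children of tall parents are still tall, just less so.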

This was actually a blow to Galton, who wanted to believe that heredity was absolute. But it was a huge step forward for the field of statistics. Galton went on to refine his technique, developing correlation coefficients and lots of other things. But the very first thing he noticed, the thing that the graph showed him, was regression. And that’s why we call it regression analysis. I think.

Regression to the mean is everywhere in baseball. Sophomore slump? Regression to the mean. Seattle’s 93-69 record after going 116-46 in 2001? Regression to the mean. Luke Scott’s Slugging Average in 2007? Regression to the mean.

Let me show you another graph. This graph plots batting average in 2005 and 2006. What I’ve done is to split up the 2005 batters into quartiles, and then plotted how those same batters performed in 2006. I used a minimum of 300 at bats in 2005 and included the player in both years if he played in 2006 at all. This is what regression to the mean looks like:

image

As you can see, each one of the four quartiles moves closer to the average (that gray line) in 2006. The first quartile of batters batted .305 in 2005 and .294 in 2006. The lowest quartile batted .245 in 2005 and .263 in 2006. Each group moved closer to the mean.

There is probably some selection bias in that lower quartile. The worst batters played less in 2006, which skews the overall results higher. So regression to the mean isn’t quite as strong as it appears in that lower quartile, but it’s still pretty strong.
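The same quartile pattern shows up in a simple simulation, sketched below. Everything here is an assumption rather than real data: 400 made-up players, a fixed "true talent" for each drawn around .270, and 500 at bats per season. Because every simulated player bats in both years, there is no selection bias, so any movement toward the middle comes purely from luck in the first year.

```python
import random
from statistics import mean

rng = random.Random(1)
AT_BATS = 500  # at bats per simulated season (assumption)


def season_average(talent):
    """Simulate one season: each at bat is a hit with probability `talent`."""
    hits = sum(rng.random() < talent for _ in range(AT_BATS))
    return hits / AT_BATS


# Hypothetical players whose true talent does not change between years.
talents = [rng.gauss(0.270, 0.020) for _ in range(400)]
year1 = [season_average(t) for t in talents]
year2 = [season_average(t) for t in talents]

# Sort players by their year-one average and cut them into four equal groups.
order = sorted(range(len(talents)), key=lambda i: year1[i], reverse=True)
size = len(order) // 4
quartiles = [order[q * size:(q + 1) * size] for q in range(4)]

results = []
for number, group in enumerate(quartiles, start=1):
    before = mean(year1[i] for i in group)
    after = mean(year2[i] for i in group)
    results.append((before, after))
    print(f"quartile {number}: {before:.3f} -> {after:.3f}")
```

The top group should come down and the bottom group should come up, even though no simulated player got better or worse.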

What we’re really after is understanding the difference between a player’s “true talent” and the overall league average. The problem is that one year isn’t enough data to establish a player’s true talent. So let’s see what happens when we include two years’ batting average (2004 and 2005) in the initial quartiles:

image

If you compare the two graphs, you’ll see that the lines aren’t as steep when you have two years’ worth of data to begin with. In this case, the first quartile moved from .303 in 2004/05 to .295 in 2006, a little less than the one-year sample. The bottom quartile migrated from .252 to .262, a lot less than the one-year sample. If you have more years in your baseline, there is less regression to the mean.
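One way to see why a longer baseline regresses less is a back-of-the-envelope calculation, sketched here under a deliberately simple model: true talent is fixed, luck is pure binomial noise, and every season has the same number of at bats. The function name `share_kept` and all of its default values are illustrative assumptions, not estimates from real data.

```python
def share_kept(seasons, talent_sd=0.020, at_bats=500, league_avg=0.270):
    """Fraction of a group's gap from the league average expected to persist.

    All defaults are illustrative assumptions, not fitted to real data.
    """
    talent_var = talent_sd ** 2
    # Binomial luck in a single season's batting average.
    noise_var = league_avg * (1 - league_avg) / at_bats
    # Averaging over more seasons shrinks the luck term.
    return talent_var / (talent_var + noise_var / seasons)


for seasons in (1, 2, 3):
    kept = share_kept(seasons)
    print(f"{seasons} season(s): keep {kept:.0%} of the gap, regress {1 - kept:.0%}")
```

Under these assumptions the share of the gap that survives grows with every season added to the baseline, which matches the flatter lines in the two-year graph.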

Why do I bring this up now? Because lots of people are producing forecasts for the 2007 season, and one of the first things every decent projection system will do is regress a player’s performance to the mean. In fact, there is one system that does nothing other than regress each player’s performance to the major league average as a basis for its 2007 projection. It’s called Marcel, because it’s so simple that even a monkey can do it. (Marcel, from Friends. Get it?)

You can read more about the Marcel system from its current caretaker, Tangotiger. Tango’s specific calculations are laid out in this thread—he essentially takes each player’s previous major league performance and regresses it to the mean. That’s it; no park adjustments, minor league stats or anything like that. The amount to which he regresses each player depends on how long the player has been in the majors. If he’s only been in the majors a year or two, Tango regresses his performance a lot. He also regresses a pitcher’s performance more strongly than a batter’s, because pitchers are typically more random.
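The general idea of regressing more when there is less history can be sketched as a weighted blend of the player's own numbers and the league average. This is not Tango's actual calculation: the function name `regress_to_mean`, the .270 league average and the `ballast` of 1,000 league-average at bats are all hypothetical values chosen for illustration.

```python
def regress_to_mean(observed, at_bats, league_avg=0.270, ballast=1000):
    """Blend a player's batting average with the league average.

    `ballast` is a hypothetical number of league-average at bats mixed in.
    It is an illustrative constant, not a figure taken from Marcel.
    """
    return (observed * at_bats + league_avg * ballast) / (at_bats + ballast)


# Two hypothetical .330 hitters with different amounts of history.
rookie = regress_to_mean(0.330, at_bats=300)
veteran = regress_to_mean(0.330, at_bats=1500)
print(f"rookie:  {rookie:.3f}")
print(f"veteran: {veteran:.3f}")
```

The hitter with less history gets pulled further toward the league average. In a sketch like this, regressing pitchers more strongly than batters would simply mean using a larger `ballast` for them.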

Chone/Sean Smith found that Marcel had a .66 correlation with batters’ actual performance last year. The best correlation he found was PECOTA’s, at .74. Nate Silver of Baseball Prospectus has worked tremendously hard to make PECOTA a cutting-edge system and has succeeded. But even his model only gains a smidgen of accuracy over Marcel. That is the power of simple regression to the mean.

You can download the 2007 Marcel projections from Tango’s site. Just for the heck of it, I downloaded them and compared them to each player’s 2006 performance. Here is a list of the batters who are most likely to see an increase in their batting average, based on Marcel and regression to the mean (minimum at bats in 2006: 300. Minimum batting average in 2006: .240):

Last      First       06BA     mBA    Diff
Gonzalez  Luis A.     .242    .285    .043
Cantu     Jorge       .249    .281    .032
Izturis   Cesar       .245    .276    .031
Ellis     Mark        .249    .278    .029
Mueller   Bill        .252    .279    .027
Duffy     Chris       .255    .281    .026
Kubel     Jason       .241    .266    .026
White     Rondell     .246    .271    .025
Crisp     Coco        .264    .289    .025
Casey     Sean        .272    .296    .024
Lopez     Javy        .251    .276    .024
Peralta   Jhonny      .257    .280    .024

In general, you won’t see many predicted improvements for first- or second-year players, because there’s not enough history to regress to. But Cleveland fans should feel good about seeing Jhonny Peralta on this list.
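If you want to rebuild a list like this from the downloaded projections, the arithmetic is just a subtraction and a sort. The sketch below uses four rows copied from the table above; the variable names are my own, and a real run would read every batter from the projection file instead.

```python
# A few rows copied from the table above: (name, 2006 average, Marcel projection).
rows = [
    ("Ellis", 0.249, 0.278),
    ("Gonzalez", 0.242, 0.285),
    ("Izturis", 0.245, 0.276),
    ("Cantu", 0.249, 0.281),
]

# Diff is the projection minus last year's average, rounded to three places.
with_diff = [(name, ba, mba, round(mba - ba, 3)) for name, ba, mba in rows]

# Largest expected improvement first.
with_diff.sort(key=lambda row: row[3], reverse=True)

for name, ba, mba, diff in with_diff:
    print(f"{name:<10}{ba:.3f}  {mba:.3f}  {diff:+.3f}")
```

A few rows in the tables differ from this subtraction by .001 (Kubel, for example), presumably because the published Diff was computed before the averages were rounded.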

Here’s a list of players whose batting averages are most likely to decline next year:

Last      First     06BA     mBA    Diff
Redmond   Mike      .341    .291   -.050
Scott     Luke      .336    .292   -.044
Bard      Josh      .333    .293   -.041
Ozuna     Pablo     .328    .290   -.038
Ward      Daryle    .308    .269   -.038
Cirillo   Jeff      .319    .281   -.038
Jones     Chipper   .324    .286   -.037
Helms     Wes       .329    .293   -.036
Coste     Chris     .328    .294   -.034
Jeter     Derek     .343    .311   -.033

You shouldn’t really be surprised by any of the players on this list. Let’s switch to On-Base plus Slugging Average (OPS). Here’s a list of players most likely to improve next year by regressing to the mean:

Last     First      06OPS    mOPS    Diff
Clark    Tony        .643    .826    .183
Gonzalez Luis A.     .625    .764    .139
Guillen  Jose        .674    .800    .126
LaRue    Jason       .663    .763    .101
Peralta  Jhonny      .708    .803    .095
Lee      Derrek      .842    .934    .092
Cantu    Jorge       .699    .789    .090
Lopez    Javy        .683    .767    .084
Hermida  Jeremy      .700    .782    .082
Niekro   Lance       .673    .754    .082
Crisp    Coco        .702    .783    .081
Varitek  Jason       .725    .806    .080
Navarro  Dioner      .687    .767    .080

Here’s a list of players most likely to decline:

Last       First      06OPS    mOPS    Diff
Scott      Luke       1.047    .872   -.175
Ward       Daryle      .926    .782   -.144
Ross       Dave        .932    .788   -.144
Helms      Wes         .965    .831   -.134
Dye        Jermaine   1.006    .879   -.128
Thome      Jim        1.014    .900   -.114
Beltran    Carlos      .982    .875   -.107
Anderson   Marlon      .866    .765   -.102
Bard       Josh        .926    .826   -.100
Saenz      Olmedo      .927    .828   -.099

Is Marcel saying that each of these players will regress to the mean? Absolutely not. Some of them won’t. But enough of them will regress to the mean to validate the entire approach. Marcel doesn’t predict breakout seasons; by definition, those are nearly unpredictable. It predicts what you can most likely expect from a player.

Projection systems start with regression to the mean, but they differ significantly in what they regress to. Marcel simply regresses to the overall major league average (with one exception for pitchers in the American League), while PECOTA regresses to the average of similar players (based on height, weight and other things). As another example, this thread includes a fine discussion of how to regress players who have only been in the majors a year or two.

Sir Francis Galton would be proud of the way baseball fans and analysts have incorporated regression to the mean in their thinking. I can also think of a few players who could use that gumption-reviver machine.

References & Resources
The biography of Galton is called Extreme Measures: The Dark Visions and Bright Ideas of Francis Galton by Martin Brookes. The New Yorker reviewed the book a couple of years ago.

Correlation and regression analysis were a tremendous contribution to mankind, but Galton’s other legacy is the field of eugenics. Galton envisioned eugenics as a utopian way to build the best human species. In his conception, eugenics was relatively innocent and naive. Adolf Hitler turned eugenics into a nightmare.

I want to credit John Burnson’s 2006 Graphical Pitcher for the graphical inspiration of regression to the mean. John used it to show the extreme regression to the mean of home runs per fly ball among pitchers.


Dave Studeman was called a "national treasure" by Rob Neyer. Seriously. Follow his sporadic tweets @dastudes.
