Statistical Shenanigans (part 2)

First a caveat: I’m afraid that this is another dull article about statistics and regression, so anyone here for some light-hearted historical commentary should turn away now.

There was a surprisingly positive reaction to my column a few weeks ago about the perils of correlation and regression—it appears that at least some THT readers are closet statheads. In this installment I take a look at how to think about and interpret baseball analysis, specifically regression studies, since regression is the statistical tool most commonly used in sports analysis. Some of the lessons we discuss today also apply to non-regression techniques.

There are five things we’ll look at:


  • First, how to determine that regression is the right tool for the job
  • Second, how to detect bias in sample data
  • Third, how to think through the regression equation to work out whether the independent variables make sense
  • Fourth, how to interpret the results, in particular the coefficients
  • And fifth, what some watch-outs are when drawing conclusions

To help ease the pain we’ll use plenty of examples to illustrate the relevant points. Let’s get cracking.

1. The right tool

Whatever analysis you are poring over, an important question is whether the analyst has chosen the best tool for the job.

Regression is a much overused tool, often favored by academics—especially economists—but it is not always the best approach. Regression analysis is easy and convenient: you have a bushel of data and are unsure of the interdependencies, so the natural tendency is to chuck it all into a regression and see what falls out. This approach often produces odd results.

The first question to ponder is whether a regression analysis is required. A perfect illustration is the use of regression to calculate linear weights (LWTS). The LWTS equation we want looks something like this:

LWTS = a + b*1B + c*2B + d*3B + e*HR

The b, c, d and e coefficients represent the values of the various hitting events; a is the intercept. Running this regression on 2002 game data gives the following coefficients:

a = -4.5 (this is average runs per game)
1B = 0.47
2B = 0.81
3B = 1.09
HR = 1.43

Number of games in regression = 2,426

Superficially, the events appear weighted correctly. But they are not—both doubles and triples (which are relatively less common) are overvalued. This is a foible of the regression: run it on different years and you’ll see the less frequent events jump around in value quite a lot, which is nonsensical. The valuation of triples is especially erratic: in some years a triple is worth nearly as much as a home run, in others it is closer to a double. The triple’s rarity creates statistical noise in the equation, resulting in error.
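If you want to see this for yourself, here is a minimal sketch of how such a regression might be run, assuming a data set with one row per game (the file and column names are hypothetical):

# Minimal sketch of a linear-weights regression over game data.
# The CSV and its column names ("R", "B1", "B2", "B3", "HR") are
# hypothetical stand-ins for per-game runs and hit counts.
import pandas as pd
import statsmodels.api as sm

games = pd.read_csv("games_2002.csv")

X = sm.add_constant(games[["B1", "B2", "B3", "HR"]])  # adds the intercept a
y = games["R"]                                        # runs scored per game

fit = sm.OLS(y, X).fit()
print(fit.params)  # a, then the coefficients for 1B, 2B, 3B and HR

Rerun this on a few different seasons and watch the double and triple coefficients wander.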

How do we know whether to use regression or a different tool?

The answer lies in having a deep understanding of whatever it is you’re trying to model. For instance, scoring in baseball depends on how many men are on base, which bases those men occupy, and how many outs are left. We can represent each base/out combination as a different state; if you have some mathematical background you’ll realize this can be modelled as a Markov process. On the other hand, thinking about baseball as “getting on base” and “moving runners over” will lead you to BaseRuns.
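To make the state idea concrete, the state space is small: eight runner configurations crossed with three out counts, 24 states in all. A quick sketch:

# The 24 base/out states that underpin a Markov model of an inning:
# eight runner configurations, each with zero, one or two outs.
from itertools import product

bases = ["---", "1--", "-2-", "--3", "12-", "1-3", "-23", "123"]
states = list(product(bases, [0, 1, 2]))
print(len(states))  # 24

A Markov model then estimates the probability of moving from each state to every other, and from there the expected runs scored.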

One tip is to identify how tightly the outcome is linked to the independent variables. In our LWTS example it is clear that hitting events are closely tied to offensive contribution, which speaks to a more fundamental (non-regression) model. Were we trying to link two things with no obvious mechanism—say, payroll to wins—regression would probably be better suited.

When the causal link is less obvious, regression probably will be the right tool. Think, for instance, of some of the work that Colin Wyers has been doing on wins and MRP. Regression should always be a second or third answer, never the first.

2. Bias in the Data

Biased data is one factor that will invariably lead to wonky conclusions—this affects all types of analyses, not just regressions. Selection bias happens a lot more frequently than you might imagine: every time a study imposes an at-bat cut-off (or any arbitrary cut-off), bias is an issue.

Consider Jim Albert’s age-curve analysis. Jim’s study tries to work out age curves for ballplayers and tease out how they change by decade of birth. Jim concludes that the overall average peak age is 28.4, but that for some decades (hitters born in the 1960s, for example) it crept up to 30. That is substantially different from the oft-quoted and generally accepted peak age of 27. What’s going on?

A 4.0 GPA isn’t required to work out that we have selection bias!

If you look at the criteria that Jim uses to select hitters, he imposes a 5,000 career PA cut-off. This means that only very good hitters qualify for the study. In fact, we’d expect the best hitters, those with 5,000-plus PA, to have longer careers precisely because they peak later or don’t have a particularly steep drop-off.

Tellingly, the number of players included in the study rarely rises above 100 per decade. Given that thousands of players register at-bats every season, players with at least 5,000 plate appearances are the elite of the elite. Jim’s study is not finding the age curve of your average hitter, but rather finding the age curve of a select group of uber-batters.

Players who are less good may peak earlier, or may have a steeper decline phase that dramatically alters the shape of the age curve. To account for this effect, a different study needs to be run. In fact, a rigorous analysis would cut the data by position, handedness, and talent (based on regressed OPS) to work out age curves for different player types.
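A toy simulation makes the point. Assume, purely for illustration, that players who peak later tend to accumulate more plate appearances; impose a 5,000 PA cut-off and the qualifiers’ average peak age floats above the population’s:

# Toy illustration of selection bias from a PA cut-off. The numbers,
# and the assumed link between peak age and career length, are invented.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
peak_age = rng.normal(27, 2, n)                          # true peak ages
career_pa = rng.normal(3_000, 1_500, n) + 400 * (peak_age - 27)

qualifiers = peak_age[career_pa >= 5_000]
print(f"all players, mean peak age: {peak_age.mean():.1f}")
print(f"5,000+ PA qualifiers:       {qualifiers.mean():.1f}")  # higher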

This example is typical of many analyses. Remember that any criterion used to choose a sample, either before or after the data have been captured, has the potential to introduce bias. If this happens, the insight and conclusions must be caveated appropriately.

3. Making sense of the variables

An analysis using the wrong variables is about as useful as a credit card, sans credit. There are no shortcuts here except common sense and logic. Here’s a slightly facetious example that “proves” that team OPS and team ERA have nothing to do with wins. Don’t believe me? Run the following regression (go on, do it):

Wins = a*winning% + b*teamOPS + c*teamERA

It will spit out:

Wins = 162*winning% + 0*teamOPS + 0*teamERA

As promised teamOPS and teamERA are completely unrelated to wins!

The problem, as I’m sure you can see, is that winning% absorbs the effect of OPS and ERA—in other words, the model suffers from what statisticians pithily call multicollinearity. Here the problem is obvious, but it would still exist (more subtly) if we swapped winning% for runs scored and runs allowed. Someone who knew nothing about baseball would infer that OPS and ERA aren’t important.

This isn’t the only problem—the equation is also tautological (wins = games * winning%). Sometimes it is almost impossible to get around this issue.
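You can reproduce the degenerate result with simulated team seasons (a sketch; every number here is invented, and wins is constructed as exactly 162 times winning percentage):

# Simulated demonstration: winning% soaks up all the explanatory
# power, leaving OPS and ERA with coefficients of zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
win_pct = rng.uniform(0.35, 0.65, 30)                 # 30 team seasons
ops = 0.700 + 0.40 * (win_pct - 0.5) + rng.normal(0, 0.02, 30)
era = 4.20 - 4.00 * (win_pct - 0.5) + rng.normal(0, 0.20, 30)
wins = 162 * win_pct                                  # the tautology

X = sm.add_constant(np.column_stack([win_pct, ops, era]))
print(sm.OLS(wins, X).fit().params)  # ~[0, 162, 0, 0]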

A couple of years ago David Gassko penned an article about why teams outperform their Pythagorean records, which contained this subtle flaw (sorry, David). He concludes that one of the main factors is the performance of the bullpen in close games, and he uses a regression to calculate the effect, which he terms leverage.

However, he uses saves as an independent variable. Of course teams that rack up more saves will win close games and outperform their Pythag—by definition, a save equals a win in a close game. Using a variable that depends on winning close games to explain the difference from the Pythagorean record isn’t correct. A better approach would have been a metric like relative bullpen ERA, which isn’t polluted by the save stat. David, Guy and I had a good discussion on Ballhype about this.

Unfortunately there is no recipe for weeding out unsuitable variables, but rigor and thought go a long way. I find it useful to apply three tests.


  • First: Look at the independent variables and ask whether they are correlated with each other, or indeed whether any is closely and obviously linked to the dependent variable. If they are, omit some, or run a correlation and see (there is a sketch of this check after the list).
  • Second: Systematically go through each variable and work out what effect it is testing. Try to build a counter-argument to see how robust the hypothesized effect is—identify the weak spots in the approach and address them in your commentary.
  • Third: Take a blank sheet of paper and ask what other variables you could possibly include. Refer back to the model to see if anything is obviously missing, then apply tests one and two to make sure any candidates will add new information to the model. If so, include them; if not, don’t.
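For the correlation check in the first test, a few lines suffice (the data file and column names are hypothetical):

# Sketch of the test-one check: a pairwise correlation matrix over the
# candidate independent variables. Values near +/-1 flag variables that
# are capturing the same effect. File and column names are hypothetical.
import pandas as pd

data = pd.read_csv("candidate_variables.csv")
print(data[["weekend", "night_game", "population", "income"]].corr())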

Let’s try this approach by looking at a study that attempts to regress attendance on the independent variables below.

1) Games Behind (sum of GB for both teams)
2) Whether the game is a weekend game
3) Whether the game is a night game
4) Population of home city
5) Unemployment rate of home city
6) Per Capita Income of home city
7) Distance between the teams' two cities 

Okay, let’s apply the first test, which is to work out whether any of the independent variables are correlated. Here are a few potential snares:


  • Sunday games tend to be day games, so there is a chance that variables 2) and 3) capture part of the same effect
  • The population of a city is likely to be related to per capita income (money attracts money), which means 4) and 6) could partly overlap
  • Ditto for unemployment—a richer city will likely have lower unemployment, which throws 5) and 6) together

These points feel relatively minor, so we’ll live with them for the time being—if I had to tinker I’d probably drop the unemployment rate. Now let’s apply the second test and look at the rationale for including each variable:

  • I’m not sure the sum of GB for both teams drives attendance. If a contending team is one game out of the lead but its opponent is 30 games back, I’d bet that attendance would still be healthy. At the very least this needs to be separated into two variables.
  • Fans are more likely to flock to the yard when they aren’t working, i.e. on weekends or in the evening. Variables 2) and 3) make sense to me.
  • Cities that have more people or that are richer are likely to attract more fans—this accounts for 4), 5) and 6).
  • Distance between the two teams’ cities could be a bit of a red herring. Would Braves fans travel to Chicago but not San Francisco to watch their team play? Probably not—this doesn’t feel like a real effect to me.

The final test is to think through whether there are any other variables missing (this really is the realm of brainstorming). Here are a couple of ideas:

  • Ballpark age is likely to be a factor, especially in recent times
  • Ballpark amenities: Concessions, parking, proximity to public transport
  • Depth of fan base (there are Yankees fans everywhere)

Some of these are difficult to measure but the point is that by following a reasonably rigorous approach it is easy to pinpoint the strengths and weaknesses of a model. It’s amazing how far a little thought goes.

Arriving at the best regression takes time. The discerning analyst should try multiple combinations of variables to see how coefficients and significance change across different models, as in the sketch below. This will give a sense of which variables should be included and which omitted.
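Here is what that workflow might look like with the statsmodels formula interface (the attendance data file and column names are hypothetical):

# Fit several specifications and watch how coefficients and p-values
# move between them. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

games = pd.read_csv("attendance.csv")
specs = [
    "attendance ~ games_behind + weekend + night",
    "attendance ~ games_behind + weekend + night + population",
    "attendance ~ games_behind + weekend + night + population + income",
]
for spec in specs:
    fit = smf.ols(spec, data=games).fit()
    print(spec)
    print(fit.params.round(2), fit.pvalues.round(3), sep="\n")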

4. Results and interpretation

Exhaustion might be setting in, but our most important work is still ahead. Correctly interpreting the results is the trickiest part of any analysis, regression or otherwise. There are two steps: first, understand the structure of the regression equation so you can interpret it correctly; second, develop the insight from the results.

Understand the equation

If the regression equation is complicated (e.g., involves a logarithm or a logit) then some mathematical gymnastics may be required to translate the coefficients into something meaningful.

Let’s take a look at a simple example. Suppose you saw the following regression (more details here):

ln(player salary) = 3.681*OBP + 2.175*SLG + …

The 3.681 coefficient can’t be read directly because the dependent variable is a logarithm. Say we want to find the impact of a 0.100 OBP increase: the change in ln(salary) is 3.681 * 0.100 = 0.368, and taking the exponent of 0.368 gives 1.44. This means an extra 100 points of OBP increases salary by about 44%. It is always helpful to read the notes to the equation so you know how the authors have presented their results.
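Spelled out as arithmetic (the coefficient comes from the equation above):

import math

# A 0.100 increase in OBP moves ln(salary) by 3.681 * 0.100 = 0.368;
# exponentiating converts that back into a multiplicative salary effect.
print(math.exp(3.681 * 0.100))  # ~1.445, i.e. roughly a 44% raise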

Once you have a feel for what the coefficient means you can interpret the equation and start to test some of the underlying assumptions. For instance, take David Gassko’s DIPS 3.0 equation:

DIPS ERA = (-0.041*IF + 0.05*GB + 0.251*OF + 0.224*LD + 0.316*BB - 0.12*SO + 0.43*HBP) / IP * 9

The GB coefficient, for instance, says that for every extra ground ball a pitcher gives up, DIPS ERA increases by 0.05/IP*9. Think about that for a second. The assumption is that all hurlers give up ground balls in a similar way: if hurler A only gave up hard ground balls that found the gap while hurler B gave up soft worm-burners that always lollygagged to the shortstop, then we’d expect a difference in the GB contribution to ERA. In practice, although some pitchers can induce softer grounders than others, the effect is probably small and can be ignored.
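Translated literally into code, the equation reads as follows (component counts and innings pitched as inputs):

# DIPS 3.0 ERA: a direct translation of the equation quoted above.
def dips_era(inf, gb, of, ld, bb, so, hbp, ip):
    runs = (-0.041 * inf + 0.05 * gb + 0.251 * of + 0.224 * ld
            + 0.316 * bb - 0.12 * so + 0.43 * hbp)
    return runs / ip * 9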

Develop the insight

The next step is to develop the insight. This is largely predicated on a deep understanding of what you are analysing—there is no substitute for expertise. A skeptical eye is critical, so always look for alternative explanations.

A good example is the Jim Albert paper we discussed earlier. If you cast your mind back, you’ll remember that Jim concludes that peak age has increased in recent decades. The obvious explanation is better health, fitness and nutrition.

A good analyst won’t be satisfied with that. Are there any other reasons? Yes—absolutely. Jim defines peak age using linear weights per plate appearance. We know that over the last 15-20 years run scoring has increased. Could this be responsible for part of the age effect we are seeing?

If run scoring has become easier, then we’d expect LWTS/PA to drift upwards over time—and this would shift the age curve. Given the spate of power hitting since the 1990s, this explanation is at least as plausible as better nutrition.

Another example is a ludicrous paper on home field advantage (HFA) in baseball by a couple of professors at Georgia Southern University.

They model the propensity to win at home by looking at a bunch of variables, including runs, runs squared (why?), one-run games, two-run games, and roster size (25- vs. 40-man). The main conclusion is that HFA is more prevalent in close games (one- or two-run games) than in games decided by three or more runs.

I’m sure you’ll agree that this conclusion seems a bit suspicious. Why should HFA evaporate in blowouts? It doesn’t make sense. Do the authors proffer any explanations? No. When there is no rational explanation for a result, that is a signal that something about the data or the study isn’t right.

A deeper look at the study reveals that the authors forgot about the impact of the bottom of the ninth and extra innings: a home team batting last stops playing the moment it takes the lead, which caps its margin of victory. This means home teams are more likely to win by one or two runs than by three or more, as Tom Tango discovered.

There are no hard and fast rules for interpreting studies, and it is not uncommon for people to interpret the same results differently. It is often helpful to think through the conditions under which the model would throw up exceptions. For instance, PrOPS, which is a regression model, underrates speedsters (they are more likely to turn a ground ball into a base hit) and overrates sluggards.

Check whether the results pass the sniff test, then think up a bunch of possible counter-arguments. Use your baseball intuition and knowledge to pass judgment. If the answer is ambiguous, either the wrong question was asked or more work is required. Issues normally lie hidden in the methodology or the data selection, so revisit those parts of the analysis for ideas about what may be wrong. Most important of all, remain deeply dissatisfied.

5. Other watchouts

There are a couple of other tips and tricks worth knowing that we haven’t covered, namely statistical significance and effect size, and standardized coefficients.

Statistical significance and effect size

People bandy around correlations and regressions as “statistically significant” without really understanding what that means. Significance is the confidence we have in a result, and it is calculated from standard errors. The larger the sample size, the lower the standard error, so if we have a lot of data points it isn’t too difficult to demonstrate significance.

However, significance is irrelevant if the effect is small. Consider a hypothetical study trying to work out the link between HFA and team OPS. Such a study might yield an equation like this, with all coefficients significant at the 1 percent level:

Team_OPS = 0.001*Home_Team + …

Great. But check out the Home_Team coefficient. The equation says that HFA accounts for only one additional point of OPS (remember, this is all fabricated). The result may be significant, but the effect is negligible to the point of not being worth worrying about.
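With enough data even a one-point effect clears the significance bar, as this fabricated simulation shows (all numbers are invented to mirror the example):

# Fabricated demonstration: a one-point OPS effect of playing at home
# comes out "significant" given enough observations, yet is practically nil.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000
home = rng.integers(0, 2, n)                           # home/road flag
ops = 0.750 + 0.001 * home + rng.normal(0, 0.050, n)   # one-point effect

t, p = stats.ttest_ind(ops[home == 1], ops[home == 0])
print(f"p-value: {p:.6f}")  # tiny, despite a negligible effect size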

Sometimes effect size and significance work in the opposite direction, too. An analyst can detect a strong effect that, because of inadequate data, doesn’t appear significant. This is not an excuse to blindly disregard effects that fail our significance test—with different data or a different sample the effect may prove real. The trick is to keep an open mind and try to understand what is possible.

J.C. Bradbury’s DIPS study analyses which pitching components have the biggest impact on ERA. He runs regressions for each year and concludes that, in general, BABIP is rarely statistically significant. True, but a glance at table 6 shows that the BABIP coefficient is always negative and in many cases would be significant at the 10-20 percent level. This indicates that BABIP does loosely influence ERA, and indeed further DIPS studies have borne that out.

Standardized coefficients

Also, in some analyses it is difficult to compare the relative impact of two variables. Imagine you ran a regression to see the impact of height and weight on the amount of food someone eats. From the equation that is spat out it is difficult to assess the relative impact of the two independent variables because they are measured in completely different units. Standardized coefficients adjust for this: a standardized coefficient tells you how many standard deviations the dependent variable will change for a one standard deviation change in an independent variable. It normalizes for both units and variance.
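One minimal way to obtain them is to z-score everything before fitting (the food-study data file and column names here are hypothetical):

# Standardized (beta) coefficients via z-scoring both sides, so every
# coefficient is in standard-deviation units and directly comparable.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("food_study.csv")
z = (df - df.mean()) / df.std()          # z-score every column

X = sm.add_constant(z[["height", "weight"]])
fit = sm.OLS(z["food_intake"], X).fit()
print(fit.params)                        # now comparable across variables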

For instance, consider an equation that appeared in the THT 2007 Annual that attempts to work out whether hitters or pitchers have more say in whether a pitch outcome is a ground ball:

Match-up GB% = 0.67*hitter GB% + 0.33*pitcher GB%

The author goes on to conclude that this means hitters have a bigger influence on outcomes than pitchers do. However, this isn’t necessarily true. First, the equation isn’t anchored to the mean: match-up GB% sits on a 0-100 scale where 0 is pinned to the pitcher’s GB% and 100 to the hitter’s GB%, which means that every single batter-pitcher match-up uses a different scale (based on the pair’s relative GB%). The right way to do this would have been to anchor the result around the mean—imagine a match-up between a .300 hitter (above average) and a .300 pitcher (below average) … our expected result is not .300. The correct representation is +0.03 for the hitter and -0.03 for the pitcher.

Second, the higher coefficient for hitters could simply reflect (indeed, likely reflects) greater variance among hitters than among pitchers. The hitter coefficient tells us that a 1 percent increase in a hitter’s GB rate adds 0.67 to match-up GB%—but that does not imply that hitters have more “influence” on GB outcomes, just more variance. Imagine we take 100 such match-ups, all on a scale between 0 and 100 (these marks are not anchored). If hitter GB% has more variance, then we’d expect the results to land closer to 100 more often than not—which is exactly what the regression found. The fact that we refer to ground-ball pitchers rather than ground-ball hitters is a big clue. Using a beta (standardized) coefficient, which adjusts for variance, would correct this.

One final point: be careful about the conclusions you draw. The old adage that correlation is not causation is very true. Doing this stuff, it isn’t too difficult to end up looking like a complete ass. That is something I try to avoid … and you should too!

In summary

Phew. That was tough work, right? Anyway, I hope you’ve learned a thing or two about regression and will cast an even more eagle eye over any baseball analysis you see in the future, regression or otherwise.

References & Resources
A hat tip to David Gassko, Tom Tango, MGL, Phil Birnbaum, Guy M and a host of others who have done a ton of work rebutting bad regression techniques.

