# Regression with Changing Talent Levels: The Effects of Variance

In my recent article comparing Elo and regression to the mean, I noted that weighting past results with an exponential “decay factor” so that more recent data gets more weight can impact the regression constant used to estimate true talent. Specifically, the effect was as follows:

It turns out the correct amount of regression to use is roughly half the original regression constant (actually n/(2n-1), where n is the maximum effective weight of the sample based on the decay factor, which is close to 1/2 as long as n isn’t too small).

There wasn’t space in the article to lay out the math behind that statement, but I went through it in a comment on TangoTiger’s blog for those interested.

The n/(2n-1) formula works well for creating a regression model that emulates Elo, which was the purpose of the previous article, but for practical purposes, the impact of weighting past results on the mathematics of regression to the mean is a bit more complicated than that.

There are two important caveats that are critical to understanding why the n/(2n-1) formula isn’t necessarily what we want to use in practice.

First, n/(2n-1) is a long-term limit. Applying an exponential decay to past results places a cap on the total effective sample size. No matter how much data you have, the total weight will never go above a certain point (because this type of weighting creates a geometric series).

As the effective sample size approaches the limit n, the regression constant approaches the limit C * n/(2n-1), where C is the original regression constant. With the sample sizes we typically work with for projecting baseball teams or players, though, the regression constant generally doesn’t have enough time to reach the long-term limit.

Second, the math behind that formula relies on the simplified assumption that talent levels are constant for each team or player in the sample. In reality, the need to weight past results implies the underlying talent levels are changing, which can have additional implications on the regression constant that make this a lot messier. This wasn’t an issue with trying to emulate Elo, which gives an un-regressed estimate, but for creating an actual regression-based model, it will be.

This second point is especially important. In the math linked in the comment on Tango’s blog, the takeaway is that the regression constant is linked closely with the amount of variance in the sample. The reason the regression constant changes when you weight for recency is that doing so distorts the amount of variance.

This is important because the complications involved with changing talent levels also center around distorting the amount of variance in the sample. This means that, while we could adjust the n/(2n-1) formula to work for any sample length and not just as a long-term limit, it would still have problems due to the distortions introduced by changing talent levels.

### General Principle No. 1

*The regression constant is closely related to the amount of variance in your sample. Factors that can affect the variance, such as weighting past results or changes in talent levels over time, can produce corresponding changes in the regression constant.*

#### Estimating the Regression Constant

The basic method to calculate the regression constant is to split your sample into two halves and measure the correlation between them. If you’ve ever seen an article on how long it takes particular stats to “stabilize” (such as Russell Carleton’s or Derek Carty’s) the author was likely doing something along these lines.

There are a few other ways to do this besides just splitting the sample in half and running a correlation (such as intraclass correlation, Kuder-Richardson-20/KR-21 and Cronbach’s alpha*), but the general idea is the same: You use either correlation or variance to calculate a reliability score for your sample, and the reliability tells you how much to regress. In general, the reliability goes up as your sample size increases, and the regression constant represents the point at which the reliability of the sample reaches 0.5.

**I believe Carleton’s second article underreports the plate appearance thresholds by a factor of 1/2. It appears from the article that he interprets the reliability of a 500 PA sample as the split-half correlation with 250 PAs in each half, but the formulas he is using—KR-21 and Cronbach’s alpha—both give reliability for the full sample length, not half. If that is the case, it would help explain a noted discrepancy between Carleton’s and Carty’s numbers.*

If terms of correlation, the reliability score (for half the sample length) is the correlation between the two halves of your sample, and the reliability will be 0.5 when the number of observations in each half equals the regression constant. In terms of variance, it is the proportion of total variance in the sample that comes from variance in talent rather than random variance, and the reliability will be 0.5 when an equal amount of variance comes from both talent and random variation.

The fact that we can use correlation or variance interchangeably is helpful because one framework might be better suited to a particular discussion. In our case, focusing on the variance of talent in the population will help us understand how the distortions in variance caused by changing talent levels can impact the regression constant.

### General Principle No. 2

*The regression constant represents the number of observations you need for the reliability of your sample to reach 0.5. The reliability of the sample can be expressed (and calculated) in terms of either correlation or variance, and the two approaches can be interchanged depending on which is more convenient.*

#### Variance in the Sample

The relationship between variance and regression is based on a simple concept: The wider the distribution of talent across the league, the more players you have whose actual talent level is a given distance from average. And the more players whose true talent is a given distance from average, the more likely an observed performance at that level is due to talent rather than random variation.

As a result, a wider spread in talent translates to a lower regression constant. Mathematically speaking, this increases the proportion of observed variance that comes from talent as opposed to random variation, thus increasing the reliability of the sample. And of course the reverse is also true; a narrower spread in talent will lead to more regression.

When talent is constant for each player throughout your sample, the variance in talent won’t ever change. What about when talent levels start changing from day to day, though?

To start, let’s look at a simple hypothetical league of 11 hitters whose true talent in on-base percentage is spread evenly from .300 to .350. This is obviously an unrealistic distribution for major league baseball, but keeping things simple will help us see what happens when we introduce random changes in talent.

Now, let’s say that each hitter’s talent level changes randomly each day, and that those changes follow a normal distribution with a standard deviation of .001. This doesn’t change much on a day-to-day level, but what happens as these small changes start to add up?

To show this, I’ve simulated 100 days for our hypothetical league:

Player | Original | After 1 day | After 100 days |
---|---|---|---|

Player 1 | 0.3000 | 0.2999 | 0.3142 |

Player 2 | 0.3050 | 0.3049 | 0.3053 |

Player 3 | 0.3100 | 0.3088 | 0.2960 |

Player 4 | 0.3150 | 0.3148 | 0.3104 |

Player 5 | 0.3200 | 0.3199 | 0.3125 |

Player 6 | 0.3250 | 0.3246 | 0.3193 |

Player 7 | 0.3300 | 0.3309 | 0.3520 |

Player 8 | 0.3350 | 0.3355 | 0.3509 |

Player 9 | 0.3400 | 0.3405 | 0.3543 |

Player 10 | 0.3450 | 0.3433 | 0.3264 |

Player 11 | 0.3500 | 0.3506 | 0.3517 |

The small changes become more substantial over time, but more importantly, something happens to the spread in talent across the sample. The variance of true talent started at 0.000275, but after 100 days of random changes in talent, that jumps to 0.000470.

Of course, with an 11-player sample, that jump could just be random chance, but even if we repeat the simulation with more players, we see the same thing. A jump to 0.000470 is a bit higher than expected (the expected variance in talent after 100 days under these conditions would be around 0.000375), but the fact that it increases remains.

This is because the original talent levels and the random changes in talent are uncorrelated. When you combine two uncorrelated variables, their variances also combine. In other words, we would expect the spread in talent after 100 days to approximately equal the variance in original talent (0.000275) plus the variance of 100 days worth of random talent changes (100 * 0.001^2 = 0.0001).

This is a problem. The implication is that, if changes in talent are uncorrelated with talent levels, the spread in talent across the population should be constantly increasing. Which, if you think about it, makes sense. If the players at the extreme ends of the talent spectrum are just as likely to get even better or even worse as they are to move toward the center, then those extreme ends will drift further and further apart as half the top players keep getting better and half the bottom players keep getting worse.

But this doesn’t match what we see. The spread in talent in the majors doesn’t appear to be constantly increasing like we’d expect from this model.

Our model, then, needs an update. We can correct for the constantly increasing spread in talent by constantly resetting the variance back to its original level. This is done by moving each player’s new talent level slightly toward the center of the distribution after each talent change.

This has the effect of creating a slight negative correlation between a player’s talent level and subsequent changes in talent—that is, players at the top end of the talent spectrum are more likely to fall back toward the center than to continue moving further away—which cancels out the effect of the additional variance.

Player | Original | After 100 days | Variance Corrected |
---|---|---|---|

Player 1 | 0.3000 | 0.3142 | 0.3168 |

Player 2 | 0.3050 | 0.3053 | 0.3099 |

Player 3 | 0.3100 | 0.2960 | 0.3028 |

Player 4 | 0.3150 | 0.3104 | 0.3138 |

Player 5 | 0.3200 | 0.3125 | 0.3154 |

Player 6 | 0.3250 | 0.3193 | 0.3206 |

Player 7 | 0.3300 | 0.3520 | 0.3457 |

Player 8 | 0.3350 | 0.3509 | 0.3448 |

Player 9 | 0.3400 | 0.3543 | 0.3474 |

Player 10 | 0.3450 | 0.3264 | 0.3260 |

Player 11 | 0.3500 | 0.3517 | 0.3454 |

Variance | 0.000275 | 0.00047 | 0.000275 |

This sounds similar to regression to the mean itself, and in fact, the calculations to offset the growth in talent variance are done the same way as the calculations used in regressing observed results. However, regression to the mean is a purely statistical phenomenon that occurs even without changes in talent. It tells us that a player’s true talent level tends to be closer to the mean than his observed performance, not that his talent level itself is changing in any way.

What this suggests is that, given that talent levels are changing from day to day, the underlying talent levels themselves will also tend to drift toward the center going forward.

### General Principle No. 3

*When there are changes in talent over time, one of two things should happen: Either the spread in talent across baseball will increase steadily over time, or individual player talent levels will tend to drift back toward the center. If there is no evidence of the former occurring, then we can assume the latter must be true.*

#### How Talent Changes Affect Variance

When we look at observed performance, there is a wider spread in performance over the short term than the long term. For example, a month into this season, there were a handful of hitters with wOBAs over .500, and another handful of everyday starters with wOBAs under .200. By the end of the season, those extremes will have narrowed considerably as the players on the extreme ends of the spectrum drift toward the center.

If true talent also tends to drift toward the mean, then we would expect to see something similar happen with the underlying talent levels. And if we take the average talent level for each player in our example over the entire 100 days rather than just looking at talent one day at a time, we do indeed see the spread narrow:

Player | Original Talent | Average Talent |
---|---|---|

Player 1 | 0.3000 | 0.3108 |

Player 2 | 0.3050 | 0.3085 |

Player 3 | 0.3100 | 0.3090 |

Player 4 | 0.3150 | 0.3153 |

Player 5 | 0.3200 | 0.3139 |

Player 6 | 0.3250 | 0.3205 |

Player 7 | 0.3300 | 0.3432 |

Player 8 | 0.3350 | 0.3419 |

Player 9 | 0.3400 | 0.3458 |

Player 10 | 0.3450 | 0.3353 |

Player 11 | 0.3500 | 0.3466 |

Variance | 0.000275 | 0.000257 |

The variance in talent among players stays at .000275 for any particular day, but when we combine the 100 days and take each player’s average talent level over that span, the variance shrinks. Even though the variance in talent across the league remains the same from one day to the next, the players who are near the top or bottom of the league in true talent on any given day are more likely to be less extreme throughout the rest of the sample, on average.

As these players drift back toward the center and new players replace them at the extreme ends of the spectrum, nobody averages out as wide as the day-to-day extremes over the whole sample. The longer you observe the sample and the more time you give players’ true talent levels to change, the smaller the spread gets between them.

This reaches the crux of how changing talent levels impact regression. Decreased talent variance means lower reliability, and lower reliability means more regression.

In our example, the regression constant for our 11-player population would be approximately 800 PA. (The random binomial variance after 800 PA is ~.000275, which means that’s the point where the total observed variance is half random and half true talent variance.) But that is only if we are projecting talent for the immediate future. If we are instead projecting talent over the next 100 days, where we expect the variance in talent to tighten, suddenly we have to add ~850 PA of regression (the point where random binomial variance equals .000257).

The regression constant is generally treated as the same value whether you have one plate appearance or a thousand (hence the term “constant”), which works well when talent levels never change. When talent levels are changing from day to day, though, that no longer holds.

### General Principle No. 4

*The regression constant is generally treated as a constant value regardless of how much observed data you have, but changes in talent over time can cause the regression constant to increase as the variance in talent across the population shrinks.*

#### Calculating Changes in Variance Over Time

We know that when talent levels change from day to day, the spread in talent across the league can shrink as your sample length grows. But how exactly is that relationship defined? How do we know how much the variance in talent shrinks?

It depends on two factors: the length of time covered by your sample, and how much talent changes from day to day (as measured by the correlation between talent levels from one day to the next). The longer your sample length, or the more talent changes from day to day, the more the spread in talent will narrow.

Once we know these two factors, we can calculate how much the spread in talent diminishes using the following formula:

Again, there isn’t space in this article to cover why this formula works, but I’ve written a supplement going through the math for those interested. If you prefer, you can also download the mathematical explanation in PDF form along with some R code to test the formula against simulated data.

In our 11-player example, the day-to-day correlation in talent is about 0.998. If we plug d=100 and r=0.998 into the above formula, we get that the variance in talent after 100 days should be about 94 percent of the original variance in talent. That comes out to about 0.000259, or right around what we got in our sim.

### General Principle No. 5

*The amount that variance in talent is reduced over time depends on two factors: the length of time covered by your sample and the correlation between talent levels from one day to the next.*

#### Estimating the Day-to-Day Correlation

This formula allows us to calculate the expected variance in talent at any sample length, which in turn lets us adjust the regression constant to any sample length. If we have data covering a full season but want to project talent levels for the immediate future, we can convert the variance in talent over 180 days to the variance in talent over one day and then recalculate the regression constant based on the new variance.

In order to use the formula, though, we need to know the day-to-day correlation for talent. In practice, this value can be tricky to estimate since we don’t know how much talent is changing from day to day, but there are a couple ways we can do it.

One is to take samples of varying lengths and see how much the variance in talent (which we can estimate using the same reliability scores used to estimate the regression constant) diminishes over time.

As an example, I took the on-base percentage for all player-months from 2000-2016 where a hitter had at least 90 PA, and from that I get that the variance in OBP talent across the population is 0.000962. Then, I took all player-seasons over the same span where a hitter had at least 500 PA, and from that I get an OBP talent variance of 0.000910.

Going from d=30 to d=180 (the approximate number of days in a month and a season) lowers the variance in talent by a factor of 94.54 percent. That corresponds to a day-to-day correlation of 0.9989, where the variance in talent for one day is 0.000974, the talent variance over 30 days is 98.86 percent of the one-day value, and the talent variance over 180 days is 93.44 percent of the one-day value.

(This takes some playing around with the formula—it’s kind of messy to solve for two values of d, but you can set up the formula in Excel, R, etc. to quickly guess-and-check values and home in on the solution.)

Another way to estimate the day-to-day correlation is to measure the correlation between samples at varying intervals.

For example, if I take the monthly OBP data and find players with consecutive qualifying months, I get a correlation of 0.2506 for OBP one month to the next. If I then look at the same players two months apart, the correlation drops to 0.2432. Adding an extra 30-day gap between the samples lowered the correlation by a factor of 0.9704, which is equivalent to lowering it by a factor of 0.9704^(1/30) = 0.9990 each day. This value tells us how much the underlying talent levels correlate from one day to the next.

Using our two methods, we get r=0.9989 and r=0.9990. I should warn you, it is pretty lucky that our two methods yielded such similar values. You might occasionally get an estimate of 0.9950 using one method or set of data and r>1 using another (r>1 is clearly wrong, but sometimes you will run across something like the correlation actually going up when you go from one month apart to two months apart due to randomness in the data).

This is largely because we are dealing with ratios of very small numbers, and a small amount of randomness in the measurements can have an amplified effect. You can mitigate the randomness somewhat by looking at the data in as many ways as possible, such as comparing the correlation at several different intervals or looking at the variance at several different sample lengths, but you’ll inevitably have to deal with a degree of uncertainty in these estimates.

### General Principle No. 6

*The correlation of talent from one day to the next is key to calculating how much the spread in talent narrows over time, and thus to calculating a regression constant for varying sample lengths. This value can be tricky to estimate precisely, but we can get a general idea by evaluating how the variance or correlation of observed data changes over time. For many baseball stats, something on the order of 0.999 should be relatively close to the right value.*

### Conclusions

The regression constant is closely related to the amount of variance in your sample—the greater portion of that variance that comes from the spread in talent, the lower the regression constant, and vice versa.

We saw in the Elo vs. Regression article that weighting past results lowers the regression constant. This is because weighting past results lowers the random variation in the data, which in turn increases the proportion of overall variance that comes from talent.

However, weighting past data also implies that the underlying talent levels are changing over time (or, alternatively, that you are weighting for no reason). This also affects the variance, but in the opposite direction of weighting. That is, while weighting past data increases the proportion of variance attributable to talent, the changes in talent implied by weighting decrease that proportion.

This dampens the effect of weighting on the regression constant. The interaction of these two effects is complex and is not covered in this article, but the impact of variance discussed here is important to understanding that interaction.

That means this is still an intermediate step in understanding how changes in talent impact the math of regression to the mean. Before we can tackle the full effect, we need to understand why there is an effect and what factors contribute to this effect.

The most important factor in these calculations is the correlation of talent from one day to the next. This allows us to translate the day-to-day changes in talent into a quantifiable impact on variance, which in turn lets us measure the impact on the regression constant.

### References & Resources

- Adam Dorhauer, 3-D Baseball, “Math Behind Regression Talent Changes”
- Github, “Math Behind Regression with Changing Talent Levels”
- Adam Dorhauer, The Hardball Times, “Elo vs Regression to the Mean: A Theoretical Comparison”
- Russell Carleton, FanGraphs, “525,600 Minutes: How Do You Measure a Player in a Year?”
- Tom Tango, Tangotiger Blog, “Regression v. Elo” comments, particularly those by Jared Cross
- Wikipedia, “Geometric Series”
- Derek Carty, Baseball Prospectus, “Resident Fantasy Genius: When Hitters’ Stats Stabilize”
- Harry Pavlidis, The Hardball Times, “It makes sense to me, I must regress”
- Russell Carleton, Baseball Prospectus, “Baseball Therapy: It’s a Small Sample Size After All”
- Tom Tango, The Book Blog, “Updated r=.50 numbers”

“In terms of variance, it is the proportion of total variance in the sample that comes from variance in talent rather than random variance, and the reliability will be 0.5 when an equal amount of variance comes from both talent and random variation.”

Very interesting. But, why from a correlation standpoint are you using .5 rather than .7, if the goal is to explain half the variance? Much more importantly, what is the basis for the declaration that ALL variance not accounted for in the reliability measurement can by definition be only “random” as opposed to merely unobserved?

It’s a bit of a simplification to assume variance splits neatly into “talent” and “random”. If you want to build a more precise projection system, you could try to account for additional sources of variance, like differing parks, weather, opposing pitchers/hitters, etc, but for the purposes of investigating the math behind it, it helps to stick to simpler assumptions. In general, I think most projection systems try to avoid adding complexity whenever they can help it since you can usually get pretty close by using simplified assumptions, and each layer of added complexity can bring more risks (i.e. potential bugs… Read more »

This is largely incorrect. Because these reliability estimates are generally collected by within-sample pairs, the assumption is that the underlying talent is the same in each sample, because it’s the same player. That’s probably not true in the strong sense, but if you do the methodology right, you can get it to where you can say “close enough” and the entire point of what we’re trying to measure in reliability rests on the assumption that talent is essentially the same. Reliability analysis unto itself doesn’t answer the question of how well performance tracks underlying true talent. It measures how well… Read more »

The correlation between two observed samples, which each has random variation in the data, is going to be lower than the correlation between one observed sample and the underlying probabilities used to generate that sample, even if the same probabilities are used to generate both samples. Say you have two unfair coins, one of which comes up heads 1/3 of the time, and the other of which comes up heads 2/3 of the time. You know which coin is which. Flip each coin ten times and record the number of heads in one column, and the probability of getting heads… Read more »

The formatting of the table columns got squished together in the comment, but hopefully it’s clear enough what they should be.

Adam is correct.

When true-to-observed correlation is .71, then observed-to-observed correlation (given same trials) will be the square of that (or r=.50).

The tripping block for me is this ” that means 50% of the variance in one sample explains 50% of the variance in the other sample, and you end up with the variance in one sample explaining 25% of the variance in the other sample.” Variance in performance is a function of variance in talent and error. We’re assuming (falsely, but it’s an assumption of convenience) that talent is unchanging across what we’re sampling. We’re also assuming that none of that error is measurement error (again, probably false, but that’s what we’re doing) and that all the error is random.… Read more »

There are some major concerns that I have about this one. I appreciate the idea of trying to model the idea that true talent levels are not static and the use of decay-type modeling seems reasonable as an approach that might work to shed light on the problem. I buy that performance on an individual level will (over a large enough sample) shrink toward true talent level. If true talent levels were static (they aren’t, and I get that’s why you wrote the piece) then over a billion PA, everyone’s performance would be correlated with their true talent around 1.0.… Read more »

“I could split the questions up evens-and-odds and see how the 12 evens correlate with the 12 odds. That’s basic split-half. (Cronbach/KR-21 amplify this process by essentially taking all possible combinations of 12 and 12 and coming up with an average correlation.)” The average correlation given by Cronbach’s alpha/KR-21 is adjusted to a sample length of 24, though. It would be like taking all possible combinations of 12 and 12, averaging the correlations, and adjusting them from 12 observations to 24 using the Spearman-Brown prediction formula. If you did actually split the sample into 12 and 12, it would give… Read more »

I looked this one up and apparently, I have been doing this one wrong. Somewhere along the line when I learned Cronbach, I didn’t get the part about Spearman-Brown.

So, that means everything is even slower to stabilize than I thought. Great.

I’m curious why you think it’s more relevant to look at change in talent over time than other effects. Specifically, I think going from one ballpark to another and facing different starting pitchers is more likely to affect outcome than changes in talent from day to day. This is why I think it would be better to evaluate performance with an implicit model that uses the results of individual matchups rather than an explicit model that assumes zero variance in opponent talent and only addresses park effects with post box adjustments.