# How accurately can we estimate a hitter’s runs? (Part 1)

One of the things that sabermetrics has gotten really, really good at is run estimators. We can estimate the heck out of some runs.

But run estimators are estimates of runs and estimates come with a certain amount of uncertainty.

Much has been written about the accuracy of run estimators. (The past *Hardball Times Annual* contains probably my best effort on the subject.) We have a pretty good handle on how accurate our run estimators are at the team level, for instance. (Typically somewhere within about 25 runs, depending on the estimator being tested and the sample involved.) We can test how accurate they are for pitchers, or games, or even half-innings.

And then we go ahead and print out our estimates of a hitter’s contributions to runs out to some number of decimal places.

But how accurate are those estimates? If we have two batters who are 0.1 runs apart, how confident are we that one is the better hitter than the other? If we have two batters who are 1 run apart, how confident are we?

One run? Five runs? 10 runs?

Let’s find out.

### Some points of clarification

Before we begin, let’s throw out three words that won’t appear past this section of the article: “true talent level.” We’re not trying to measure how talented a player is. (Okay, well sometimes we are. But not right now.) We’re interested in how valuable he was.

Again, our run estimates are run **estimates**. We’re pretty good at estimating runs based upon component stats, but we’re not perfect. They are going to be “off,” at least a certain percent of the time. This article is about figuring out by how much. It is not about true talent.

Also, this article is written as though the only source of error in our run estimates is random error. This is not always true. Walks are the most frequent source of persistent error in a run estimator, with home runs the second most frequent.

### The quick way

Let’s take a look at a specific set of linear weights: specifically, my House Weights. I happen to think they’re one of the finer sets of LWTS out there, but of course I would think that (and if I didn’t, I would change them). (I used an expanded set designed to be used with the full set of Retrosheet events, rather than the published version that only considers official batting totals.)

For teams from 1993-2008, what I consider to be the “modern” era of offense, the typical batting team had a root mean square error of 22.38 runs in 6121 plate appearances. In other words, assuming a normal distribution, about 68 percent of our estimates should be within 22.38 runs of the actual runs. (This matches up nicely with the observed data: 69 percent of teams’ estimated runs are within 22.38 runs of the actual runs scored.) So let’s call that our standard error, or at least our estimate thereof. (From here on out, when I say error, what I mean is the difference between estimated runs and actual runs.)
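For anyone who wants to run the same check on their own data, here is a minimal sketch of the two calculations described above: the root mean square error and the share of teams that land within it. The function names and the four team seasons are made up for illustration; they are not the Retrosheet data behind the 22.38 figure.

```python
import math


def rmse(estimated, actual):
    """Root mean square error between paired estimated and actual runs."""
    errors = [e - a for e, a in zip(estimated, actual)]
    return math.sqrt(sum(err ** 2 for err in errors) / len(errors))


def share_within(estimated, actual, threshold):
    """Fraction of estimates whose absolute error is at most `threshold`."""
    errors = [abs(e - a) for e, a in zip(estimated, actual)]
    return sum(err <= threshold for err in errors) / len(errors)


# Hypothetical team seasons, for illustration only (not real data).
estimated_runs = [750.0, 812.0, 698.0, 905.0]
actual_runs = [771.0, 800.0, 730.0, 899.0]

team_rmse = rmse(estimated_runs, actual_runs)
within_one = share_within(estimated_runs, actual_runs, team_rmse)
print(f"RMSE: {team_rmse:.2f} runs, share within one RMSE: {within_one:.0%}")
```

With a real sample of team seasons, the second number is the one to compare against the 68 percent you would expect from a normal distribution.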

Now, we know that if we figure the linear weights for each hitter (using the same linear weights and assuming that individual player and team totals reconcile) and add them together, we will get the team linear weight totals. By the same token, combining the error for each player should equal the error at the team level.

This is where we need to be careful. If we add the absolute value of each player’s error, it will be much larger than our standard error – that’s because some players will have estimates lower than their actual production, and some players will have estimates higher than their actual production.

So how do we figure out the absolute value of each player’s error?

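Our standard error is simply the square root of the variance, and as you may recall from last week’s article, variances add. So the standard error for the team is the square root of the sum of the variances of its plate appearances. And recall that adding the same term to itself some number of times is the same as multiplying it by that number, so if every plate appearance carries the same variance, the sum is just that variance times the number of plate appearances.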

So let’s have x stand for the standard error of a single plate appearance. Let’s plug in all of our variables, like so:

22.38 = SQRT(6121*x^2)

In other words, our standard error at the team level is equal to the square root of the sum of the squares of the standard error of each plate appearance. Using a little algebra, we can solve for x; each plate appearance carries a standard error of roughly .286 runs. Then to figure out the standard error for a player given a certain number of plate appearances, we can simply plug in that value for x and we get a generalized formula of:

SQRT(PA*.286^2)
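The algebra is short enough to check in a few lines. This is only a sketch of the step described above, using the 22.38-run and 6121-PA figures from the team sample; the constant and function names are mine.

```python
import math

TEAM_STANDARD_ERROR = 22.38  # runs, from the 1993-2008 team sample
TEAM_PLATE_APPEARANCES = 6121


def per_pa_standard_error(team_error, team_pa):
    """Solve team_error = sqrt(team_pa * x**2) for x."""
    return team_error / math.sqrt(team_pa)


x = per_pa_standard_error(TEAM_STANDARD_ERROR, TEAM_PLATE_APPEARANCES)
print(round(x, 3))  # 0.286

# Plugging x back into the formula should reproduce the team-level error.
rebuilt = math.sqrt(TEAM_PLATE_APPEARANCES * x ** 2)
print(round(rebuilt, 2))  # 22.38
```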

The interesting thing going on here is that our per-PA error goes down as our number of PAs goes up. Of course, the total standard error rises with PAs, but not linearly. Observe:

The vertical axis refers to the size of the standard error per plate appearance; the horizontal axis refers to the number of plate appearances. Pay close attention to the horizontal axis – it’s on a logarithmic scale, so the values grow exponentially between tick marks. As you can see, the graph drops very rapidly at the beginning and then levels off.

So for a player with 150 plate appearances, the standard error for our estimate of his contribution to team runs is 3.5 runs. At 300 PAs, that goes up to 4.95 runs. At 650 PAs, that goes up to 7.29 runs. (Remember, that’s absolute error, and can be either above or below the estimate.)
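Those three figures come straight out of the generalized formula. Here is a small sketch that reproduces them, along with the per-PA error that shrinks as playing time grows. It assumes the same .286 runs per plate appearance for every hitter, which is the simplification this "quick" method relies on, and the function names are my own.

```python
import math

PER_PA_STANDARD_ERROR = 0.286  # runs per plate appearance, solved for earlier


def player_standard_error(pa, per_pa_se=PER_PA_STANDARD_ERROR):
    """Standard error, in runs, of a run estimate over `pa` plate appearances."""
    return math.sqrt(pa * per_pa_se ** 2)


def error_per_pa(pa, per_pa_se=PER_PA_STANDARD_ERROR):
    """The same standard error expressed per plate appearance."""
    return player_standard_error(pa, per_pa_se) / pa


for pa in (150, 300, 650):
    total = player_standard_error(pa)
    rate = error_per_pa(pa)
    print(f"{pa} PA: {total:.2f} runs total, {rate:.4f} runs per PA")
```

The total grows with the square root of plate appearances, so quadrupling a player's playing time only doubles the error.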

### Comparing players

So let’s say that we have two players that we want to compare. We know how to figure the standard error for each player. Now how about the standard error for the difference between the two players? We need to combine the standard error for each player; again, recall that variances add:

SQRT(SD_P1^2 + SD_P2^2)

where SD_P1 is the error for player one and SD_P2 is the error for player two. That gives us the standard error for the difference between the two players. If the difference between the estimated runs for the two players is larger than one standard error, that means we’re about 68 percent confident that the player with the higher estimated runs actually produced more runs for his team. (If you want to know how far the distance needs to be between players to have 95 percent confidence, multiply the term for the error of the difference by 1.96.)
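To make the comparison concrete, here is a sketch that combines the two player errors and expresses the gap between two hitters in units of that combined standard error. The run totals are invented for illustration, the function names are mine, and the calculation assumes the two players' errors are independent and that each plate appearance carries the same .286 runs of error.

```python
import math

PER_PA_STANDARD_ERROR = 0.286  # runs per plate appearance, assumed equal for all hitters


def player_standard_error(pa, per_pa_se=PER_PA_STANDARD_ERROR):
    """Standard error, in runs, of a run estimate over `pa` plate appearances."""
    return math.sqrt(pa * per_pa_se ** 2)


def difference_standard_error(pa_one, pa_two):
    """SQRT(SD_P1^2 + SD_P2^2): variances add for independent errors."""
    return math.hypot(player_standard_error(pa_one), player_standard_error(pa_two))


def gap_in_standard_errors(runs_one, runs_two, pa_one, pa_two):
    """How many standard errors of the difference separate the two estimates."""
    return abs(runs_one - runs_two) / difference_standard_error(pa_one, pa_two)


# Two hypothetical full-time hitters, 10 estimated runs apart.
se_diff = difference_standard_error(650, 650)
gap = gap_in_standard_errors(100.0, 90.0, 650, 650)
print(f"standard error of the difference: {se_diff:.2f} runs")
print(f"gap: {gap:.2f} standard errors")
```

For two 650-PA hitters the standard error of the difference comes to a little over 10 runs, so a 10-run gap between them sits just under one standard error.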

### Some cautions

As noted above, this is the “quick” way of figuring it out. Not all players have the same error per plate appearance: power hitters tend to have more variance per plate appearance than your typical hitter, for instance. Players who make a lot of outs have less variance per plate appearance. We’ll go into this in more detail next week.

**References & Resources**

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at “www.retrosheet.org”.

Some very helpful math came from this article.

Colin – interesting article, as always. I think the following sentence has a couple problems: “If the difference between the estimated runs for the two players is smaller than twice the standard error, that means we’re about 68 percent confident that the player with the higher estimated runs actually produced more runs for his team.”

You mean *larger* than *one* standard error then we’re at least 68 percent?

“If the difference between the estimated runs for the two players is smaller than twice the standard error, that means we’re about 68 percent confident that the player with the higher estimated runs actually produced more runs for his team.”

Colin – Shouldn’t this sentence read “larger” instead of “smaller”?

Whoops! Overlapping posts.

Yes, both of you are correct. Sorry about that.

What about accounting for in-team covariance of estimates? Adding variances is fine for differences of uncorrelated estimates, like we would get for players across different years or something, but it might make a difference if we are comparing players on the same team. Take Mauer and Morneau, two players who make up a large portion of their team’s runs: it seems plausible that many errors that lead us to over-rate Mauer’s production (protection in the lineup being the one that comes to mind) will lead us to under-rate Morneau’s production. This would make the test of differences less accurate.…