# A Theoretical Blueprint for Improving MLEs

*Editor’s Note: To give you a further taste of The Hardball Times Baseball Annual 2015, we are reprinting this article in its entirety. You can purchase the book here.*

The following article is probably different from what you would normally read in these pages. We don’t analyze data to find a specific result, but we take a theoretical approach to a concept that has been around for a long time. This article is simply a plan to improve upon the current construction of Major League Equivalencies (MLEs).

The baseball analytics community has come a long way since Bill James called MLEs one of the most important ideas he ever had, but the structure of MLEs has changed little since James introduced them. MLEs are one of the most valuable and fascinating areas of baseball analytics, and there is still much work to be done before we can be satisfied with them.

Tom Tango wrote an essay titled, “Issues with MLEs: Why I hate how they are used,” in which he outlines five issues he has with MLEs. He concludes the article with this summary of the current state of the adjustments:

> “MLEs, as currently published, is a first step. We have many steps to go through before we can reduce the error range. We should not treat the currently-published MLEs as a final product.”

We agree with Tango, and we feel we can take a few steps toward improving the adjustments applied to minor league performance. While our list shares some ideas with Tango’s, we expand on them, take our own stance, and propose how each problem might be handled, in hopes that others will take these ideas on in the future.

### What Are MLEs?

MLEs are a translation of minor league statistics to major league ones. In other words, they approximate how well a player would have done if he had been in the major leagues, based on his performance in the minor leagues. It is important to note that MLEs are not projections. You can think of them as exchange rates that only convert minor league statistics into a major league context.

MLEs are typically multiplicative, which means that you simply multiply minor league performance by the MLE to get the adjusted value. For example, if the MLE from Triple-A to the majors for HRs is 0.8, and Player A hits 20 HR in Triple-A, we would have expected him to hit 0.8*20=16 homers in the majors if he had played at the top level over the same time period. MLEs make no claims about how Player A will perform in any other time period in the future.

### Goal of MLEs

Any time you embark upon a project, it is a good idea to take a step back and lay out the model’s goals. We feel MLEs should be the best possible tool for translating a player’s performance from the minor leagues to the major leagues. These translations should be the best unbiased estimator of how a player would have performed in the major leagues given his minor league performance over the same time period for every player. Therefore, the adjustments need not be identical for all players at the same level, as variation in level may affect players differently.

These MLEs can be used for a variety of work in baseball, but we feel they are most useful in player forecasting. While remembering that an MLE is not a forecast unto itself, it is an essential part of a good forecasting system that uses minor league performance. These adjustments are essential because the forecaster needs to be able to use minor league statistics in the same way he or she uses major league statistics. Therefore, he or she can use MLEs to adjust minor league statistics so they correspond to the major league context. In other words, MLEs are important, and we hope to get the ball rolling towards improving forecast accuracy for young and inexperienced players.

### Current Methodology

Below, we outline the general methodology used to generate MLEs.

First, restrict the sample of players to those who played at each of the two levels being compared over a short period (i.e. less than two years) but who still have a decent sample size of plate appearances or total batters faced at both levels (i.e. > 50 plate appearances or total batters faced). Next, make a few other adjustments to ensure you are using the best sample of players possible. For instance, Brian Cartwright makes sure he only uses starting players who have at least 2.5 PA/G as hitters. Then, take each pair of performance samples and divide the major league performance by the minor league performance. The average of these ratios is the calculated MLE.
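The procedure above can be sketched in a few lines of code. Everything here is hypothetical — the paired player records, the rates, and the hard 50 PA cutoff — and is meant only to make the ratio-averaging step concrete:

```python
# Each record pairs one (hypothetical) player's rate stat at the minor
# league level with his rate at the major league level over the same window.
paired_samples = [
    # (minor_league_rate, mlb_rate, minor_pa, mlb_pa)
    (0.040, 0.030, 400, 250),
    (0.035, 0.030, 350, 300),
    (0.050, 0.038, 500, 120),
    (0.045, 0.036, 60, 55),    # small samples, but above the cutoff
    (0.030, 0.031, 45, 200),   # dropped: fewer than 50 minor league PA
]

MIN_PA = 50  # minimum plate appearances required at each level

def simple_mle(pairs, min_pa=MIN_PA):
    """Average the MLB/minors ratio over players who qualify at both levels."""
    ratios = [mlb / minors
              for minors, mlb, minor_pa, mlb_pa in pairs
              if minor_pa >= min_pa and mlb_pa >= min_pa]
    return sum(ratios) / len(ratios)

mle = simple_mle(paired_samples)  # about 0.79 for these invented players
```

Note how the filter changes the answer: lowering `min_pa` admits the noisy fifth record and shifts the average, which is exactly the kind of sample-selection sensitivity the article discusses below.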

The methodology becomes slightly more complicated when we examine jumps of more than one level (i.e. Double-A to MLB). In this case, the MLE can be constructed with or without a method that we call chaining. With chaining, we would calculate an MLE for both Double-A to Triple-A and Triple-A to MLB and multiply, or chain, the two together to find the MLE. The other option would be to ignore chaining and simply calculate the MLE for players that played in both Double-A and MLB within the allotted amount of time. We will discuss the merits of each of these approaches in the next section.
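A minimal sketch of chaining, with invented one-level adjustment factors:

```python
# Hypothetical one-level adjustments; chaining multiplies them to span
# the two levels between Double-A and the majors.
mle_aa_to_aaa = 0.90   # Double-A -> Triple-A (assumed value)
mle_aaa_to_mlb = 0.85  # Triple-A -> MLB (assumed value)

# Chained Double-A -> MLB factor
mle_aa_to_mlb = mle_aa_to_aaa * mle_aaa_to_mlb  # 0.765

# Applying it: 20 HR at Double-A translates to an expected MLB total
expected_mlb_hr = 20 * mle_aa_to_mlb  # 15.3
```

The direct (unchained) alternative would instead estimate `mle_aa_to_mlb` in one step from players who appeared at both Double-A and MLB within the allotted window.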

### Existing Issues

There are several issues with the above methodology that we feel have gone unaddressed for too long. Below, we outline six major issues with the current implementation of MLEs that we believe hinder the accuracy of major forecasting systems.

#### 1. Minor League Over-performance

When we look at the sample of players promoted from one level to another, we need to consider the thought process of the farm directors and general managers who decide when a player will be promoted. When making these decisions, they take several pieces of information into account, including roster construction, injuries, player attitude, scouting reports and player performance stats. Because these decision makers take a player’s minor league performance into account, promoted players are more likely to have been over-performing than under-performing their true talent level while in the minor leagues.

Thus, we have a bias, as our MLEs will artificially be more extreme than they should be because the denominator (i.e. minor league performance) will overestimate players’ talent levels.

#### 2. Aging Effects

A second issue is that MLEs currently ignore player age, even though players will age between the middle of their minor league stint and the middle of their major league stint used to calculate the MLEs. This age difference poses a problem, as players of different ages will experience various magnitudes of aging effects on their performance.

For instance, consider the difference between Bryce Harper and Rick Ankiel. Harper was first called up to the major leagues at the age of 19, while Ankiel was first called up as a position player at 28 years of age. Even if the age gaps between their minor league and major league stints were identical, the two would still experience different aging effects because Harper is on a much steeper part of the aging curve than Ankiel. Therefore, an MLE constructed from many more Bryce Harpers than Rick Ankiels would be artificially closer to one because the MLB performance would be inflated by the players who benefited from aging a year or two. These influences can be fairly large, and they might not wash out in the sample. Therefore, it is important that MLE calculations account for this bias.

#### 3. Skill Profile

Another major issue with the current construction of MLEs is that league changes will affect each player differently. For instance, consider two position players with average strikeout percentages at Triple-A. These two players differ in that Player A has a high walk percentage, while Player B has a low walk percentage at the Triple-A level. Currently, we assume that a change in level will affect both of these players identically on average.

However, this may be a naïve assumption, as Player A could be a player who has good patience, while Player B might have a problem making contact on his swings. Therefore, we would expect these players to experience a promotion to MLB differently. We would think that Player A would have a lower strikeout percentage than Player B at the major league level because Player B strikes out from ineptitude, while Player A strikes out through strategy. Thus, the MLE process would benefit from the examination of a player’s other peripheral statistics instead of creating one value that applies to every player.

#### 4. No Error Estimate

Historically, MLEs have been treated as simple estimates of league adjustments, and researchers and forecasters have ignored their uncertainty levels. This neglect is unacceptable for MLEs. Our adjustment for a transition from High-A to the majors is much more uncertain than the corresponding adjustment from Triple-A to the bigs. This poses a major problem when we combine league-adjusted statistics for different minor league levels and regress to the mean.

The major factor in how much we regress performance to the mean is the uncertainty around our true talent level estimates. In the majors, we can express this uncertainty in terms of the number of plate appearances in our sample. However, at the minor league level, we also must embed the uncertainty in our MLE estimate into the overall model uncertainty. Thus, we need to report our MLE estimates with a standard error in addition to the expected value of our MLEs. This will allow forecasters to essentially weight performance at higher levels more than lower levels based on the uncertainty of the MLE estimates, similar to how league-average performance is combined with player performance when we regress towards the mean.
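One common way to implement this kind of weighting — a sketch, not necessarily what the authors have in mind — is inverse-variance weighting, where each level's adjusted rate is weighted by the reciprocal of its total variance (binomial sampling variance plus the MLE's own variance). All rates, sample sizes, and standard errors below are invented:

```python
import math

def total_sd(rate, pa, mle_se):
    """Sampling SD of a rate combined in quadrature with the MLE's standard error."""
    sampling_var = rate * (1 - rate) / pa
    return math.sqrt(sampling_var + mle_se ** 2)

# Adjusted K% from two levels; the High-A MLE carries more uncertainty
# than the Triple-A one, so its sample gets down-weighted.
levels = [
    # (adjusted_rate, pa, mle_standard_error) -- invented numbers
    (0.22, 400, 0.010),  # Triple-A
    (0.20, 300, 0.030),  # High-A
]

weights = [1 / total_sd(r, pa, se) ** 2 for r, pa, se in levels]
combined = sum(w * r for w, (r, _, _) in zip(weights, levels)) / sum(weights)
# combined lands nearer the Triple-A rate, reflecting its lower uncertainty
```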

#### 5. A Better Look at Chaining

Currently, there are two major schools of thought on MLEs. Most systems use a method called chaining: calculating league adjustments one level at a time (i.e. Double-A to Triple-A and Triple-A to MLB) and multiplying together the adjustments for every level a player must pass through to reach the majors. For example, the MLE for Double-A would be the product of an adjustment from Double-A to Triple-A and an adjustment from Triple-A to the majors.

On the contrary, Brian Cartwright does not use chaining in his forecasting system, OLIVER. In an article he wrote for Baseball Prospectus in 2009, he shows that his MLEs perform better when they are not chained. While his results are intriguing, we believe that an inherent bias in the test sample means they require further testing.

All of the players that Cartwright uses for testing are good enough to reach MLB. Therefore, the direct method, which only incorporates players that played at both levels, will be overly optimistic in its adjustments because it only includes players who are promoted very quickly. This bias could create more accurate forecasts for players that do reach the majors, but it could be detrimental for those players a team is unsure of whether to promote or not. This bias presents an inherent problem, as the MLE could see an improvement in test results while actually being a worse estimate.

#### 6. Experience at Level

Another bias current MLEs incorporate is the experience that a player has at a given level. There is evidence that minor league players perform better as they rack up experience at that level. This means there is a subsample of minor league performance data that is used in the calculation of MLEs that is inherently inflated above the players’ true talent levels. Thus, our MLE estimates would benefit from removing the effect that experience at a level has on player performance in the minor leagues.

### Solutions to Problems

In this section, we will go through each of the six issues we outlined above and propose a potential change to the current MLE model that we would like to see people use to craft MLEs in the future.

#### 1. Minor League Over-performance

Promoted players are likely to be overperforming their true talent levels (think Gregory Polanco when he was promoted to the Pirates). To work towards removing this selection bias, we propose regressing minor league stats towards the mean for MLEs. This will move the population average closer to the true talent level for the player group that we want to study (i.e. minor league players who could be promoted to the major leagues).

The big question here is what value we should regress these performance statistics toward. We can think of two possible choices:

- Average of players within the league. This option removes a portion of the overperforming-player bias by shifting inflated statistics toward the league average, which is exactly what we want. The main argument for regressing to the league average rather than the level mean is that environments can differ substantially from one league to the next at the same level. However, those league differences can mostly be handled with park factors, so we would prefer the larger sample a level-wide prior provides.
- Average of the entire level. We therefore propose that minor league performance statistics be regressed to the level mean, so an individual player’s Triple-A strikeout percentage would be regressed toward the average K% of all Triple-A players. This is the best option we could come up with because it uses the largest available sample of players without amplifying the overperforming-player bias.
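The proposed regression to the level mean is standard shrinkage: add a ballast of plate appearances at the level average to the player's observed line. The ballast amount below is an arbitrary placeholder; in practice it would be derived from the spread of talent at the level:

```python
def regress_to_level_mean(rate, pa, level_mean, ballast_pa=300):
    """Shrink an observed rate toward the level average.

    ballast_pa is a hypothetical regression amount, not an
    empirically derived value.
    """
    return (rate * pa + level_mean * ballast_pa) / (pa + ballast_pa)

# A Triple-A hitter with a 30% K rate in 200 PA, at a level
# where the average K rate is 22%:
regressed = regress_to_level_mean(0.30, 200, 0.22)  # 0.252
```

The regressed rate, not the raw one, would then go into the denominator of the MLE calculation.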

#### 2. Aging Effects

Next, we will address the issue of aging. To incorporate the various magnitudes of aging effects on players in the major league and minor league samples, we propose that all statistics be aged to the same year. This will essentially create age-neutral statistics for every player’s performance at both the minor league and major league levels.

We recommend performance be aged forward or backward to the peak age on the aging curve used. While it does not matter what age you choose, it does need to be consistent. The removal of aging bias from MLE calculations requires a different aging method than most researchers tend to use. Normally, people create aging curves using only major league performance. This method would not yield the optimal results, as you would be applying the aging effect to a sample of players that is different from the sample the curve was designed for.
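Aging every performance to a common peak age could look like the sketch below, assuming a simple multiplicative aging curve. The curve values are invented; as discussed here, a real curve should be fit to the population of recently promoted players rather than to major leaguers generally:

```python
PEAK_AGE = 27

# Hypothetical multiplier of true talent at each age relative to peak.
# A real curve would be estimated from promoted-player data.
aging_curve = {19: 0.80, 21: 0.86, 23: 0.92, 25: 0.97, 27: 1.00, 29: 0.98}

def age_neutral(rate, age):
    """Scale an observed rate to its peak-age equivalent."""
    return rate * aging_curve[PEAK_AGE] / aging_curve[age]

# A 21-year-old's 0.040 HR rate is worth more once aged to peak,
# while a 27-year-old's rate is unchanged.
peak_rate = age_neutral(0.040, 21)
```

Both the minor league and major league halves of every paired sample would be run through this adjustment before the ratio is taken, removing the aging bias from the MLE itself.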

MLEs do not look at players who were in the majors for all years in the sample. Rather, they use data for players who have been promoted from the minor leagues to the major leagues recently. Therefore, we believe this information should be incorporated into any aging curve used to adjust performance in MLE calculations.

#### 3. Skill Profile

To solve the problem of level changes affecting each player differently, we propose that personalized MLEs be created for each player. Personalizing these adjustments would create a system in which every player is treated uniquely. For instance, a player who tends to strike out often would have a different MLE than a player who is able to make contact on all pitches. While we will not hypothesize on what these differences might look like, we can say that they should be treated differently when trying to predict their major league performances from their minor league performances.

We propose a solution to this issue that uses similarity scores to construct an MLE for each player. In this case, a similarity score would fall between zero and one. You can think of this as a weight. Therefore, we would create an MLE in the same way that we normally would with one small change. Rather than taking the straight average of the ratio of major league to minor league performance, the weighted average would be computed with weights proportional to the similarity scores.
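The weighted average described here is a small change to the basic calculation. The ratios and similarity scores below are invented for illustration:

```python
# Comparable players for one target player: each entry is the comp's
# MLB/minors ratio and his similarity (0 to 1) to the target.
comps = [
    (0.75, 0.9),  # very similar player
    (0.85, 0.4),  # dissimilar player, down-weighted
    (0.70, 0.8),
]

def weighted_mle(pairs):
    """Similarity-weighted average of MLB/minors ratios."""
    total_weight = sum(sim for _, sim in pairs)
    return sum(ratio * sim for ratio, sim in pairs) / total_weight

personal_mle = weighted_mle(comps)
```

With similarity scores of 1.0 for everyone, this collapses back to the ordinary MLE, so the personalized version strictly generalizes the current method.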

Now, the most difficult part of this process would be calculating the similarity scores. While it would be easy to use the simple similarity scores that Baseball-Reference publishes for all players, those would be an inappropriate measure of player similarity for MLE construction. Rather, we want to key in on the variables that matter for successfully changing levels. To find them, it would be best to run a regression or other machine-learning technique to determine which variables most influence how players adjust to new leagues. These factors would differ for each statistic we adjust and could range from biographical information like height to performance statistics like isolated power (ISO). From there, we can calculate the differences in these variables between players, weight those differences into a similarity score, and use that score in the weighted MLE to produce a personalized translation for each player and minimize our forecast errors.

#### 4. No Error Estimate

Next, we address the question of how to measure the uncertainty surrounding our MLE estimates. To make any inferences about the MLE sampling distribution, we must estimate that distribution, and to do so we use a technique called bootstrapping. We repeatedly take the average of a random sample (drawn with replacement) of players’ ratios between minor league and major league performance. This leaves us with many resampled averages, which we treat as the resampling distribution of the MLE: a representation of how likely we would be to draw each value of the MLE at random.

We could report this distribution directly, but it would be difficult to incorporate into the forecasting process without further assumptions. Instead, we assume the resampling distribution is normal, which is reasonable thanks to the central limit theorem (which states that if we take enough independent samples from any probability distribution, the average of those samples will follow a bell curve centered on the true mean). Therefore, we can take the standard error of the MLE estimate to be the standard deviation of the resampling distribution and use it to weight level-adjusted statistics by their true uncertainty. Because of the normality assumption, knowing the expected value (i.e. the MLE) and the standard error gives us everything we need to recreate the MLE distribution.
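The bootstrap procedure sketched in this section can be written in a few lines. The ratios below are invented; a real implementation would resample the full player pool described earlier:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Invented MLB/minors ratios for the players in the sample
ratios = [0.75, 0.86, 0.76, 0.80, 0.70, 0.92, 0.78, 0.83]

def bootstrap_se(values, n_resamples=5000):
    """SD of the resampled means: the bootstrap standard error of the MLE."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(values) for _ in values]  # with replacement
        means.append(sum(resample) / len(resample))
    return statistics.stdev(means)

mle = sum(ratios) / len(ratios)  # 0.80 for these invented ratios
se = bootstrap_se(ratios)        # roughly 0.02 here
```

Under the normality assumption, reporting `(mle, se)` fully describes the MLE distribution for forecasters downstream.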

#### 5. A Better Look at Chaining

Chaining presents an issue we have found very difficult to solve. The biased data makes it almost impossible to objectively compare the two methods (i.e., chaining or not). With that said, we believe it to be important that we can confidently say which method we prefer, or that either method works just fine. Therefore, we propose a Bayesian framework to test the utility of each method. Rather than testing the MLEs in a simple forecast and choosing the model with the smaller error, we suggest that the errors are compared while holding onto a prior belief that chaining is the better method. This Bayesian procedure will allow for a comparison of the accuracy of the two methods with the understanding that the test data are biased in favor of not chaining. This is admittedly an imperfect solution and an area for further brainstorming.

#### 6. Experience at Level

Incorporating experience at level will prove to be a difficult task going forward. To make this happen, we must quantify the effect of playing another game at the same level. While it would be easy to assume this phenomenon is a linear effect (i.e. the performance boost from game one to game two is identical to the boost from game 1,000 to game 1,001), we cannot do that because we expect there to be diminishing marginal returns for the effect. Therefore, you would gain a larger boost for games earlier in your career at that level than you would for later games.

What’s needed is a slightly more complicated model that handles diminishing marginal returns well. We believe the best model for this would be logarithmic: we want to estimate the effect of log(games played at level) on the player’s performance. As always, we need to fit separate models for each statistic.

While we would like to solve this issue with a simple linear regression, a sampling bias gets in the way: players who play more games at a certain level are probably performing worse than players who play fewer games. To counteract this bias, we assume that, on average, an extra game affects all players identically. We understand this may not be a good assumption, as we treat the equivalent assumption as a flaw in current MLEs, but we do not have a better solution at this time. Therefore, if we take a subset of players who have played at least n games, we can run a linear regression that uses log(games played at level) to predict player performance at that level. Then, we can adjust the performance statistics that go into the MLE model for the effect that experience at that level has on each player.
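A minimal version of the proposed fit, using ordinary least squares on log(games) with invented data points (a real fit would use the restricted player sample described above):

```python
import math

# Invented level-experience data: K rate falls as a player logs more
# games at the level, with diminishing returns.
games = [10, 25, 60, 120, 250, 500]
k_rate = [0.260, 0.252, 0.246, 0.240, 0.237, 0.234]

x = [math.log(g) for g in games]
y = k_rate
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form ordinary least squares slope and intercept
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
den = sum((xi - x_bar) ** 2 for xi in x)
slope = num / den            # negative: experience lowers K rate
intercept = y_bar - slope * x_bar

def expected_rate(g):
    """Experience-driven expected rate after g games at the level."""
    return intercept + slope * math.log(g)
```

Subtracting `expected_rate(g) - expected_rate(g0)` from a player's observed rate (for some reference experience level `g0`, a choice we leave open) would strip out the experience effect before the MLE is computed.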

### Summing Up

MLEs can be improved upon, and here we have a blueprint for doing so. Some of these changes may not help the process, but we believe they are all worth testing to see if they lead to any measurable improvements in forecast accuracy. Improving MLEs will be a complicated process, but we believe the framework here will reap benefits for those who have the time to act upon it.

### References & Resources

- Tom Tango, Tangotiger, “Issues with MLEs: Why I hate how they are used”
- Brian Cartwright, Baseball Prospectus, “Prospectus Idol Entry: Brian Cartwright’s Initial Entry”
- Brian Cartwright, The Hardball Times, “Oliver, smarter than your average monkey”

If the primary purpose is forecasting, why do you need MLEs anyway? If one is drilling down to the level of skill profile, why not simply do away with MLEs (at least mostly)? You do need to have some sort of equivalency between the minor league levels, and from there you can simply look at the subsequent performance of players of the same age and similar skill profile/performance for forecasting. I do not see that it is necessary to introduce the intermediate step of an MLE. How about calling it a “neutral triple A equivalency,” but with the understanding that it is not simply a slash line?

So, let’s take Dalton Pompey’s 2014 season. He is 21. He hits .315/.394/.467 in 317 PAs in Dunedin with a .375 BABIP, an IsoP of .152, a 46.9% GB rate, a 17.7% K rate and an 11% BB rate. He steals 28 bases and is caught stealing 2 times. Let’s say that the translation from High A to neutral triple A requires a multiplier of .9 (on average) for BABIP, .8 for IsoP, 1.0 for GB rate, 1.2 for K rate and .8 for BB rate. In Pompey’s case, you would probably want to use a higher BABIP multiplier because of the evident speed, let’s say .95. We end up with a triple A equivalent of .356 BABIP, .121 IsoP, 21.2% K rate and 8.8% BB rate. Pompey’s neutral triple A equivalent becomes .284/.347/.396 with a 28/67 W/K, 5 homers and perhaps 26-4 SB/CS in 317 PAs. You then do the same thing for his double A work and convert his triple A work in Buffalo to a neutral setting. You then can get a full neutral triple A equivalent line for his 500 PAs from High A to triple A in 2014. Let’s say the result is .287/.350/.400 with 45/100 W/K, 9 homers, a 47% GB rate, a 9% W rate, a 20% K rate, and a 40/9 SB/CS rate.

The hard part of projection is not this part, but rather what a 21 year old is likely to do in the major leagues the following season if given a 500 PA trial? Most players with this projection do not get such a trial, with performance in spring training and early the following season in the high minors playing a significant role (as well as the player’s defensive position and ability). You could look at 21 year olds with very comparable performance, and 21 year old centerfielders with somewhat comparable performance to see what the average performance is the following year. You will have a disproportionate number of players who do not receive any significant playing time that year, due in part but not wholly to ability.

I do not think it adds anything in the projection process to make the step from neutral triple A equivalency (over several years if available) to MLE.

Thanks for the comment Mike. I think that the question stems a lot from different definitions we may have for what an MLE is. When I think of an MLE, I do not see a slash line of translated statistics. Rather, I think of a quantification of the effect moving from league A to league B has on a player. In our case, league B happens to be the MLB, but it really doesn’t matter in terms of forecasting, as long as you are consistent across players.

You are correct in saying that you don’t need MLEs in your forecast if you just look at how “similar” players performed in the past. We can say that two players are “similar” if they have similar skill profiles and performances. MLEs can make your forecast more accurate, as they allow you to increase your sample size of “similar” players. If you do not use MLEs, you will have to consider similar performance at each individual level. If you adjust for level, however, you can simply consider similarities in adjusted performance. Therefore, your pool of players is much larger, and you should have a more accurate population to forecast with.

Thanks, Kevin. OK, I see the point. I doubt though that you will get a more accurate projection. What you gain in a larger sample size, you lose in the uncertainty of the minors to majors translation and the difference between minor league PAs and major league PAs. I should elaborate on the latter issue. The very fact that a player has had a significant number of major league PAs matters because of service time/option rules.

So, let’s say that you have two players with similar skill sets to Pompey. One has played in the minor leagues at age 21 and has a similar neutral triple A performance to Pompey. Another has played in the major leagues at age 21 and has major league performance in 500 PAs which you determine to be equivalent to Pompey’s minor league performance. The very fact that he has received 500 major league PAs makes him different in an important way from Pompey.

Regarding the first half of the article – age is now an integral part of my Oliver MLEs. I adjust all player/team/seasons for ballpark and exact decimal age before making comparisons to calculate the league factors. I calculate two types of MLEs, with ‘present’ and ‘peak’ (age 27) values. Factors are calculated for and applied to the rates for each component stat, and are combined in a decision tree to get the final counting stats total. For example, a higher SO% means fewer balls in play, which results in fewer total home runs, even if the HR% hadn’t changed.

Once I have ballparks, ages and leagues accounted for, I combine the performances to create a projection. It is at that time that I consider the reliability of each sample, applying weights for that as well as overall recency. The amount of regression is based on the overall sample size for each rate stat. The less reliable a sample is, the less weight it receives, and thus the more the overall projection that it is a part of is regressed.

Good stuff. A lot to think about here.

I think that once you start age-adjusting the stats (which I agree, of course, that you should) you can start using data pairs for transitions that aren’t in the same or even consecutive years. So, if you age-adjusted a player’s stats from high-A, you can compare them to his MLB stats three years later to get the high-A to MLB factor. When coming up with the factors, transitions with more time/years between them would just receive less weight than transitions where the two sets of data are closer in time (because over a longer gap the player is likely to have changed more in age-adjusted ability).

Mike, I think the solution to the choice between basing the analysis on similar players and using all players is to give more weight to more similar players but to give all players some weight.

Re: #4, the error estimate, I think this is maybe harder than it sounds, or at least I wouldn’t be at all sure how to do it and come up with the appropriate weights. I think you’d have to divvy up the differences in (adjusted) performance into binomial variance, uncertainty from aging/changes in ability, and uncertainty in the MLE factor. I agree with the ultimate conclusion that this leads to lower weights as you go further down in the minors, and doing some analysis would be better than our current method of just guessing what those weights might be.

I’ll look further into bootstrapping to see if it is something that I might be able to utilize.

Regarding chaining vs. direct, I agree there are some biases in the process, but in the direct method there is only one step where the biases may appear; in chaining they are present at every step and grow each level further away from MLB. As Jared noted, if the samples are properly adjusted for age, there doesn’t have to be a strict limit on the number of years between samples.

Mike – I use three adjustments to the player’s performance data in making a projection:

1) for age

2) for ball parks

3) for level of competition

What Kevin is describing in this article is how to best determine the level of competition in a given league, relative to MLB.

Even if one never calculates or publishes a single season ‘MLE’, the level of competition is a necessary factor in the computation of projections

This is a non-MLE specific question, but something that came up when I saw your point about only considering players with >50 PAs. Obviously this brings about the weirdness of someone with 51 PAs making the sample and someone with 50 not, even though one isn’t any more predictive than the other. Is there a good reason not to include every possible sample, and also include a weight factor? Not linear, but something that followed the curve of significance closely, so that the difference between the factor for a player with 10 PAs and one with 50 is similar to the factor for a player with 50 and one with 200.

This seems like a simple idea, and I have to imagine someone else has suggested it before, so really what I’m asking is for someone smarter than me to explain why this isn’t really a good idea. Thanks!

PS — more relevantly to the article, this is really cool. I wasn’t planning on buying the Annual, but I think that just changed.

You’re correct; another method is to weight by either the smaller of the two samples being compared or by the harmonic mean of the two.

A more important issue, which Kevin mentioned, is that bench players suffer the ‘pinch-hit penalty,’ which is about .030 wOBA, IIRC. Guys may have played every day in Triple-A, then started once a week and did some PHing once promoted. Those are not apples-to-apples comparisons, so I use a PA/G cutoff to try to get only players with regular playing time in each sample.

For Royals’ minor leaguers, one of the main adjustments that needs to be made to projections is park factor. Both AA NW Arkansas and AAA Omaha (the ‘Chasers) play in bandbox parks more suitable for developing hitters for the AL East than for the K in the AL Central. We’ve seen that often enough with power bats like Hosmer and Moustakas over the years.

For predictive value, Wilmington may be the best the Royals have: a large pitchers’ park where hitters face pitchers who are usually more advanced for their professional experience than the hitters are. A hitter who does well in Wilmington is likely to do better in KC than a hitter who struggles at A+ and then lights up Springdale and Omaha.

Jim – We assumed that park factors were already controlled for at both the minor league and major league level. Sorry if we didn’t mention that in the article.

Brian – Thanks for the comments. I am happy to see that you are using aging in your MLEs now. Sorry that we weren’t aware of it. Can you talk a little bit more about how you use decision trees? I am not sure I follow. I am excited to see what changes you make in the future.

Great discussion here.

@Brian Is there an ETA for Oliver projections to be up on Fangraphs? Also, perhaps you could post an article with the top ~10 prospects your system is unusually high on for the long-term?