A Theoretical Blueprint for Improving MLEs

by Dave Allen and Kevin Tenenbaum
November 26, 2014

Editor's Note: To give you a further taste of The Hardball Times Baseball Annual 2015, we are reprinting this article in its entirety. You can purchase the book here.

Bryce Harper's age impacted the way his MLEs were calculated. (via Matthew Straubmuller)

The following article is probably different from what you would normally read in these pages. Rather than analyzing data to find a specific result, we take a theoretical approach to a concept that has been around for a long time. This article is simply a plan to improve upon the current construction of Major League Equivalencies (MLEs).

The baseball analytics community has come a long way since Bill James called MLEs one of the most important ideas he ever had, but the structure of MLEs has changed little since James introduced them. MLEs are one of the most valuable and fascinating areas of baseball analytics, and there is still much work to be done before we can be satisfied with them.

Tom Tango wrote an essay titled "Issues with MLEs: Why I hate how they are used," in which he outlines five issues he has with MLEs. He concludes the article with this summary of the current state of the adjustments: "MLEs, as currently published, is a first step. We have many steps to go through before we can reduce the error range. We should not treat the currently-published MLEs as a final product."

We agree with Tango, and we feel we can take a few steps toward improving the adjustments applied to minor league performance. While our list shares some ideas with Tango's, we expand on them, take our own stance, and offer our own proposals for handling these problems, which we hope others will take on in the future.

What Are MLEs?

MLEs are a translation of minor league statistics to major league ones.
In other words, they approximate how well a player would have done if he had been in the major leagues, based on his performance in the minor leagues. It is important to note that MLEs are not projections. You can think of them as exchange rates that only convert minor league statistics into a major league context.

MLEs are typically multiplicative, which means that you simply multiply minor league performance by the MLE to get the adjusted value. For example, if the MLE from Triple-A to the majors for home runs is 0.8, and Player A hits 20 HR in Triple-A, we would have expected him to hit 0.8*20 = 16 homers in the majors if he had played at the top level over the same time period. MLEs make no claims about how Player A will perform in any future time period.

Goal of MLEs

Any time you embark upon a project, it is a good idea to take a step back and lay out the model's goals. We feel MLEs should be the best possible tool for translating a player's performance from the minor leagues to the major leagues. For every player, these translations should be the best unbiased estimate of how he would have performed in the major leagues, given his minor league performance over the same time period. Therefore, the adjustments need not be identical for all players at the same level, as a change in level may affect players differently.

These MLEs can be used for a variety of work in baseball, but we feel they are most useful in player forecasting. While an MLE is not a forecast unto itself, it is an essential part of any good forecasting system that uses minor league performance. These adjustments matter because the forecaster needs to be able to use minor league statistics in the same way he or she uses major league statistics; MLEs adjust minor league statistics so they correspond to the major league context.
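The exchange-rate arithmetic described above, along with the chaining of adjustments across levels discussed later in the article, can be sketched in a few lines of code. The 0.8 factor is the hypothetical Triple-A home run adjustment from the example, not a real published value:

```python
from math import prod

def translate(minor_stat, mle):
    """Translate a minor league stat into a major league context by
    multiplying it by the level's MLE (exchange-rate style)."""
    return minor_stat * mle

def chained_mle(per_level_factors):
    """Chain single-level adjustments (e.g., Double-A -> Triple-A and
    Triple-A -> MLB) into one multi-level MLE by multiplying them."""
    return prod(per_level_factors)

# Hypothetical 0.8 HR adjustment from the example:
# 20 Triple-A home runs translate to an expected 16 MLB home runs.
expected_mlb_hr = translate(20, 0.8)
```

Note that nothing here is a forecast: the translated value describes the same time period as the minor league stat, just re-expressed in a major league context.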
In other words, MLEs are important, and we hope to get the ball rolling toward improving forecast accuracy for young and inexperienced players.

Current Methodology

Below, we outline the general methodology used to generate MLEs. First, restrict the sample to players who played at each of the two levels being compared over a short period (e.g., less than two years) but who still have a decent sample size at both levels (e.g., more than 50 plate appearances or total batters faced). Next, make a few other adjustments to ensure you are using the best sample of players possible. For instance, Brian Cartwright makes sure he uses only starting players who average at least 2.5 PA per game as hitters. Then, for each player, divide the major league performance by the minor league performance. The average of these ratios is the calculated MLE.

The methodology becomes slightly more complicated when we examine jumps of more than one level (e.g., Double-A to MLB). In this case, the MLE can be constructed with or without a method we call chaining. With chaining, we would calculate an MLE for both Double-A to Triple-A and Triple-A to MLB and multiply, or chain, the two together to find the Double-A MLE. The other option is to ignore chaining and simply calculate the MLE from players who played in both Double-A and MLB within the allotted amount of time. We will discuss the merits of each approach in the next section.

Existing Issues

There are several issues with the above methodology that we feel have gone unaddressed for too long. Below, we outline six major issues with the current implementation of MLEs that we believe hinder the accuracy of major forecasting systems.
1. Minor League Over-performance

When we look at the sample of players promoted from one level to another, we need to consider the thought process of the farm directors and general managers who decide when a player will be promoted. When making these decisions, they take several pieces of information into account, including roster construction, injuries, player attitude, scouting reports and player performance stats. Because these decision makers take a player's minor league performance into account, promoted players are more likely to have been over-performing than under-performing their true talent level while in the minor leagues. Thus, we have a bias: our MLEs will be artificially more extreme than they should be, because the denominator (i.e., minor league performance) will overestimate players' talent levels.

2. Aging Effects

A second issue is that MLEs currently ignore player age, even though players age between the middle of the minor league stint and the middle of the major league stint used to calculate the MLEs. This age difference poses a problem, as players of different ages will experience aging effects of various magnitudes. For instance, consider the difference between Bryce Harper and Rick Ankiel. Harper was first called up to the major leagues at age 19, while Ankiel was first called up as a position player at 28. Even if the age differences between their major league and minor league stints were identical, the two would still experience different aging effects, because Harper is on a much steeper part of the aging curve than Ankiel. Therefore, an MLE constructed from many more Bryce Harpers than Rick Ankiels would be artificially closer to one, because the MLB performance would be inflated by players who benefited from aging a year or two. These influences can be fairly large, and they might not wash out in the sample.
Therefore, it is important that MLE calculations account for this bias.

3. Skill Profile

Another major issue with the current construction of MLEs is that league changes will affect each player differently. For instance, consider two position players with average strikeout percentages at Triple-A. They differ in that Player A has a high walk percentage, while Player B has a low walk percentage. Currently, we assume that a change in level will affect both of these players identically on average. However, this may be a naïve assumption: Player A could be a player with good patience, while Player B might have a problem making contact on his swings. Therefore, we would expect these players to experience a promotion to MLB differently. We would expect Player A to have a lower strikeout percentage than Player B at the major league level, because Player B strikes out from ineptitude, while Player A strikes out through strategy. Thus, the MLE process would benefit from examining a player's other peripheral statistics instead of creating one value that applies to every player.

4. No Error Estimate

Historically, MLEs have been treated as simple point estimates of league adjustments, and researchers and forecasters have ignored their uncertainty. This neglect is unacceptable. Our adjustment for a transition from High-A to the majors is much more uncertain than the corresponding adjustment from Triple-A to the bigs. This poses a major problem when we combine league-adjusted statistics from different minor league levels and regress to the mean. The major factor in how much we regress performance to the mean is the uncertainty around our true talent estimates. In the majors, we can express this uncertainty in terms of the number of plate appearances in our sample. At the minor league level, however, we must also fold the uncertainty of our MLE estimate into the overall model uncertainty.
Thus, we need to report our MLE estimates with a standard error in addition to the expected value. This will allow forecasters to weight performance at higher levels more than at lower levels based on the uncertainty of the MLE estimates, similar to how league-average performance is combined with player performance when we regress toward the mean.

5. A Better Look at Chaining

Currently, there are two major schools of thought on MLEs. Most systems use a method called chaining: calculating league adjustments one level at a time (i.e., Double-A to Triple-A and Triple-A to MLB) and multiplying together the adjustments for each level a player must pass through to reach the majors. For example, the MLE for Double-A would be the product of an adjustment from Double-A to Triple-A and an adjustment from Triple-A to the majors. In contrast, Brian Cartwright does not use chaining in his forecasting system, OLIVER. In an article he wrote for Baseball Prospectus in 2009, he shows that his MLEs perform better when they are not chained. While his results are intriguing, we believe an inherent bias in the test sample means they require further testing. All of the players Cartwright uses for testing were good enough to reach MLB. Therefore, the direct method, which only incorporates players who played at both levels, will be overly optimistic in its adjustments, because it includes only players who were promoted very quickly. This bias could create more accurate forecasts for players who do reach the majors, but it could be detrimental for players a team is unsure whether to promote. It presents an inherent problem, as the MLE could see an improvement in test results while actually being a worse estimate.

6. Experience at Level

Another bias current MLEs incorporate involves the experience a player has at a given level.
There is evidence that minor league players perform better as they rack up experience at a level. This means that a subsample of the minor league performance data used to calculate MLEs is inherently inflated above the players' true talent levels. Thus, our MLE estimates would benefit from removing the effect that experience at a level has on minor league performance.

Solutions to Problems

In this section, we go through each of the six issues outlined above and propose a change to the current MLE model that we would like to see used to craft MLEs in the future.

1. Minor League Over-performance

Promoted players are likely to be over-performing their true talent levels (think Gregory Polanco when he was promoted to the Pirates). To work toward removing this selection bias, we propose regressing minor league stats toward the mean before calculating MLEs. This will move the population average closer to the true talent level of the player group we want to study (i.e., minor league players who could be promoted to the major leagues). The big question is what value we should regress these performance statistics toward. We can think of two possible choices:

1. Average of players within the league. This option does a good job of removing a portion of the over-performing player bias by shifting inflated statistics from over-performing players toward the league average, which is exactly what we want. The main reason to regress to the league average rather than the level mean would be that there are major environmental differences from one league to the next at the same level. However, those differences can mostly be handled with park factors, so we would rather find a larger sample for the prior.

2. Average of the entire level. Because it provides that larger sample, we propose that minor league performance statistics be regressed to the level mean.
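This shrinkage toward the level mean can be sketched in a few lines; the regression constant of 200 PA below is a placeholder assumption for illustration, not an empirically derived value:

```python
def regress_to_level_mean(player_rate, player_pa, level_rate, regression_pa=200):
    """Shrink an observed rate stat toward the level average, weighting
    the observation by its PA and the prior by a regression constant.
    regression_pa=200 is a placeholder, not a fitted value."""
    return ((player_rate * player_pa + level_rate * regression_pa)
            / (player_pa + regression_pa))
```

For example, a hitter with a .300 strikeout rate in 200 PA at a level averaging .220 would be regressed halfway, to .260, before his ratio enters the MLE calculation.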
Therefore, an individual player's Triple-A strikeout percentage would be regressed toward the average K% for all Triple-A players. This is the best answer we could come up with, because the level is the largest available sample of players that does not amplify the over-performing player bias.

2. Aging Effects

Next, we address the issue of aging. To account for the varying magnitudes of aging effects across the major league and minor league samples, we propose that all statistics be aged to the same age. This essentially creates age-neutral statistics for every player's performance at both levels. We recommend aging performance forward or backward to the peak age on the aging curve used; the specific age chosen does not matter, but it does need to be consistent.

Removing aging bias from MLE calculations requires a different aging method than most researchers tend to use. Normally, aging curves are built from major league performance only. That would not yield optimal results here, as the aging effect would be applied to a sample of players different from the sample the curve was designed for. MLEs do not look at players who were in the majors for every year in the sample; rather, they use data for players who were recently promoted from the minor leagues to the major leagues. Therefore, we believe this information should be incorporated into any aging curve used to adjust performance in MLE calculations.

3. Skill Profile

To solve the problem of level changes affecting each player differently, we propose that personalized MLEs be created for each player. Personalizing the adjustments would create a system in which every player is treated uniquely. For instance, a player who tends to strike out often would have a different MLE than a player who makes contact consistently.
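One way to implement this personalization, anticipating the weighting scheme described below, is a similarity-weighted average of promoted players' major-to-minor performance ratios. This sketch assumes the similarity scores have already been computed, which, as discussed below, is the hard part:

```python
def weighted_mle(ratios, similarities):
    """Personalized MLE: a weighted average of comparable players'
    MLB-to-MiLB performance ratios, with weights given by similarity
    scores in [0, 1]. A score of 0 drops a comp entirely."""
    total = sum(similarities)
    return sum(r * s for r, s in zip(ratios, similarities)) / total
```

With all similarities equal, this collapses to the ordinary (unweighted) MLE; as the scores concentrate on the closest comps, the adjustment becomes increasingly specific to the player being translated.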
While we will not hypothesize about what these differences might look like, we can say that such players should be treated differently when predicting their major league performance from their minor league performance. We propose a solution that uses similarity scores to construct an MLE for each player. Here, a similarity score would fall between zero and one; you can think of it as a weight. We would then create an MLE the same way we normally would, with one small change: rather than taking the straight average of the ratios of major league to minor league performance, we would compute a weighted average, with weights proportional to the similarity scores.

The most difficult part of this process would be calculating the similarity scores. While it would be easy to use the simple similarity scores that Baseball-Reference publishes for all players, those would be an inappropriate measure of similarity for MLE construction. Rather, we want to key in on the variables that matter for successfully changing levels. To find these properties, it would be best to run a regression or other machine-learning technique to determine which variables most influence how players adjust to new leagues. These factors would differ for each statistic we adjust, and could range from biographical information like height to performance statistics like isolated power (ISO). We can calculate the differences between these values, weight them into a similarity score, and use that score in our weighted MLE, producing a personalized MLE for each player that minimizes our forecast errors.

4. No Error Estimate

Next, we address how to measure the uncertainty surrounding our MLE estimates. To make any inferences about the MLE sampling distribution, we must generate an estimate of that distribution. To do so, we will use a technique called bootstrapping.
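Bootstrapping, described in detail below, amounts to resampling the observed player ratios with replacement and summarizing the spread of the resampled means. A minimal sketch, with an arbitrary iteration count and fixed seed for reproducibility:

```python
import random
import statistics

def bootstrap_mle(ratios, n_boot=10_000, seed=0):
    """Bootstrap an MLE: resample the player MLB/MiLB ratios with
    replacement, average each resample, and return the mean of the
    resampling distribution (the MLE) and its standard deviation
    (the standard error of the MLE)."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(ratios, k=len(ratios)))
             for _ in range(n_boot)]
    return statistics.fmean(means), statistics.stdev(means)
```

Reporting the pair (MLE, standard error) is enough to reconstruct the whole distribution under the normality assumption discussed below.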
To create the MLE sampling distribution, we repeatedly take the average of a random sample (drawn with replacement) of players' ratios of major league to minor league performance. This gives us many averages of these resamples, which we treat as the resampling distribution of the MLE: it represents how likely we would be to draw each value of the MLE if we chose one at random. While we could report this distribution directly, it would be difficult to incorporate into the forecasting process without making some assumptions. We assume the resampling distribution is normally distributed, which is a good assumption because of the central limit theorem (which states that if we take enough independent samples from any probability distribution, the average of the samples will follow a bell curve centered on the true mean). Therefore, we can take the standard deviation of the resampling distribution as the standard error of the MLE estimate. We can then use this standard error both to understand the uncertainty of level-adjusted statistics and to weight them accordingly. Because of the normality assumption, the expected value (i.e., the MLE) and the standard error together give us everything we need to recreate the MLE distribution.

5. A Better Look at Chaining

Chaining presents an issue we have found very difficult to solve. The biased data make it almost impossible to compare the two methods (chaining or not) objectively. With that said, we believe it is important to be able to say confidently which method we prefer, or that either works just fine. Therefore, we propose a Bayesian framework to test the utility of each method. Rather than testing the MLEs in a simple forecast and choosing the model with the smaller error, we suggest comparing the errors while holding a prior belief that chaining is the better method.
This Bayesian procedure allows a comparison of the accuracy of the two methods with the understanding that the test data are biased in favor of not chaining. This is admittedly an imperfect solution and an area for further brainstorming.

6. Experience at Level

Incorporating experience at level will prove to be a difficult task. To make it happen, we must quantify the effect of playing another game at the same level. While it would be easy to assume the effect is linear (i.e., the performance boost from game one to game two is identical to the boost from game 1,000 to game 1,001), we cannot, because we expect diminishing marginal returns: a player gains a larger boost from early games at a level than from later ones. What is needed is a slightly more complicated model that handles diminishing marginal returns well, and we believe the best choice is logarithmic. Therefore, we want to estimate the effect that log(games played at level) has on a player's performance, fitting separate models for each statistic.

While we would like to solve this with a simple linear regression, a sampling bias gets in the way: players who play more games at a certain level are probably performing worse than players who play fewer games. To counteract this bias, we assume that, on average, an extra game affects all players identically. We understand this may not be a good assumption, as we treat the analogous assumption as a flaw in current MLEs, but we do not have a better solution at this time. Therefore, if we take the subset of players who have played at least n games, we can run a linear regression that uses log(games played at level) to predict player performance at that level.
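The log-experience regression just described can be sketched as a simple least-squares fit of performance against log(games at level). The data used here would be the per-statistic minor league samples described above; the fitted slope then gives the experience boost to strip out:

```python
import math

def fit_log_experience(games, performance):
    """Fit performance = intercept + slope * log(games at level) by
    ordinary least squares; returns (intercept, slope)."""
    xs = [math.log(g) for g in games]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(performance) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, performance))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def experience_boost(games, slope):
    """Estimated performance boost from experience, relative to a
    player's first game at the level (log(1) = 0)."""
    return slope * math.log(games)
```

Note the diminishing returns built into the log: the boost from game 1 to game 10 equals the boost from game 10 to game 100.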
Then, we can adjust the performance statistics that go into the MLE model for the effect that experience at the level has on each player.

Summing Up

MLEs can be improved upon, and here we have a blueprint for doing so. Some of these changes may not help the process, but we believe they are all worth testing to see whether they lead to measurable improvements in forecast accuracy. Improving MLEs will be a complicated process, but we believe the framework here will reap benefits for those who have the time to act upon it.

References & Resources

Tom Tango, Tangotiger, "Issues with MLEs: Why I hate how they are used"
Brian Cartwright, Baseball Prospectus, "Prospectus Idol Entry: Brian Cartwright's Initial Entry"
Brian Cartwright, The Hardball Times, "Oliver, smarter than your average monkey"