Predicting the 2018 MVP Winners with Machine Learning

by Ryan Pollack
October 29, 2018

Will Mookie Betts be the 2018 AL MVP? Machine learning might be able to tell us. (via Keith Allison)

The 2018 American League MVP race reminds me of 2012. That year, Mike Trout led the sport in WAR but finished second to Miguel Cabrera, who had an inferior season but won the first Triple Crown in 45 years. Cabrera's Detroit Tigers also went to the playoffs while Trout's Los Angeles Angels finished in fourth place. Cabrera's victory was quite controversial.

This year, Mookie Betts plays the role of Trout, while J.D. Martinez stands in for Cabrera. The Boston right fielder led the major leagues in WAR but finished behind his DH teammate in home runs and RBI, two of the Triple Crown stats. Meanwhile, Martinez came reasonably close to a Triple Crown himself, ranking second in average and home runs and first in RBI.

In addition to these two, the AL race again features Trout, with 9.8 WAR to Betts' 10.4; Cleveland's José Ramírez, with 8.1 WAR and a 30-30 season that easily could've been a 40-40 one; and Houston's Alex Bregman, whose 31 home runs and 7.6 WAR helped his team win a surprisingly contentious AL West.

The NL MVP race is less controversial but still interesting. Early in the season, Javier Baez of the Cubs made some noise by adding power and a touch more plate discipline to his game. After the All-Star break, Matt Carpenter hit dinger after dinger to vault the Cardinals into playoff contention and himself into the MVP race. In Colorado, Trevor Story had his anticipated breakout year, garnering MVP consideration while pushing the Rockies into a playoff spot. But the award likely will go to Milwaukee's Christian Yelich, who racked up a monster 5.4 WAR in the season's second half. The Brewers' outfielder finished two home runs and one RBI short of a Triple Crown while leading his club to a surprise NL Central title. Humans like a good story, and Yelich provided a great one.

With an interesting race in the American League and some good storylines in the National League, I wanted to predict the 2018 MVP races as objectively as possible while understanding what factors voters consider. To do this, I turned to machine learning; specifically, a popular package called xgboost. xgboost implements gradient boosting, a powerful decision-tree-based technique for predicting outcomes and classifying samples. It's an ensemble method known for resisting overfitting and handling correlated inputs, which helps when untangling how data points like OPS, home runs, and batting average influence the probability of winning an MVP award. I asked the xgboost algorithm: Given data about 2018 players' seasons, what's the probability of each player winning his league's MVP award?

Approaching the Problem

I ran many iterations of the model while playing around with lots of inputs. Does OPS matter in determining the MVP? The team a player plays for? His defensive ranking? Do voters treat players differently depending on whether they're in the American League or the National League?

I also had to choose how to model the inputs. Do raw measurements of stats like OPS or home run totals produce the best results? How about park-adjusted stats? One key insight was converting most raw stats to rankings within each league and season.
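The article doesn't show the original code, but here's a minimal sketch of that ranking step in R with dplyr. The data frame `batting` and its column names are assumptions for illustration, not the author's actual schema:

```r
library(dplyr)

# Convert raw stats to ranks within each league-season, where 1 is the best.
# `batting` is a hypothetical data frame of hitters; one row per player-season.
ranked <- batting %>%
  group_by(season, league) %>%
  mutate(
    ops_rank = min_rank(desc(ops)),   # highest OPS gets rank 1
    avg_rank = min_rank(desc(avg)),
    rbi_rank = min_rank(desc(rbi)),
    hr_rank  = min_rank(desc(hr)),
    sb_rank  = min_rank(desc(sb)),
    war_rank = min_rank(desc(war)),
    def_rank = min_rank(desc(def))
  ) %>%
  ungroup()
```

Grouping by both season and league before ranking keeps the AL and NL pools separate, which matters because each league hands out its own award.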
It seems MVP voters think in terms of "Mike Trout leads the American League in OPS" instead of "Mike Trout has a 1.088 OPS in the American League." Ranked data also better captures the competitive aspect of the MVP award and smooths out comparisons across seasons. We know a .300 batting average in 2007 means something different than a .300 batting average in 2018, but No. 1 is always No. 1.

To avoid rewarding league-leading performances in a small sample, I analyzed only players with at least 2.93 plate appearances per team game. Why this threshold? Because it's the fewest PA per team game of any MVP award winner (Willie Stargell, 1979). I didn't want to miss anyone.

I didn't analyze pitchers. With possible apologies to Jacob deGrom this year, pitcher MVP seasons are rare, and I didn't see enough value in spending the computational or thought time modeling their chances. There's no reason I can't, though; consider it part of what will be included in the next version of the model.

I classified players by their primary defensive position, defined as where they played the most innings in the field. Thus, Baez in 2018 is classified as a second baseman. If a player wasn't on the field for at least 50 percent of the games he played in, I classified him as a DH.

I looked at data from the 11 seasons prior to the year I was predicting. That is, to predict 2018's MVPs, I looked at data from 2007 through 2017. This approach captures voters' tendencies at the time of the vote.

In the end, I used the following data points about each player who met the PA/G threshold:

- Season (to understand whether voting criteria change over time)
- League (to understand whether MVP voters weight AL performance differently from NL performance)
- OPS rank (within the player's season-league, where 1 is the highest)
- AVG rank
- Team winning percentage. I used the raw percentage here because I found that ranking teams in order of their divisional finish detracted from the model's results.
- RBI rank
- HR rank
- SB rank. I know voters appreciate a speedy player; in particular, voters love players with a combination of power and speed.
- WAR rank
- Primary defensive position
- DEF rank

I examined all of these elements with respect to whether the player won the MVP award. Here are five random rows from the training data set so you can see what xgboost had to work with.

Sample Data for MVP Predictor Algorithm

| Player          | League | Season | Position | Team PCT | RBI Rank | OPS Rank | SB Rank | AVG Rank | HR Rank | WAR Rank | DEF Rank |
|-----------------|--------|--------|----------|----------|----------|----------|---------|----------|---------|----------|----------|
| Howie Kendrick  | AL     | 2013   | 2B       | 0.481    | 60       | 35       | 37      | 16       | 54      | 48       | 56       |
| Yoenis Cespedes | NL     | 2016   | LF       | 0.537    | 20       | 12       | 51      | 30       | 7       | 22       | 42       |
| Bengie Molina   | NL     | 2007   | C        | 0.438    | 40       | 72       | 83      | 54       | 45      | 68       | 39       |
| Salvador Perez  | AL     | 2016   | C        | 0.500    | 56       | 66       | 79      | 65       | 45      | 45       | 14       |
| Casey McGehee   | NL     | 2011   | 3B       | 0.593    | 33       | 69       | 70      | 70       | 46      | 68       | 11       |

Evaluating the Model

I found the best model by removing the 2011 AL and 2014 NL data (when Justin Verlander and Clayton Kershaw, respectively, won MVPs), since my dataset doesn't contain pitchers, and repeating 10-fold cross-validation five times across a range of tuning parameters for the xgboost algorithm. The caret package helped tremendously. I then set the predicted MVP as the player in each league whose season had the highest probability of winning the MVP award.

I evaluated the models using precision, recall, and the F1 score. I chose these metrics instead of sensitivity, specificity, and accuracy because the data set is wildly imbalanced: out of 1,517 player-seasons, only 20 (1.3 percent) are classified as having won the MVP.
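Again, the original training code isn't shown, but a minimal caret setup along those lines might look like the sketch below. The data frame `train_data` and the outcome factor `won_mvp` are hypothetical names:

```r
library(caret)

# Repeated 10-fold cross-validation (5 repeats), scored on the
# precision-recall summary because the classes are so imbalanced.
# prSummary requires the MLmetrics package to be installed, and the
# positive class ("yes") must be the first level of the outcome factor.
ctrl <- trainControl(
  method          = "repeatedcv",
  number          = 10,
  repeats         = 5,
  classProbs      = TRUE,
  summaryFunction = prSummary
)

set.seed(42)
fit <- train(
  won_mvp ~ season + league + position + team_pct +
    ops_rank + avg_rank + rbi_rank + hr_rank +
    sb_rank + war_rank + def_rank,
  data      = train_data,   # 2007-2017 hitters, pitcher-MVP years removed
  method    = "xgbTree",    # caret's wrapper around xgboost
  trControl = ctrl,
  metric    = "AUC",        # here: area under the precision-recall curve
  tuneLength = 5            # search a modest grid of xgboost parameters
)
```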
With so little data about what makes an MVP, precision and recall represent the model's performance better than sensitivity and specificity do. This choice of metrics also means I paid attention to the area under the precision-recall curve instead of the area under the receiver operating characteristic (ROC) curve.

The final model resulted in the following confusion matrix.

Confusion Matrix for MVP Predictor Model

|                      | Is Actual MVP | Is Not Actual MVP |
|----------------------|---------------|-------------------|
| Is Predicted MVP     | 13            | 7                 |
| Is Not Predicted MVP | 7             | 1,490             |

In plain English, of the 20 MVP winners:

- The model correctly labeled 13 winners as winners.
- The model incorrectly labeled seven winners as non-winners.

And of the 1,497 non-winners:

- The model correctly labeled 1,490 non-winners as non-winners.
- The model incorrectly labeled seven non-winners as winners.

The precision and recall here are both 0.65 (13 of 20), so the F1 score, their harmonic mean, is 0.65 as well. Could this be improved? Yes, but given the subjective nature of MVP voting, I think these results are good.

The following graph shows the precision-recall curve and the area under it. An area under this curve of 1.0 would indicate perfect precision and perfect recall. Compared to that unattainable perfection, the area under this curve is 0.745. By comparison, the area under the ROC curve (not shown here) is 0.82. The latter number is misleadingly high because of how easy it is to identify true negatives in this dataset: just predict someone won't win the MVP, and you'll be right 98.6 percent of the time.

Show Me the Winners!

Numbers are great, but looking at names is even more fun. The following table shows the 13 correct predictions and the seven incorrect ones. (The 2011 AL and 2014 NL races are absent because pitchers won those awards.)

Predicted and Actual MVPs, 2007-2017

| Year | League | Predicted MVP    | Actual MVP        | Accurate Prediction? |
|------|--------|------------------|-------------------|----------------------|
| 2007 | AL     | Alex Rodriguez   | Alex Rodriguez    | Yes |
| 2007 | NL     | Matt Holliday    | Jimmy Rollins     | No  |
| 2008 | AL     | Alex Rodriguez   | Dustin Pedroia    | No  |
| 2008 | NL     | Albert Pujols    | Albert Pujols     | Yes |
| 2009 | AL     | Joe Mauer        | Joe Mauer         | Yes |
| 2009 | NL     | Albert Pujols    | Albert Pujols     | Yes |
| 2010 | AL     | Josh Hamilton    | Josh Hamilton     | Yes |
| 2010 | NL     | Joey Votto       | Joey Votto        | Yes |
| 2011 | NL     | Ryan Braun       | Ryan Braun        | Yes |
| 2012 | AL     | Miguel Cabrera   | Miguel Cabrera    | Yes |
| 2012 | NL     | Joey Votto       | Buster Posey      | No  |
| 2013 | AL     | Miguel Cabrera   | Miguel Cabrera    | Yes |
| 2013 | NL     | Paul Goldschmidt | Andrew McCutchen  | No  |
| 2014 | AL     | Mike Trout       | Mike Trout        | Yes |
| 2015 | AL     | Josh Donaldson   | Josh Donaldson    | Yes |
| 2015 | NL     | Bryce Harper     | Bryce Harper      | Yes |
| 2016 | AL     | David Ortiz      | Mike Trout        | No  |
| 2016 | NL     | Daniel Murphy    | Kris Bryant       | No  |
| 2017 | AL     | Jose Altuve      | Jose Altuve       | Yes |
| 2017 | NL     | Charlie Blackmon | Giancarlo Stanton | No  |

Most of the players the model incorrectly predicted as winners placed in the top 10 of their league's voting that year. The one exception is Votto in 2012, who placed 14th. (Behind Jay Bruce? Really?!!??!)

Who Wins in 2018?

The true test of any model is how it performs on data it hasn't seen before. We don't know the 2018 voting results yet, but the model's predictions are reasonable. The following graphs show the players in each league with the ten highest probabilities of winning the MVP award.

In the AL, seasons like the ones belonging to Betts and Martinez have high probabilities of winning the award. Betts had an historic all-around year and would be an excellent choice for the award. But don't forget about Martinez, whose home run and RBI totals will get him more consideration than many might think. Meanwhile, Ramírez will get some stray votes for his outstanding season, and Trout continues to make voters look past the "playoff team" criterion. Bregman's presence on the list shows that chants of "MVP! MVP!" during his playoff at-bats were not undeserved. It would not surprise me if these five finished ranked 1-5 in the order listed above.
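Behind these charts is nothing more than the model's predicted probabilities. A sketch of how they might be generated, assuming the hypothetical `fit` object from the earlier tuning sketch and a `data_2018` table built with the same ranking transformation:

```r
library(dplyr)

# Predict each 2018 player's probability of winning his league's MVP,
# then keep the ten most probable names per league. Note these are
# per-player classification probabilities, not a distribution that
# sums to one within each league.
probs <- predict(fit, newdata = data_2018, type = "prob")

data_2018 %>%
  mutate(p_mvp = probs$yes) %>%   # "yes" = the won-MVP class
  group_by(league) %>%
  slice_max(p_mvp, n = 10) %>%
  arrange(league, desc(p_mvp))
```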
Yelich dominates the NL field, and no one else is close. Arenado and Story will get some support on the backs of strong seasons that pushed the Rockies into the NL Wild Card game. The model thinks the same of Baez, and Max Muncy's presence reminds us how strong his season was even though he didn't qualify for the batting title. Carpenter is nowhere to be found; I suspect he would be if the Cardinals had finished higher in the NL Central.

Speaking of Arenado and Story, this is a good time to note that I found no "team bias" effect when modeling. Voters didn't punish Rockies players for playing at altitude, nor did voters reward Red Sox and Yankees players with an "East Coast bias."

Aside from Trout and Juan Soto, all the names above come from playoff teams. While the MVP ballot states the award winner "need not come from a division winner or other playoff qualifier," voters clearly have difficulty voting for players on just-okay or losing teams.

Speaking of winning teams, the following graph shows what the model thinks are the most important criteria in determining the MVP winner. (A sketch for extracting these importances from the fitted model appears at the end of this piece.)

Today's MVPs must hit well, as measured by OPS and batting average. Voters clearly want their MVPs to come from winning teams. And despite RBI falling out of favor as a metric for evaluating hitters, MVP voters still prefer players who drive in runs. This fact hurts Trout, who ranked 24th in the AL in RBI among the players examined. Consider also that Martinez ranks first in RBI, whereas Betts ranks 22nd.

Further down the list, home runs and stolen-base prowess come into play. It seems voters do love a good power-speed story. The "Season" factor suggests voting criteria change over time. Finally, voters give a bit of consideration to the league the players are in, as well as whether they're catchers or designated hitters. The remaining factors had little to no influence on voters' minds, at least not in this data set. I was surprised to see defensive position wasn't more of a factor in the voting. Perhaps for this reason, and because defensive ranking carries so little weight, WAR by itself doesn't mean as much to voters as pure hitting talent on a good team.

Despite some limitations, this model is a good first step toward identifying MVP-caliber seasons and understanding the criteria voters use to decide on the award. In future versions, I'd like to incorporate pitchers, implement techniques to better handle the imbalanced training data, and identify more relevant metrics for position players. I'd also like to separate OPS into OBP and SLG to see which is more important. Finally, instead of predicting a pure winner, I would like to predict MVP voting placement. These changes should result in a more usable and interesting model.
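As promised above, the importance ranking behind the criteria graph comes straight out of the fitted model. A minimal sketch, again assuming the hypothetical `fit` object from the tuning step:

```r
library(caret)

# Relative importance of each input to the final xgboost model.
# caret's varImp() rescales xgboost's gain-based scores to 0-100.
imp <- varImp(fit)
print(imp)           # text listing of scaled importances
plot(imp, top = 10)  # dot plot of the ten most important inputs
```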