Predicting the 2018 Cy Young Race with Machine Learning
After predicting the MVP award races a few weeks ago, I have turned my attention to the Cy Young races. Who’s likely to win? What factors do voters use to decide on the award? Who won’t win but had a season worthy of consideration?
The American League finalists each have strong credentials. Justin Verlander had a traditionally dominant season, leading the league in strikeouts and pitching nearly as many innings as the leader, Corey Kluber. Blake Snell won 21 games for a surprise Tampa Bay Rays team and finished with an eye-catching 1.89 ERA. Kluber had a sensational season himself, winning 20 games and walking just four percent of the batters he faced.
The National League also features three excellent candidates. Jacob deGrom finished with the second-best FIP and fourth-best ERA of any qualified starter in the past 30 years. He, or rather his teammates, also made headlines for not scoring many runs while he was on the mound. Max Scherzer turned in another incredible season in what likely is a Hall of Fame career. Gunning for his third straight Cy Young award and fourth overall, the Nationals righty struck out 300 batters and led the NL in innings. Rounding out the top three candidates is a breakout season from the Phillies’ Aaron Nola.
Building the Model
To predict the winners, I once again turned to the xgboost algorithm supported by the caret package. I took data from starters who pitched at least 175 innings (so I could analyze Snell), transformed most of the raw data into league rankings, and used 10-fold cross-validation across a wide range of tuning parameters to find the model that produced the best results. As I did when predicting MVP winners, I evaluated each model using precision, recall, F1, and area under the precision-recall curve. These metrics reflect the rarity of award winners in a large data set. I also subjectively looked at the predictions to ensure they weren’t off base.
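For the curious, here’s a minimal sketch of what that training setup looks like in R with caret. The data frame `cy_data`, its `won_cy` outcome column, and the "yes"/"no" labels are illustrative placeholders, not my actual code.

```r
library(caret)

# Minimal sketch of the training setup. "cy_data" is a hypothetical data frame
# with one row per qualifying starter-season, the features listed below, and a
# two-level factor won_cy ("yes"/"no") marking the actual Cy Young winners.
set.seed(2018)

ctrl <- trainControl(
  method = "cv",               # 10-fold cross-validation
  number = 10,
  classProbs = TRUE,           # keep class probabilities for the PR metrics
  summaryFunction = prSummary  # precision, recall, F, area under the PR curve
)

fit <- train(
  won_cy ~ .,
  data = cy_data,
  method = "xgbTree",          # xgboost, via caret
  metric = "AUC",              # prSummary's AUC is the area under the PR curve
  trControl = ctrl,
  tuneLength = 5               # search a grid of xgboost tuning parameters
)
```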
I gave the following data points to the xgboost algorithm:
- Season: to understand whether voting trends change over time
- League: to understand whether voters treat AL and NL pitchers differently
- Team Winning Percentage: to understand whether voters care about how well the player’s team performed
- Innings Pitched
- The player’s league ranking in the following stats:
  - Wins
  - Losses (where a higher ranking indicates fewer losses)
  - WAR (the FIP-based variety)
  - ERA
  - FIP
  - Strikeouts
  - Walks allowed
  - HR allowed
  - K%
  - BB%
  - K-BB%
  - HR/9
To capture current voting trends, I looked at data from the 2007 through ’17 seasons. I ignored relievers, whom I defined as pitchers who started fewer than 50 percent of the games they appeared in. Like starting pitchers winning an MVP award, relievers win a Cy Young award so rarely that including them isn’t worth the effort.
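In code, that filtering and ranking step looks roughly like this. The raw data frame `pitcher_seasons` and its column names are illustrative, and only a few of the ranked stats are shown:

```r
library(dplyr)

# "pitcher_seasons" is a hypothetical data frame of raw pitcher-season stats.
cy_data <- pitcher_seasons %>%
  filter(Season >= 2007, Season <= 2017,
         IP >= 175,                 # qualifying starters only
         GS / G >= 0.5) %>%         # drop relievers (started < 50% of games)
  group_by(Season, League) %>%      # rank within each league-season
  mutate(
    era_rank  = rank(ERA, ties.method = "min"),  # lower ERA = better rank
    fip_rank  = rank(FIP, ties.method = "min"),
    win_rank  = rank(-W,  ties.method = "min"),  # more wins = better rank
    loss_rank = rank(L,   ties.method = "min"),  # fewer losses = better rank
    k_rank    = rank(-SO, ties.method = "min")
  ) %>%
  ungroup()
```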
Here are five random rows from the training data set, with players’ names added for clarity.
Player | Season | League | IP | Win Rank | Loss Rank | ERA Rank | FIP Rank | K Rank | BB Rank | HR Rank | fWAR Rank | K% Rank | BB% Rank | K-BB% Rank | HR/9 Rank | Team Winning % |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Justin Verlander | 2010 | AL | 224.1 | 4 | 7 | 9 | 2 | 4 | 21 | 3 | 1 | 4 | 16 | 5 | 3 | 0.500 |
Mike Minor | 2013 | NL | 204.2 | 15 | 10 | 14 | 16 | 15 | 9 | 28 | 14 | 14 | 10 | 10 | 28 | 0.593 |
Danny Duffy | 2016 | AL | 179.2 | 13 | 2 | 12 | 13 | 9 | 3 | 19 | 19 | 5 | 6 | 4 | 24 | 0.500 |
John Lackey | 2013 | AL | 189.1 | 26 | 25 | 18 | 21 | 19 | 3 | 24 | 23 | 14 | 5 | 8 | 27 | 0.599 |
Kevin Gausman | 2017 | AL | 186.2 | 12 | 11 | 12 | 12 | 9 | 16 | 12 | 11 | 8 | 16 | 9 | 13 | 0.463 |
Note that players who changed teams midseason were not considered.
Evaluating the Model
The winning model produced the following confusion matrix.
Prediction | Is Cy Young | Is Not Cy Young |
---|---|---|
Is Predicted Cy Young | 21 | 1 |
Is Not Predicted Cy Young | 1 | 662 |
These results are good! The model identified 21 of 22 winners and 662 of 663 non-winners. That means the precision, recall, and F1 scores are all 0.95 out of a theoretically possible 1.00. The following graph shows the precision-recall curve and the area under it, with a higher area being better.
These results are much better than those of the MVP predictor model.
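For reference, those headline numbers fall straight out of the confusion matrix above:

```r
# Cell counts from the confusion matrix above
tp <- 21  # winners correctly predicted
fp <- 1   # non-winner predicted to win
fn <- 1   # winner the model missed

precision <- tp / (tp + fp)                                 # 21/22 ~ 0.95
recall    <- tp / (tp + fn)                                 # 21/22 ~ 0.95
f1        <- 2 * precision * recall / (precision + recall)  # ~ 0.95
```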
In terms of accuracy, the model missed only one award winner.
Season | League | Predicted Winner | Actual Winner | Accurate? |
---|---|---|---|---|
2007 | AL | CC Sabathia | CC Sabathia | Yes |
2007 | NL | Jake Peavy | Jake Peavy | Yes |
2008 | AL | Cliff Lee | Cliff Lee | Yes |
2008 | NL | Tim Lincecum | Tim Lincecum | Yes |
2009 | AL | Justin Verlander | Zack Greinke | No |
2009 | NL | Tim Lincecum | Tim Lincecum | Yes |
2010 | AL | Felix Hernandez | Felix Hernandez | Yes |
2010 | NL | Roy Halladay | Roy Halladay | Yes |
2011 | AL | Justin Verlander | Justin Verlander | Yes |
2011 | NL | Clayton Kershaw | Clayton Kershaw | Yes |
2012 | AL | David Price | David Price | Yes |
2012 | NL | R.A. Dickey | R.A. Dickey | Yes |
2013 | AL | Max Scherzer | Max Scherzer | Yes |
2013 | NL | Clayton Kershaw | Clayton Kershaw | Yes |
2014 | AL | Corey Kluber | Corey Kluber | Yes |
2014 | NL | Clayton Kershaw | Clayton Kershaw | Yes |
2015 | AL | Dallas Keuchel | Dallas Keuchel | Yes |
2015 | NL | Jake Arrieta | Jake Arrieta | Yes |
2016 | AL | Rick Porcello | Rick Porcello | Yes |
2016 | NL | Max Scherzer | Max Scherzer | Yes |
2017 | AL | Corey Kluber | Corey Kluber | Yes |
2017 | NL | Max Scherzer | Max Scherzer | Yes |
Verlander finished third in 2009, so his predicted win here isn’t out of line with reality.
What Matters in Cy Young Voting?
According to the model, the following factors are important in Cy Young voting:
Whether they know it or not, Cy Young voters prefer pitchers who excel in the FIP-based WAR we house here at FanGraphs. Now, I don’t think voters sort our leaderboards by WAR and fill out their ballots accordingly. Rather, WAR neatly comprises the aspects of pitching voters care about: striking out batters, limiting walks, and keeping the ball in the yard while throwing a lot of innings.
Pitchers with a lot of wins also do well in Cy Young voting. In other news, water is wet, but it’s always nice to see a model agree with reality. The importance of wins bodes well for Snell and Kluber, who each won 20-plus games. Good old ERA is also important, which elevates Snell’s case, as are innings pitched and strikeouts. Many of these features are correlated with one another, but gradient-boosted trees like xgboost handle correlated predictors gracefully, which is part of why the algorithm is a good choice for this task.
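If you want to pull an importance chart out of a caret model like the one sketched earlier, `varImp` does the work; note that caret scales the most important variable to 100 by default (`fit` here is the hypothetical model object from the training sketch above):

```r
# Relative importance from the fitted caret/xgboost model.
imp <- varImp(fit, scale = TRUE)  # scale = TRUE puts the top variable at 100
plot(imp, top = 10)               # chart the ten most important features
```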
Your (Predicted) 2018 Award Winners
Recall that although the Cy Young award is a winner-take-all competition, the xgboost algorithm doesn’t know that. It only gives the probability of an individual player-season taking home the award. This means the probabilities within each league don’t add up to 100 percent. Also, the lists below are not intended to indicate predicted placement in the final results.
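Concretely, each 2018 pitcher-season gets scored on its own, which is why the probabilities are independent of one another. In the sketch below, `candidates_2018` is an illustrative data frame of this year’s qualifying starters run through the same hypothetical model as above:

```r
library(dplyr)

# Each player-season is scored independently, so the "yes" probabilities
# within a league have no reason to sum to 1.
probs_2018 <- predict(fit, newdata = candidates_2018, type = "prob")

candidates_2018 %>%
  mutate(cy_prob = probs_2018$yes) %>%   # modeled chance of winning the award
  group_by(League) %>%
  arrange(desc(cy_prob), .by_group = TRUE) %>%
  select(Player, League, cy_prob)
```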
With that said, the model predicts Snell and Scherzer will take home the hardware:
The results seem reasonable. Snell’s 180.2 innings would be the fewest ever for a Cy Young-winning starter in a full season, and his fWAR ranks just seventh in the sample. But he ranks No. 1 in ERA, wins, and losses, fifth in strikeouts, and second-fewest in home runs allowed. Unbeknownst to my model, Snell also has the power of narrative behind him. He’s the hot young former prospect whose breakout year helped push a team that wasn’t supposed to contend to a 90-win season.
Verlander ranks first in WAR and strikeouts but third in ERA and sixth in wins. He also surrendered a lot of dingers, which helps put his chances behind those of rotation mate Gerrit Cole and a breakout Trevor Bauer. And while we know Kluber, as a finalist, will finish no worse than third, he didn’t do enough in the model’s opinion to outshine either Snell or Verlander. I think he’ll place third in real life.
In the NL, my model thinks deGrom’s stellar season is no match for Scherzer’s strikeout and win ranks. deGrom’s loss total also hurts his chances. Scherzer is also nipping at deGrom’s heels in WAR, ERA and FIP rankings. Meanwhile, like Kluber in the AL, Nola had a great season but falls well behind the other two.
That said, I think deGrom will win. His historically low ERA and FIP give voters who might otherwise prefer more wins and fewer losses a strong reason to vote for him. I’m also placing faith in voters’ abilities to recognize the team-centric nature of those two stats.
Finally, like Snell, deGrom has a good story behind him. He’s the guy who elevated the Mets above joke status this year, the guy we always knew was great but finally had a tremendous year that got everyone’s attention. It would be his first award, while Scherzer has already proven himself with three. However, the vote could be closer than some people seem to think.
I’m pleased with this model’s results. Correctly predicting 21 of 22 Cy Young winners over the past 11 seasons gives me plenty of confidence in its 2018 predictions. We’ll know soon, when the results are announced.
It’s a bit odd to say you love your model results and then also think they will be wrong this year. I’m wondering if you have tried using values other than straight ranks? DeGrom is considered a favourite this year because his ERA is way better than the next guy’s. If he had a 2.30 ERA, he’d still rank first in ERA (and possibly fWAR), but I don’t think he’d have a chance at winning the Cy Young given the same record. A z-score type metric might work better for ERA? Or it might not, and rank would actually perform better in the model; just curious.
Thanks for the feedback. Your suggestion of z-scores is a great way to capture the idea that voters care about the magnitude of the differences between two players’ stats. I played around with them after reading your comment, in addition to pruning out some other meaningless variables and adjusting the algorithm slightly, and ended up with significantly improved results. deGrom’s chances improve to 50 percent while Scherzer’s shrink to 35 percent or so, making deGrom the winner in the model. Of course, I couldn’t know that when I wrote this article last weekend, but in hindsight it does appear z-scores provide some benefit. There is still some overfitting, but it’s not as bad. I’ll keep working to improve it.
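For anyone curious, the z-score version of the transformation looks something like this; as before, `pitcher_seasons` and the column names are illustrative:

```r
library(dplyr)

# Z-scores measure how far a pitcher sits from the league-season average,
# not just where he ranks, so a runaway ERA lead actually looks like one.
z_scored <- pitcher_seasons %>%
  group_by(Season, League) %>%
  mutate(
    era_z = -(ERA - mean(ERA)) / sd(ERA),  # sign flipped so higher = better
    fip_z = -(FIP - mean(FIP)) / sd(FIP),
    k_z   =  (SO  - mean(SO))  / sd(SO)
  ) %>%
  ungroup()
```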
How accurate would the model have been if you used the actual values of these statistics, rather than their rankings? Obviously, this model came out very accurate, and I’m curious how different it would have turned out.
I’m not sure, but when doing MVP predictions with this algorithm, raw values didn’t help as much as the rankings. I think rankings, or some kind of deviation from a standard value, help smooth out the fact that, say, a 3.50 ERA means something different in 2007 than it does in 2018.
How can fWAR be 100% when just last year Chris Sale led Kluber by 0.4 WAR, yet didn’t win? Or 2016, when BOTH Cy Young winners didn’t lead in fWAR? Or 2015, with both. Or 2012, with both. Or 2011, with Halladay leading Kershaw. So just in the last seven years, the fWAR leader in the league hasn’t won the Cy Young in 8 of the 14 situations. After tonight, it’ll probably be 9 of 16. Definitely not 100%.
“importance” is a relative measure for predictive models. The software Mr. Pollack is using apparently sets the most “important” variable to 100 – I’ve used software where the top value gets set to 1, or where the importance total has to sum to 1. It’s not saying it’s 100% predictive, just that the model would lose the most without it.
Does your list of potential winners only include “qualified” pitchers? Because Chris Sale finished 2nd in fWAR, 2nd in ERA, and first in K rate among starters (6th if you count all AL pitchers with at least 10 IP).
He’s not one of the finalists, so he certainly is going to remain the best pitcher in baseball without a Cy Young, but your chart of predicted winners shows more than just the finalists.
The article indicates that he set a minimum of 175 IP.
I think you’re overfitting your model and double-counting a lot of variables. fWAR is effectively FIP and IP combined, but you’ve included both in your model. Don’t you think that would skew the result unfairly, given that we’re taking into account both of those features and then incorporating another feature that is essentially the product of the two? You shouldn’t just throw features at a machine learning model to see what sticks; that’s how you get an overfitted model with poor predictive power.
Also, I don’t think that using pitchers’ rankings is effective, as it fails to capture the size of the differences between pitchers’ skills. As others have pointed out, the model will treat a first-place 1.50 ERA versus a second-place 3.00 ERA exactly the same as a first-place 2.50 ERA versus a second-place 2.75 ERA, even though the first gap is far more substantial, especially with regard to voting.
While I generally agree with incorporating values instead of solely rankings, Park Factor would be the argument behind including both fWAR and FIP+IP.
Park factors can only go so far. FIP and IP are the main inputs to fWAR (well, technically it’s ifFIP, but the point stands).
You’re right; this article and the feedback helped me understand I was confusing the algorithm with too much data. I re-ran it using only the most important stats (from the importance graph shown in this article) as z-scores, and the results were much improved. It still overfits a bit, but not as much, I don’t think.
Poor Freeland…