Predicting the 2018 Cy Young Race with Machine Learning

Max Scherzer, not Jacob deGrom, is the predicted NL Cy Young winner in this model. (via Keith Allison)

After predicting the MVP award races a few weeks ago, I have turned my attention to the Cy Young races. Who’s likely to win? What factors do voters use to decide on the award? Who won’t win but had a season worthy of consideration?

The American League finalists each have strong credentials. Justin Verlander had a traditionally dominant season, leading the league in strikeouts and pitching nearly as many innings as the leader, Corey Kluber. Blake Snell won 21 games for a surprise Tampa Bay Rays team and finished with an eye-catching 1.89 ERA. Kluber had a sensational season himself, winning 20 games and walking just four percent of the batters he faced.

The National League also features three excellent candidates. Jacob deGrom finished with the second-best FIP and fourth-best ERA of any qualified starter in the past 30 years. He, or rather his teammates, also made headlines for not scoring many runs while he was on the mound. Max Scherzer turned in another incredible season in what is likely a Hall of Fame career. Gunning for his third straight Cy Young award and fourth overall, the Nationals righty struck out 300 batters and led the NL in innings. Rounding out the top three is the Phillies’ Aaron Nola, coming off a breakout season.

Building the Model

To predict the winners, I once again turned to the xgboost algorithm supported by the caret package. I took data from starters who pitched at least 175 innings (so I could analyze Snell), transformed most of the raw data into league rankings, and used 10-fold cross-validation across a wide range of tuning parameters to find the model that produced the best results. As I did when predicting MVP winners, I evaluated each model using precision, recall, F1, and area under the precision-recall curve. These metrics reflect the rarity of award winners in a large data set. I also subjectively looked at the predictions to ensure they weren’t off base.
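The actual pipeline was built in R with caret and xgboost, which isn’t shown here. As a rough illustration of the 10-fold cross-validation step only, here is a minimal pure-Python sketch; the function names and fold-assignment details are mine, not the author’s code:

```python
import random

def ten_fold_indices(n, seed=42):
    """Shuffle row indices and split them into 10 roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def cross_validate(rows, labels, fit, score):
    """Train on 9 folds, score on the held-out fold, and average the scores."""
    folds = ten_fold_indices(len(rows))
    scores = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        scores.append(score(model, [rows[i] for i in held_out],
                            [labels[i] for i in held_out]))
    return sum(scores) / len(scores)
```

In the article’s setup, `fit` would train an xgboost model on the ranked features at one set of tuning parameters, and `score` would compute a metric such as area under the precision-recall curve; the parameter set with the best averaged score wins.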

I gave the following data points to the xgboost algorithm:

  • Season: to understand whether voting trends change over time
  • League: to understand whether voters treat AL and NL pitchers differently
  • Team Winning Percentage: to understand whether voters care about how well the player’s team performed
  • Innings Pitched
  • The player’s league ranking in the following stats:
    • Wins
    • Losses (ranked so that fewer losses earn a better ranking)
    • WAR (the FIP-based variety)
    • ERA
    • FIP
    • Strikeouts
    • Walks allowed
    • HR allowed
    • K%
    • BB%
    • K-BB%
    • HR/9

To capture current voting trends, I looked at data from the 2007 through ’17 seasons. I ignored relievers, whom I defined as pitchers who started fewer than 50 percent of the games in which they appeared. Just as starting pitchers rarely win an MVP award, relievers win a Cy Young award so rarely that including them isn’t worth the effort.
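To make the rank transformation concrete, here is a minimal Python sketch (hypothetical data and function name; the actual transformation was done in R). Each stat is ranked within a season-league group, with rank 1 being best:

```python
def league_ranks(players, stat, lower_is_better=True):
    """Rank players within each (season, league) group; rank 1 is best.

    `players` is a list of dicts with at least 'name', 'season', 'league',
    and the stat of interest. Returns {player name: rank}.
    """
    groups = {}
    for p in players:
        groups.setdefault((p["season"], p["league"]), []).append(p)
    ranks = {}
    for group in groups.values():
        ordered = sorted(group, key=lambda p: p[stat],
                         reverse=not lower_is_better)
        for rank, p in enumerate(ordered, start=1):
            ranks[p["name"]] = rank
    return ranks

pitchers = [
    {"name": "A", "season": 2017, "league": "AL", "era": 2.25},
    {"name": "B", "season": 2017, "league": "AL", "era": 3.10},
    {"name": "C", "season": 2017, "league": "NL", "era": 2.51},
]
print(league_ranks(pitchers, "era"))  # {'A': 1, 'B': 2, 'C': 1}
```

Stats like ERA, FIP, and losses would use `lower_is_better=True` (fewer is better), while wins, strikeouts, and WAR would flip that flag.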

Here are five random rows from the training data set, with players’ names added for clarity.

Cy Young Predictor Training Data
Player Season League IP Win Rank Loss Rank ERA Rank FIP Rank K Rank BB Rank HR Rank fWAR Rank K% Rank BB% Rank K-BB% Rank HR/9 Rank Team Winning %
Justin Verlander 2010 AL 224.1 4 7 9 2 4 21 3 1 4 16 5 3 0.500
Mike Minor 2013 NL 204.2 15 10 14 16 15 9 28 14 14 10 10 28 0.593
Danny Duffy 2016 AL 179.2 13 2 12 13 9 3 19 19 5 6 4 24 0.500
John Lackey 2013 AL 189.1 26 25 18 21 19 3 24 23 14 5 8 27 0.599
Kevin Gausman 2017 AL 186.2 12 11 12 12 9 16 12 11 8 16 9 13 0.463

Note that players who changed teams midseason were not considered.

Evaluating the Model

The winning model produced the following confusion matrix.

Cy Young Predictor Confusion Matrix
                            Is Cy Young   Is Not Cy Young
Is Predicted Cy Young                21                 1
Is Not Predicted Cy Young             1               662

These results are good! The model identified 21 of 22 winners and 662 of 663 non-winners. That means the precision, recall, and F1 scores are all roughly 0.95 out of a possible 1.00. The following graph shows the precision-recall curve and the area under it; a higher area is better.
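The headline numbers follow directly from the confusion matrix. As a quick sanity check, written here in Python for illustration, with the cell values copied from the matrix above:

```python
# Cells from the confusion matrix above.
tp, fp = 21, 1   # predicted winners: correct, incorrect
fn, tn = 1, 662  # predicted non-winners: missed winner, correct

precision = tp / (tp + fp)  # share of predicted winners who actually won
recall = tp / (tp + fn)     # share of actual winners the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.95 0.95 0.95
```

Because exactly one winner was missed and exactly one non-winner was flagged, precision and recall come out identical (21/22), and F1 matches them.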

These results are much better than those of the MVP predictor model.

In terms of accuracy, the model missed only one award winner.

Predicted and Actual Cy Young Winners, 2007–2017
Season League Predicted Winner Actual Winner Accurate?
2007 AL CC Sabathia CC Sabathia Yes
NL Jake Peavy Jake Peavy Yes
2008 AL Cliff Lee Cliff Lee Yes
NL Tim Lincecum Tim Lincecum Yes
2009 AL Justin Verlander Zack Greinke No
NL Tim Lincecum Tim Lincecum Yes
2010 AL Felix Hernandez Felix Hernandez Yes
NL Roy Halladay Roy Halladay Yes
2011 AL Justin Verlander Justin Verlander Yes
NL Clayton Kershaw Clayton Kershaw Yes
2012 AL David Price David Price Yes
NL R.A. Dickey R.A. Dickey Yes
2013 AL Max Scherzer Max Scherzer Yes
NL Clayton Kershaw Clayton Kershaw Yes
2014 AL Corey Kluber Corey Kluber Yes
NL Clayton Kershaw Clayton Kershaw Yes
2015 AL Dallas Keuchel Dallas Keuchel Yes
NL Jake Arrieta Jake Arrieta Yes
2016 AL Rick Porcello Rick Porcello Yes
NL Max Scherzer Max Scherzer Yes
2017 AL Corey Kluber Corey Kluber Yes
NL Max Scherzer Max Scherzer Yes

Verlander finished third in 2009, so his predicted win here isn’t out of line with reality.

What Matters in Cy Young Voting?

According to the model, the following factors are important in Cy Young voting:

Whether they know it or not, Cy Young voters prefer pitchers who excel in the FIP-based WAR we house here at FanGraphs. Now, I don’t think voters sort our leaderboards by WAR and fill out their ballots accordingly. Rather, WAR neatly bundles the aspects of pitching voters care about: striking out batters, limiting walks, and keeping the ball in the yard, all while throwing a lot of innings.

Pitchers with a lot of wins also do well in Cy Young voting. In other news, water is wet, but it’s always nice to see a model agree with reality. The importance of wins bodes well for Snell and Kluber, who each won 20-plus games. Good old ERA is also important, which elevates Snell’s case, as are innings pitched and strikeouts. Many of these features are correlated with one another, but tree-based methods like xgboost tolerate that redundancy well, which is one reason the algorithm is a good choice for this task.
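Importance charts of this kind typically report relative values, rescaling each feature’s raw gain so the most important feature reads 100. A small sketch of that rescaling (the raw gain numbers below are made up for illustration; the real ones would come out of the fitted xgboost model):

```python
def scale_importance(raw):
    """Rescale raw importance scores so the largest value becomes 100."""
    top = max(raw.values())
    return {feature: round(100 * gain / top, 1) for feature, gain in raw.items()}

# Hypothetical raw gain values, for illustration only.
raw_gain = {"fWAR rank": 0.50, "Win rank": 0.30, "ERA rank": 0.20, "IP": 0.10}
print(scale_importance(raw_gain))
# {'fWAR rank': 100.0, 'Win rank': 60.0, 'ERA rank': 40.0, 'IP': 20.0}
```

So a feature scored at 100 is the one the model leans on most, not one that predicts winners 100 percent of the time.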

Your (Predicted) 2018 Award Winners

Recall that although the Cy Young award is a winner-take-all competition, the xgboost algorithm doesn’t know that. It only estimates the probability of an individual player-season taking home the award, which is why the probabilities within each league don’t add up to 100 percent. Also, the lists below are not intended to indicate predicted placement in the final results.

With that said, the model predicts Snell and Scherzer will take home the hardware:

The results seem reasonable. Snell’s 180.2 innings would be the fewest ever for a Cy Young winner, and his fWAR ranks seventh in the sample. But he ranks No. 1 in ERA, wins, and losses and fifth in strikeouts. He also ranks second (best) in home runs allowed. Unbeknownst to my model, Snell also has the power of narrative behind him. He’s a hot young prospect whose breakout year helped push a team that wasn’t supposed to contend to a 90-win season.

Verlander ranks first in WAR and strikeouts but third in ERA and sixth in wins. He also surrendered a lot of dingers, which helps put his chances behind those of rotation mate Gerrit Cole and a breakout Trevor Bauer. And while we know Kluber, as a finalist, will finish no worse than third, he didn’t do enough in the model’s opinion to outshine either Snell or Verlander. I think he’ll place third in real life.

In the NL, my model thinks deGrom’s stellar season is no match for Scherzer’s strikeout and win ranks. deGrom’s loss total also hurts his chances. Scherzer is also nipping at deGrom’s heels in the WAR, ERA, and FIP rankings. Meanwhile, like Kluber in the AL, Nola had a great season but falls well behind the other two.

That said, I think deGrom will win. His historically low ERA and FIP give voters who might otherwise prefer more wins and fewer losses a strong reason to vote for him. I’m also placing faith in voters’ abilities to recognize the team-centric nature of those two stats.

Finally, like Snell, deGrom has a good story behind him. He’s the guy who elevated the Mets above joke status this year, the guy we always knew was great but finally had a tremendous year that got everyone’s attention. It would be his first award, while Scherzer has already proven himself with three. However, the vote could be closer than some people seem to think.

I’m pleased with this model’s results. Predicting 21 of 22 Cy Young winners over the past 11 seasons gives me a good deal of confidence in its 2018 predictions. We’ll know more when the results are announced soon.

Ryan enjoys characterizing that elusive line between luck and skill in baseball. For more, subscribe to his articles and follow him on Twitter.
5 years ago

It’s a bit odd to say you love your model results and then also think they will be wrong this year. I’m wondering if you have tried to use values other than straight ranks? DeGrom is considered a favourite this year because his ERA is way better than the next guy’s. If he had a 2.30 ERA, he’d still rank first in ERA (and possibly fWAR), but I don’t think he’d have a chance at winning the Cy Young given the same record. A z-score type metric might work better for ERA? Or it might not, and rank would actually perform better in the model; just curious.

5 years ago

How accurate would the model have been if you used the actual values of these statistics, rather than their rankings? Obviously, this model came out very accurate, and I’m curious how different it would have turned out.

5 years ago

How can fWAR be 100% when just last year Chris Sale led Kluber by 0.4, yet didn’t win? Or 2016, when BOTH Cy Young winners didn’t lead in fWAR. Or 2015 with both. Or 2012 with both. Or 2011 with Halladay leading Kershaw. So just in the last 7 years, the fWAR leader in the league hasn’t won the Cy Young in 8 of the 14 situations. After tonight, it’ll probably be 9 of 16. Definitely no 100%.

5 years ago
Reply to  stever20

“importance” is a relative measure for predictive models. The software Mr. Pollack is using apparently sets the most “important” variable to 100 – I’ve used software where the top value gets set to 1, or where the importance total has to sum to 1. It’s not saying it’s 100% predictive, just that the model would lose the most without it.

5 years ago

Does your list of potential winners only include “qualified” pitchers? Because Chris Sale finished second in fWAR, second in ERA, and first in K-rate among starters (sixth if you count all AL pitchers with at least 10 IP).

He’s not one of the finalists, so he certainly is going to remain the best pitcher in baseball without a Cy Young, but your chart of predicted winners shows more than just the finalists.

5 years ago
Reply to  MikeS

The article indicates that he set a minimum of 175 IP.

John Edwards
5 years ago

I think you’re overfitting your model and double-counting a lot of variables. fWAR is effectively FIP & IP combined, but you’ve included both in your model – don’t you think that would skew the result unfairly given that we’re taking into account both of those features, then incorporating another feature that is the product of those two features? You shouldn’t be just throwing features at a machine learning model to see what sticks; that’s how you get an overfitted model with poor predictive power.

Also, I don’t think that using pitchers’ rankings is effective, as it fails to distill differences between pitchers’ skills. As others have pointed out, the model will treat a first-place 1.50 ERA versus a second-place 3.00 ERA the exact same as a first-place 2.50 ERA versus a second-place 2.75 ERA, when the difference is far more substantial, especially with regards to voting.

David Ducksworth
5 years ago
Reply to  John Edwards

While I generally agree with incorporating values instead of solely rankings, Park Factor would be the argument behind including both fWAR and FIP+IP.

John Edwards
5 years ago

Park Factors can only go so far. FIP and IP are the majority factors in fWAR (well, technically it’s ifFIP, but the point stands).

Richard Bergstrom
5 years ago

Poor Freeland…