Predicting the 2018 Cy Young Race with Machine Learning

Max Scherzer, not Jacob deGrom, is the predicted NL Cy Young winner in this model. (via Keith Allison)

After predicting the MVP award races a few weeks ago, I have turned my attention to the Cy Young races. Who’s likely to win? What factors do voters use to decide on the award? Who won’t win but had a season worthy of consideration?

The American League finalists each have strong credentials. Justin Verlander had a traditionally dominant season, leading the league in strikeouts and pitching nearly as many innings as the leader, Corey Kluber. Blake Snell won 21 games for a surprise Tampa Bay Rays team and finished with an eye-catching 1.89 ERA. Kluber had a sensational season himself, winning 20 games and walking just four percent of the batters he faced.

The National League also features three excellent candidates. Jacob deGrom finished with the second-best FIP and fourth-best ERA of any qualified starter in the past 30 years. He, or rather his teammates, also made headlines for not scoring many runs while he was on the mound. Max Scherzer turned in another incredible season in what is likely a Hall of Fame career. Gunning for his third straight Cy Young award and fourth overall, the Nationals righty struck out 300 batters and led the NL in innings. Rounding out the top three is the Phillies’ Aaron Nola, coming off a breakout season.

Building the Model

To predict the winners, I once again turned to the xgboost algorithm supported by the caret package. I took data from starters who pitched at least 175 innings (so I could analyze Snell), transformed most of the raw data into league rankings, and used 10-fold cross-validation across a wide range of tuning parameters to find the model that produced the best results. As I did when predicting MVP winners, I evaluated each model using precision, recall, F1, and area under the precision-recall curve. These metrics reflect the rarity of award winners in a large data set. I also subjectively looked at the predictions to ensure they weren’t off base.
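The actual pipeline was built in R with caret and xgboost, which isn’t shown here. As a rough illustration of the 10-fold cross-validation step only, here is a minimal pure-Python sketch; the function names and fold-assignment details are mine, not the author’s code:

```python
import random

def ten_fold_indices(n, seed=42):
    """Shuffle row indices and split them into 10 roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def cross_validate(rows, labels, fit, score):
    """Train on 9 folds, score on the held-out fold, and average the scores."""
    folds = ten_fold_indices(len(rows))
    scores = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        scores.append(score(model, [rows[i] for i in held_out],
                            [labels[i] for i in held_out]))
    return sum(scores) / len(scores)
```

In the article’s setup, `fit` would train an xgboost model on the ranked features at one set of tuning parameters, and `score` would compute a metric such as area under the precision-recall curve; the parameter set with the best averaged score wins.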

I gave the following data points to the xgboost algorithm:

  • Season: to understand whether voting trends change over time
  • League: to understand whether voters treat AL and NL pitchers differently
  • Team Winning Percentage: to understand whether voters care about how well the player’s team performed
  • Innings Pitched
  • The player’s league ranking in the following stats:
    • Wins
    • Losses (ranked so that fewer losses earn a better ranking)
    • WAR (the FIP-based variety)
    • ERA
    • FIP
    • Strikeouts
    • Walks allowed
    • HR allowed
    • K%
    • BB%
    • K-BB%
    • HR/9

To capture current voting trends, I looked at data from the 2007 through ’17 seasons. I ignored relievers, whom I defined as pitchers who started fewer than 50 percent of the games in which they appeared. Just as starting pitchers rarely win an MVP award, relievers win a Cy Young award so rarely that including them isn’t worth the effort.
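To make the rank transformation concrete, here is a minimal Python sketch (hypothetical data and function name; the actual transformation was done in R). Each stat is ranked within a season-league group, with rank 1 being best:

```python
def league_ranks(players, stat, lower_is_better=True):
    """Rank players within each (season, league) group; rank 1 is best.

    `players` is a list of dicts with at least 'name', 'season', 'league',
    and the stat of interest. Returns {player name: rank}.
    """
    groups = {}
    for p in players:
        groups.setdefault((p["season"], p["league"]), []).append(p)
    ranks = {}
    for group in groups.values():
        ordered = sorted(group, key=lambda p: p[stat],
                         reverse=not lower_is_better)
        for rank, p in enumerate(ordered, start=1):
            ranks[p["name"]] = rank
    return ranks

pitchers = [
    {"name": "A", "season": 2017, "league": "AL", "era": 2.25},
    {"name": "B", "season": 2017, "league": "AL", "era": 3.10},
    {"name": "C", "season": 2017, "league": "NL", "era": 2.51},
]
print(league_ranks(pitchers, "era"))  # {'A': 1, 'B': 2, 'C': 1}
```

Stats like ERA, FIP, and losses would use `lower_is_better=True` (fewer is better), while wins, strikeouts, and WAR would flip that flag.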

Here are five random rows from the training data set, with players’ names added for clarity.

Cy Young Predictor Training Data
Player Season League IP Win Rank Loss Rank ERA Rank FIP Rank K Rank BB Rank HR Rank fWAR Rank K% Rank BB% Rank K-BB% Rank HR/9 Rank Team Winning %
Justin Verlander 2010 AL 224.1 4 7 9 2 4 21 3 1 4 16 5 3 0.500
Mike Minor 2013 NL 204.2 15 10 14 16 15 9 28 14 14 10 10 28 0.593
Danny Duffy 2016 AL 179.2 13 2 12 13 9 3 19 19 5 6 4 24 0.500
John Lackey 2013 AL 189.1 26 25 18 21 19 3 24 23 14 5 8 27 0.599
Kevin Gausman 2017 AL 186.2 12 11 12 12 9 16 12 11 8 16 9 13 0.463

Note that players who changed teams midseason were not considered.

Evaluating the Model

The winning model produced the following confusion matrix.

Cy Young Predictor Confusion Matrix
                            Is Cy Young   Is Not Cy Young
Is Predicted Cy Young                21                 1
Is Not Predicted Cy Young             1               662

These results are good! The model identified 21 of 22 winners and 662 of 663 non-winners. That means the precision, recall, and F1 scores are all roughly 0.95 out of a possible 1.00. The following graph shows the precision-recall curve and the area under it; a higher area is better.
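The headline numbers follow directly from the confusion matrix. As a quick sanity check, written here in Python for illustration, with the cell values copied from the matrix above:

```python
# Cells from the confusion matrix above.
tp, fp = 21, 1   # predicted winners: correct, incorrect
fn, tn = 1, 662  # predicted non-winners: missed winner, correct

precision = tp / (tp + fp)  # share of predicted winners who actually won
recall = tp / (tp + fn)     # share of actual winners the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.95 0.95 0.95
```

Because exactly one winner was missed and exactly one non-winner was flagged, precision and recall come out identical (21/22), and F1 matches them.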

These results are much better than those of the MVP predictor model.

In terms of accuracy, the model missed only one award winner.

Predicted and Actual Cy Young Winners, 2007–2017
Season League Predicted Winner Actual Winner Accurate?
2007 AL CC Sabathia CC Sabathia Yes
NL Jake Peavy Jake Peavy Yes
2008 AL Cliff Lee Cliff Lee Yes
NL Tim Lincecum Tim Lincecum Yes
2009 AL Justin Verlander Zack Greinke No
NL Tim Lincecum Tim Lincecum Yes
2010 AL Felix Hernandez Felix Hernandez Yes
NL Roy Halladay Roy Halladay Yes
2011 AL Justin Verlander Justin Verlander Yes
NL Clayton Kershaw Clayton Kershaw Yes
2012 AL David Price David Price Yes
NL R.A. Dickey R.A. Dickey Yes
2013 AL Max Scherzer Max Scherzer Yes
NL Clayton Kershaw Clayton Kershaw Yes
2014 AL Corey Kluber Corey Kluber Yes
NL Clayton Kershaw Clayton Kershaw Yes
2015 AL Dallas Keuchel Dallas Keuchel Yes
NL Jake Arrieta Jake Arrieta Yes
2016 AL Rick Porcello Rick Porcello Yes
NL Max Scherzer Max Scherzer Yes
2017 AL Corey Kluber Corey Kluber Yes
NL Max Scherzer Max Scherzer Yes

Verlander finished third in 2009, so his predicted win here isn’t out of line with reality.

What Matters in Cy Young Voting?

According to the model, the following factors are important in Cy Young voting:

Whether they know it or not, Cy Young voters prefer pitchers who excel in the FIP-based WAR we house here at FanGraphs. Now, I don’t think voters sort our leaderboards by WAR and fill out their ballots accordingly. Rather, WAR neatly bundles the aspects of pitching voters care about: striking out batters, limiting walks, and keeping the ball in the yard, all while throwing a lot of innings.

Pitchers with a lot of wins also do well in Cy Young voting. In other news, water is wet, but it’s always nice to see a model agree with reality. The importance of wins bodes well for Snell and Kluber, who each won 20-plus games. Good old ERA is also important, which elevates Snell’s case, as are innings pitched and strikeouts. Many of these features are correlated with one another, but tree-based methods like xgboost tolerate that redundancy well, which is one reason the algorithm is a good choice for this task.
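Importance charts of this kind typically report relative values, rescaling each feature’s raw gain so the most important feature reads 100. A small sketch of that rescaling (the raw gain numbers below are made up for illustration; the real ones would come out of the fitted xgboost model):

```python
def scale_importance(raw):
    """Rescale raw importance scores so the largest value becomes 100."""
    top = max(raw.values())
    return {feature: round(100 * gain / top, 1) for feature, gain in raw.items()}

# Hypothetical raw gain values, for illustration only.
raw_gain = {"fWAR rank": 0.50, "Win rank": 0.30, "ERA rank": 0.20, "IP": 0.10}
print(scale_importance(raw_gain))
# {'fWAR rank': 100.0, 'Win rank': 60.0, 'ERA rank': 40.0, 'IP': 20.0}
```

So a feature scored at 100 is the one the model leans on most, not one that predicts winners 100 percent of the time.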

Your (Predicted) 2018 Award Winners

Recall that although the Cy Young award is a winner-take-all competition, the xgboost algorithm doesn’t know that. It only estimates the probability of an individual player-season taking home the award, which is why the probabilities within each league don’t add up to 100 percent. Also, the lists below are not intended to indicate predicted placement in the final results.

With that said, the model predicts Snell and Scherzer will take home the hardware:

The results seem reasonable. Snell’s 180.2 innings would be the fewest ever for a Cy Young winner, and his fWAR ranks seventh in the sample. But he ranks No. 1 in ERA, wins, and losses and fifth in strikeouts. He also ranks second (best) in home runs allowed. Unbeknownst to my model, Snell also has the power of narrative behind him. He’s a hot young prospect whose breakout year helped push a team that wasn’t supposed to contend to a 90-win season.

Verlander ranks first in WAR and strikeouts but third in ERA and sixth in wins. He also surrendered a lot of dingers, which helps put his chances behind those of rotation mate Gerrit Cole and a breakout Trevor Bauer. And while we know Kluber, as a finalist, will finish no worse than third, he didn’t do enough in the model’s opinion to outshine either Snell or Verlander. I think he’ll place third in real life.

In the NL, my model thinks deGrom’s stellar season is no match for Scherzer’s strikeout and win ranks. deGrom’s loss total also hurts his chances. Scherzer is also nipping at deGrom’s heels in the WAR, ERA, and FIP rankings. Meanwhile, like Kluber in the AL, Nola had a great season but falls well behind the other two.

That said, I think deGrom will win. His historically low ERA and FIP give voters who might otherwise prefer more wins and fewer losses a strong reason to vote for him. I’m also placing faith in voters’ abilities to recognize the team-centric nature of those two stats.

Finally, like Snell, deGrom has a good story behind him. He’s the guy who elevated the Mets above joke status this year, the guy we always knew was great but finally had a tremendous year that got everyone’s attention. It would be his first award, while Scherzer has already proven himself with three. However, the vote could be closer than some people seem to think.

I’m pleased with this model’s results. Predicting 21 of 22 Cy Young winners over the past 11 seasons gives me a good deal of confidence in its 2018 predictions. We’ll know more when the results are announced soon.

Ryan enjoys characterizing that elusive line between luck and skill in baseball. For more, subscribe to his articles and follow him on Twitter.
5 years ago

It’s a bit odd to say you love your model results and then also think they will be wrong this year. I’m wondering if you have tried to use values other than straight ranks? DeGrom is considered a favourite this year because his ERA is way better than the next guy’s. If he had a 2.30 ERA, he’d still rank first in ERA (and possibly fWAR), but I don’t think he’d have a chance at winning the Cy Young given the same record. A z-score type metric might work better for ERA? Or it might not, and rank would actually perform better in the model; just curious.

5 years ago

How accurate would the model have been if you used the actual values of these statistics, rather than their rankings? Obviously, this model came out very accurate, and I’m curious how different it would have turned out.

5 years ago

How can fWAR be 100% when just last year Chris Sale led Kluber by 0.4, yet didn’t win? Or 2016, when BOTH Cy Young winners didn’t lead in fWAR. Or 2015 with both. Or 2012 with both. Or 2011 with Halladay leading Kershaw. So just in the last 7 years, the fWAR leader in the league hasn’t won the Cy Young in 8 of the 14 situations. After tonight, it’ll probably be 9 of 16. Definitely no 100%.

5 years ago
Reply to  stever20

“importance” is a relative measure for predictive models. The software Mr. Pollack is using apparently sets the most “important” variable to 100 – I’ve used software where the top value gets set to 1, or where the importance total has to sum to 1. It’s not saying it’s 100% predictive, just that the model would lose the most without it.

5 years ago

Does your list of potential winners only include “qualified” pitchers? Because Chris Sale finished second in fWAR, second in ERA, and first in K-rate among starters (sixth if you count all AL pitchers with at least 10 IP).

He’s not one of the finalists, so he certainly is going to remain the best pitcher in baseball without a Cy Young, but your chart of predicted winners shows more than just the finalists.

5 years ago
Reply to  MikeS

The article indicates that he set a minimum of 175 IP.

John Edwards
5 years ago

I think you’re overfitting your model and double-counting a lot of variables. fWAR is effectively FIP & IP combined, but you’ve included both in your model – don’t you think that would skew the result unfairly given that we’re taking into account both of those features, then incorporating another feature that is the product of those two features? You shouldn’t be just throwing features at a machine learning model to see what sticks; that’s how you get an overfitted model with poor predictive power.

Also, I don’t think that using pitchers’ rankings is effective, as it fails to distill differences between pitchers’ skills. As others have pointed out, the model will treat a first-place 1.50 ERA versus a second-place 3.00 ERA the exact same as a first-place 2.50 ERA versus a second-place 2.75 ERA, when the difference is far more substantial, especially with regards to voting.

David Ducksworth
5 years ago
Reply to  John Edwards

While I generally agree with incorporating values instead of solely rankings, Park Factor would be the argument behind including both fWAR and FIP+IP.

John Edwards
5 years ago

Park Factors can only go so far. FIP and IP are the majority factors in fWAR (well, technically it’s ifFIP, but the point stands).

Richard Bergstrom
5 years ago

Poor Freeland…