KATOH Goes to College: Projecting College Hitters
If you’ve read any of my articles at FanGraphs or The Hardball Times, you’re probably at least vaguely familiar with KATOH, the methodology I use to project minor league players’ big league performances using their minor league stats. Essentially, KATOH looks at historical minor leaguers to determine how predictive their minor league numbers are of big league success. Today, I’m broadening my horizons and expanding KATOH to include college baseball, just in time for the upcoming 2015 amateur draft.
Just like the other KATOH models, this one is far from perfect. Stats are only a piece of the player evaluation puzzle, especially for hitters who are only 18-22 years old and still learning the ins and outs of baseball. Things are muddied even further by the varying quality of competition at the college level. There are certainly some college players whose talent approximates that of big leaguers, but there are many more who are nowhere close. Factors like these make a stats-based analysis of college performances dicey. Nonetheless, I came across some metrics that appear to hold some predictive value for players at the college level.
The next few paragraphs will be dedicated to the rigorous math behind my work (some might call it gory). If that’s not your cup of tea, you can skip ahead to the paragraph that starts with “Now that I’ve hashed….” That’s where I actually start to talk about baseball players, and give my model’s forecasts for players expected to go early in next week’s draft.
Just as I did with my previous work, I developed my forecasts by running a series of probit regression analyses. I’m often asked why I use a probit regression in these analyses, rather than the better-known logistic regression. In the past, I didn’t really have much of a reason, other than that probit was what I learned. However, this time, I tested both approaches, and found that the probit gave me slightly lower Akaike information criterion (AIC) values nearly across the board. For this reason, I opted to stick with the probit. However, the two models are nearly identical in practice. Anyhow, these are the thresholds I chose to use in my analysis, regarding a hitter’s performance through age 28.
- Playing in the majors (at least one game)
- >1 WAR
- >2 WAR
- >3 WAR
- >4 WAR
- >5 WAR
- >6 WAR
- >7 WAR
- >8 WAR
- >9 WAR
- >10 WAR
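To make the setup concrete, here is a minimal sketch of how each historical hitter could be turned into one yes/no outcome per threshold, plus the AIC formula used to compare a probit fit against a logit fit. The function names and the example hitter are hypothetical, and this is only an illustration of the idea, not the code behind KATOH.

```python
import math

# WAR thresholds from the list above; "mlb" means at least one big league game.
WAR_THRESHOLDS = list(range(1, 11))


def threshold_labels(played_in_majors, war_through_28):
    """Return one 0/1 outcome per threshold for a single historical hitter."""
    labels = {"mlb": int(played_in_majors)}
    for t in WAR_THRESHOLDS:
        # A hitter who never reached the majors cannot clear any WAR threshold.
        labels[f">{t} WAR"] = int(played_in_majors and war_through_28 > t)
    return labels


def bernoulli_log_likelihood(outcomes, probs):
    """Log-likelihood of 0/1 outcomes given a model's predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probs)
    )


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better on the same data."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical hitter with 3.4 WAR through age 28.
example = threshold_labels(True, 3.4)
print(example)
```

Each threshold gets its own regression, so a probit and a logit fit to the same outcome column can be compared by plugging each model's log-likelihood and parameter count into `aic`.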
When I developed my KATOH models for minor leaguers, I built a different set of regression models for each level of the affiliated minor leagues. In other words, I built my models for Triple-A, then built another set for Double-A, and so on. Taking it level-by-level allowed me to pick up on some of the nuances of each individual level. For example, a hitter’s walk rate is fairly predictive in the high minors, but tells us little to nothing at the lower levels. A similar phenomenon takes place with minor league pitchers.
My original intention was to employ a similar methodology with the various college baseball conferences. Just as I built a set of models for each level of the minors, I would build a set of models for each college conference to pick up on any nuances of a specific conference. Unfortunately, the data available to me made this rather difficult. My source for these college stats — Chris Long’s database — had data going back only to 2002. Plus, it’s too soon to do much with the last few years of college data since the jury’s still out on a player who’s still in his mid-20s. That left me with only a few years of college data to play with.
As if that weren’t enough, only a very small fraction of college players go on to have major league success. For most conferences, there have been only a few players since 2002 who went on to be impact big leaguers. Naturally, it would have been dicey trying to forecast something that’s happened only a few times in my entire sample of data.
Instead, I took an alternative (and less time-consuming) approach of including a player’s conference as a categorical variable in my model. Basically, each player receives a probability (well, technically he receives a Z-score) based on his stats, and this value is then adjusted based on his conference. Due to data constraints, I excluded the conferences that have produced only a handful of big leaguers over the time period I examined. Below are the conferences I included, ranked from highest to lowest using the coefficients my regression model spat out for said conferences in my “Making the Majors” model.
- Pac-12 Conference
- Southeastern Conference (SEC)
- Big 12 Conference
- Atlantic Coast Conference (ACC)
- Big West Conference
- West Coast Conference (WCC)
- Conference USA
- Mountain West Conference
- Sun Belt Conference
- American Athletic Conference (AAC)
- Big Ten Conference (B1G)
- Big South Conference
- Atlantic Sun Conference
- Colonial Athletic Association Conference (CAA)
These data don’t say that the Pac-12 is the best college baseball conference. In fact, they answer a different question entirely. All they say is that if you take players with identical numbers, one from each conference, the one coming from the Pac-12 has the best shot at playing in the majors.
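The conference adjustment can be sketched as a shift applied to a player's Z-score before it is converted to a probability. In the snippet below, every coefficient, the intercept, and the standardized stat line are made up for illustration; only the structure (a stats-based Z-score plus a conference term, passed through the normal CDF) reflects the probit setup described above.

```python
from statistics import NormalDist

# Hypothetical values, invented for illustration only.
INTERCEPT = -1.6
STAT_COEFS = {"bb_rate": 0.30, "k_rate": -0.25, "double_rate": 0.40}
CONFERENCE_OFFSETS = {"Pac-12": 0.35, "SEC": 0.30, "CAA": 0.0}


def probit_probability(stats, conference):
    """Convert a stat line and a conference into a probability via a probit link."""
    # Z-score from the (hypothetically standardized) stats alone.
    z = INTERCEPT + sum(STAT_COEFS[name] * value for name, value in stats.items())
    # The categorical conference term shifts the Z-score up or down.
    z += CONFERENCE_OFFSETS[conference]
    # The standard normal CDF maps the Z-score to a probability.
    return NormalDist().cdf(z)


# The same stat line, placed in two different conferences.
same_line = {"bb_rate": 1.0, "k_rate": -0.5, "double_rate": 1.2}
p_pac12 = probit_probability(same_line, "Pac-12")
p_caa = probit_probability(same_line, "CAA")
print(round(p_pac12, 3), round(p_caa, 3))
```

With identical stats, the only difference between the two probabilities comes from the conference offset, which is the sense in which the ranking above should be read.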
In the past, my KATOH analyses have revolved around five offensive metrics: Walk rate, strikeout rate, isolated power, batting average on balls in play and stolen base frequency. This time, I tried mixing things up a little bit, and found a different combination that gave me a better fit. As a result, the variables listed below look a little different than the ones I’ve used historically.
Fortunately, the data set I used (courtesy of Chris Long) included a position for each college player, which enabled me to account for a player’s defensive position in my analysis. This isn’t something I was able to do with my previous models due to the difficulty of collecting minor league defensive data. (If anyone has access to such data, or has any ideas for easily collecting it for historical minor leaguers, please let me know!)
Here’s the run-down of the variables that made it into my model. All of these variables were adjusted for conference-wide average, and regressed toward that average based on sample size. I did not test my models out of sample due to sample size limitations. However, all of my included variables tested as statistically significant, and had coefficients in the directions you’d expect.
KATOH College Model Variables

Variable | Definition |
---|---|
BB% | BB/PA |
K% | K/PA |
1B% | 1B/PA |
2B% | 2B/PA |
3B% | 3B/PA |
HR% | HR/PA |
SB% | SB/(1B+BB+HBP) |
Conference | Hitter’s team’s athletic conference |
Class year | Hitter’s listed class year |
Position | Player’s listed defensive position |
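As a rough sketch of the preprocessing described above, the snippet below computes the rate stats from a season line and regresses one of them toward the conference average. The stabilizing constant of 200 PA, the field names, and the sample season are all assumptions made for illustration; the article does not state the actual regression amounts.

```python
def rate_stats(line):
    """Compute the per-PA rates from the table above for one season line."""
    pa = line["PA"]
    # Opportunities to steal, per the SB% definition: 1B + BB + HBP.
    times_on_first = line["1B"] + line["BB"] + line["HBP"]
    return {
        "BB%": line["BB"] / pa,
        "K%": line["SO"] / pa,
        "1B%": line["1B"] / pa,
        "2B%": line["2B"] / pa,
        "3B%": line["3B"] / pa,
        "HR%": line["HR"] / pa,
        "SB%": line["SB"] / times_on_first if times_on_first else 0.0,
    }


def regress_to_mean(rate, pa, conference_rate, stabilizer_pa=200):
    """Shrink a rate toward the conference average; fewer PA means more shrinkage.

    stabilizer_pa is a hypothetical constant, not a value from the article.
    """
    return (rate * pa + conference_rate * stabilizer_pa) / (pa + stabilizer_pa)


# Hypothetical season line and conference walk rate.
season = {"PA": 250, "1B": 50, "2B": 20, "3B": 2, "HR": 8,
          "BB": 30, "HBP": 5, "SO": 40, "SB": 12}
rates = rate_stats(season)
regressed_bb = regress_to_mean(rates["BB%"], season["PA"], conference_rate=0.10)
# Express the regressed rate relative to the conference average.
bb_vs_conference = regressed_bb - 0.10
print(round(regressed_bb, 4), round(bb_vs_conference, 4))
```

A hitter with only a handful of plate appearances ends up close to the conference average, while a full-season regular keeps most of his observed rate.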
So those are the variables that made it into my model. These are important to know, of course, but a straight list of variables tells only part of the story. It’s also important to know how each of these variables plays a role in predicting players’ future production.
Let’s take a look at the offensive stats I included in my model to see how they influence a player’s big league success. The graphic below plots some of my models’ coefficients across my entire spectrum of WAR thresholds. Basically, the further from zero a metric is, the more important it is. The metrics above zero — walks, singles, doubles, triples, homers and steals — are all good. A high strikeout rate is bad, which is why it comes in below zero.
Based on these data, the ability to hit for doubles power is the most important predictor of a player’s making the majors or earning a few WAR. However, it’s interesting to see that walk rate and single rate both gain a good amount of steam at the high end of the spectrum. This implies that hitters who get on base often in college tend to have higher upsides — but possibly lower floors — than hitters who don’t.
Now, let’s break things up by position. Going by the positions listed in Chris Long’s data set, here’s a look at how a player’s defensive position influences his probability of having success in the majors. I normalized these coefficients to the group of hitters who did not have a position listed in the data set. In other words, a player with no listed position would come in at “0” across the board. The chart below reflects the other positions’ distance from this group. Keep in mind that these coefficients show the effect of a player’s defensive position after accounting for all of the other variables listed above.
Surely, you noticed that a few of the positions listed (second base, designated hitter and first base) suddenly drop off the table at a certain point. This is because there were exactly zero players in my analysis who played those positions and also managed to earn more than a few WAR. In my models’ eyes, this means that players who man those positions in college have almost no chance of becoming anything more than role players in the majors. However, just because something didn’t happen in my sample of a few years doesn’t mean it can’t ever happen. There most certainly can be a black swan college second baseman worth more than a few WAR, but they just don’t come around very often. Note that Dustin Pedroia was listed as an “infielder” in the data set, as he spent significant time playing shortstop in college.
Interestingly, both second basemen and designated hitters come in near the top of the spectrum when it comes to making it to the majors. So while they rarely turn into stars, players of this caliber appear to have relatively high floors. On the other hand, if you’re looking for an upside college bat in the draft, selecting a catcher, shortstop or third baseman seems like your best bet.
Finally, let’s look at a player’s class year. I normalized these coefficients to college freshmen. In other words, a freshman hitter would come in at “0” across the board. The chart below reflects the other classes’ distance from this group.
Nothing too surprising here. Just as with minor league prospects, a standout season from a college prospect is more impressive the younger he is. What stands out is how much college seniors get dinged relative to their underclassman counterparts, though that makes sense on reflection. Generally speaking, pretty much all of the best college players get drafted after their junior year. So if a hitter sticks around for his senior season, there’s likely a reason for it. More likely than not, that reason is that he’s not very good.
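The baseline normalization used for both position and class year can be sketched as standard dummy coding: the reference group gets no indicator, so its effect is zero by construction, and every other group's coefficient is measured as a distance from it. The class-year coefficients below are invented for illustration (only their direction follows the discussion above), and the function names are hypothetical.

```python
def dummy_code(value, levels, baseline):
    """One 0/1 indicator per non-baseline level; the baseline gets all zeros."""
    return {level: int(value == level) for level in levels if level != baseline}


CLASS_YEARS = ["Fr", "So", "Jr", "Sr"]

# Hypothetical coefficients relative to freshmen, made up for illustration.
CLASS_COEFS = {"So": -0.1, "Jr": -0.2, "Sr": -0.6}


def class_year_effect(class_year):
    """Z-score shift for a class year, measured relative to freshmen."""
    indicators = dummy_code(class_year, CLASS_YEARS, baseline="Fr")
    return sum(CLASS_COEFS[level] * flag for level, flag in indicators.items())


for year in CLASS_YEARS:
    print(year, class_year_effect(year))
```

The same pattern applies to position, with the "no position listed" group playing the role that freshmen play here.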
Now that I’ve hashed out all of my technical mumbo-jumbo, I’m finally ready to apply all of this math to this year’s crop of college hitters. Below, you’ll find the KATOH projections for the college hitters who are included in Kiley McDaniel’s handy (and sortable!) draft board. In a perfect world, I would have generated predictions for all current college players. However, Chris Long’s database does not include 2015 stats, and the data available from the team and conference websites do not include everything I’d need in a readable format. Nonetheless, I plan to gather all of the necessary data for the hitters taken in the first several rounds of the draft, and have forecasts for these players next week. In the meantime, let’s scout some stat lines!
KATOH 2015 College Hitter Projections

Kiley Rank | Player | Pos | MLB | >1 WAR | >2 WAR | >3 WAR | >4 WAR | >5 WAR | >6 WAR | >7 WAR | >8 WAR | >9 WAR | >10 WAR | WAR Thru 28 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Dansby Swanson | SS | 88% | 67% | 60% | 50% | 44% | 40% | 38% | 36% | 41% | 33% | 29% | 6.1 |
3 | Alex Bregman | SS | 97% | 92% | 92% | 91% | 87% | 81% | 81% | 77% | 81% | 75% | 58% | 11.2 |
8 | Andrew Benintendi | OF | 81% | 80% | 71% | 74% | 75% | 70% | 60% | 49% | 48% | 48% | 44% | 8.6 |
9 | Ian Happ | 2B | 23% | 8% | 7% | 8% | 4% | 4% | 5% | 7% | 8% | 0% | 7% | 1.0 |
18 | Kevin Newman | SS | 77% | 63% | 65% | 65% | 57% | 52% | 37% | 30% | 25% | 21% | 21% | 5.6 |
28 | Donnie Dewees | OF | 63% | 44% | 25% | 22% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 1.2 |
32 | D.J. Stewart | OF | 15% | 13% | 13% | 11% | 12% | 13% | 7% | 8% | 5% | 3% | 3% | 1.1 |
33 | Scott Kingery | 2B | 51% | 32% | 17% | 13% | 10% | 10% | 4% | 2% | 2% | 0% | 5% | 1.4 |
36 | Mikey White | SS | 47% | 21% | 11% | 6% | 4% | 4% | 3% | 3% | 5% | 3% | 3% | 1.0 |
39 | Richie Martin | SS | 22% | 14% | 10% | 10% | 16% | 15% | 10% | 5% | 5% | 4% | 5% | 1.3 |
41 | Blake Trahan | SS | 5% | 5% | 5% | 10% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0.2 |
49 | Joe McCarthy | OF | 1% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0.0 |
54 | Chris Shaw | 1B | 19% | 8% | 7% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0.3 |
63 | Taylor Ward | C | 6% | 1% | 1% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0.0 |
67 | Kyle Holder | SS | 15% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0.1 |
74 | Harrison Bader | OF | 11% | 3% | 2% | 1% | 1% | 1% | 0% | 0% | 0% | 0% | 0% | 0.1 |
Unsurprisingly, the most highly touted hitters get the best projections. Dansby Swanson and Andrew Benintendi are each expected to go in the early part of the first round, and each is projected to be a useful major leaguer. Note that I treated Ian Happ and Scott Kingery as infielders rather than second basemen. Otherwise, my model basically gave them no chance of earning even two WAR.
So that’s college KATOH for you. This isn’t to say that all, or even most, of the hitters who earn good forecasts will be useful big leaguers. It doesn’t even mean they’ll be good minor leaguers. It’s merely saying that players who excel in the same areas go on to do good things in the majors more often than those who don’t. In my next post, I’ll break down college pitchers.
References and Resources
- Chris Long’s database of college statistics
- Special thanks to Jeff Zimmerman and Peter Melgren for helping me gather the data and link to players’ major league stats
- Special thanks to Alex Chamberlain for helping out on the data analysis side of things
Great read!
This is great. Will we get to see any historical college KATOH forecasts similar to your other work? Would love to see the biggest hits/misses.
Yeah, I can do that. For now, I’ll be focusing on the current players, which will take up a lot of my time. But I’d like to do a “hits and misses” piece at some point.
Awesome stuff. I do wish you would look at offensive WAR, not overall WAR, however. Very hard to compare the offensive value of a SS vs. an OF when they have vastly different (and unknown to the reader) defensive values baked in.
Other thoughts:
Did you have access to age, and if so, did it help?
Boyd’s World (http://www.boydsworld.com/data/) has park factors going back to 1999. Might be interesting to test.
Also, this is assuming these guys will stay at their positions. There are concerns about Bregman being a SS, and there is a chance that he would move over to 2B.
I did not have access to age data, unfortunately, but I think class year acts as a very good proxy for it.
Can you clarify if you are using just this season’s college stats, or some (possibly weighted) average of his college career stats? I would think a weighted average career would be the right approach, but not clear what you chose to do. Thanks.
This was based solely on (regressed) single season stats. I’ve considered doing some sort of weighted average, but am mulling over how to go about it since not all players have multiple years of data.
Hi Chris,
Very interesting stuff–I’ve followed your KATOH work on FanGraphs, and I think that projecting prospect performance is a fascinating endeavor. Your use of ordered probit models, rather than a straight OLS regression, is also a solid innovation–props for doing all of the grunt work and analysis necessary to construct something like this!
That said, I have a few questions/comments:
1. As with the original KATOH (and as others have said), I tend to think that using WAR as the dependent variable instead of just oWAR basically just introduces noise into the model–speed isn’t a strong enough proxy for defensive ability, and it confuses the question your model is best at answering, i.e., ‘Can this guy actually hit in the majors?’ I think the addition of the positional fixed effects is a big step forward, and makes WAR a less-imperfect fit, but I’d still rather have a more accurate model that answers a less-ambitious question (future hitter quality rather than future player quality) instead of a less-accurate one that attempts to account for every aspect of player quality.
2. I think the fixed effect for athletic conference is a neat way to try to account for quality of opposition–kudos!
3. How large was the sample size for this analysis? This ties into my comment below . . .
4. I’m a bit skeptical of over-fitting, given the (probably) limited sample size of college hitters who’ve also played in the MLB during this time period. This is primarily directed at using 1B%, 2B%, and 3B% instead of ISO/ISOCON/BABIP or something else. I’d be curious to know what the rationale (better T scores? ISO just not being significant?) was.
5. How accurate is the model? I know that’s a bit of a tricky question for probit and logit models given that AIC is harder to interpret than r-squared and RMSE, but it would be interesting to see how much better the model performs than one that simply ranks players by their draft position (or an aggregate of their pre-draft mock draft ranking).
Best,
Joshua
Glad you’ve enjoyed it! Always good to hear constructive criticisms.
1. That’s a very fair point, and it’s something I’ve thought about a lot. The issue with zeroing in on offensive WAR, though, is that a player’s offensive opportunity is often influenced by his position/defense. For example, given the same offensive ability, a shortstop is much more likely to be called up to the majors (and to stick around) than a first baseman. I agree that speed is/was a weak proxy for defense, but I think that defensive position + speed gets at a player’s defensive skill pretty well.
2. Thanks!
3. My sample was 6,614 hitters, 452 of whom played in the majors.
4. I chose to break them out because it gave me a better fit than using ISO or similar metrics. I became a bit skeptical of using ISO in these analyses since it assigns an arbitrary weight to each outcome.
5. I haven’t made any comparisons like that yet, but I like the idea.
I’m curious if draft position would help as an input? I’m assuming, of course, that your goal is to create the most accurate prospect system possible, rather than something that is totally independent of industry opinion.
I didn’t try that, but it’s an interesting idea. Honestly, I’m not necessarily sure making my model as accurate as possible is the ultimate goal. I think part of what makes KATOH interesting is that it isn’t influenced by things like industry opinions and scouting. As constituted, it’s completely objective, and relies entirely on the stats.
True. I guess the utility of the system would be fairly limited if it only projected players after the draft…
Anyway, I hope you can create a multi-season model and update it throughout Feb-June next year!
That’s the plan!
It seems like you got into some messy overfitting issues because you used ordered probit instead of a continuous DV. I assume you were doing that to avoid 80 WAR outliers, but using log WAR as your DV would make more sense as a solution. You get easier to interpret coefficients and generally internally consistent results with a linear model (or if you’re feeling fancy, polynomial). That’s why you see Happ with a 0% chance for >9 but a 7% chance of >10 and similar issues with Bregman, Benintendi, Stewart, etc. Ordered probit is quite complex, and I don’t think it’s helping you here (unless you want K% to have a stronger influence on whether a player amasses 3 WAR than on whether a player amasses 8 WAR). I know it’s harder to get probability outputs from a linear model, and I wish I were confident enough in my ability to play with standard errors to tell you how, but I know that’s the preferable way of going about this.
Also, I think there are some interactions worth exploring in KATOH hitter models, especially BB%*K%, which I think would be strongly negative (i.e. the guys who walk aren’t special unless they show plate discipline by avoiding strikeouts). And I second others’ call for oWAR (incl. baserunning), because there’s no good way to deal with predicting fielding.
Regardless, I’m enjoying everything you’ve done with KATOH, I can imagine how brutal those spreadsheets were. It’s fascinating to get some good, statistically minded draft coverage. Keep tweaking it! I’m pulling for Gosuke Katoh, and go Yankees.