A New Way to Look at College Players’ Stats

Alex Bregman hasn't experienced much regression in his numbers since turning pro. (via thatlostdog)

With the College World Series upon us this month, I wanted to examine college baseball analytics. People like Chris Long and Chris Mitchell (with his KATOH Projections) have calculated which players are more likely to make the transition from college to the majors. For this article, I will mainly look at just the college game, though I will use draft data to compare conference quality. I will show how players at NCAA Division I and Junior College Division I age and the expected regression factor for some stats. Additionally, I did a small sample study on how players perform as they move from the JUCO ranks to Division I. And as with all good data dives, I will leave with more questions for both you and me to ponder in future studies.

College stats are not the easiest to collect. For Division I, I used data from the NCAA’s website. The data were messy at times: some players didn’t have a grade associated with them, and stats were added or columns rearranged from season to season. In the end, I eliminated about two percent of the data for various reasons. I had to choose between spending days aiming for a 100 percent clean data set and getting this study done in a semi-timely fashion. The amount of lost data is around six teams’ worth. Since I was originally dealing with 281 teams, the new data set has ~275 teams’ worth of data. I see no reason to be suspicious of the results because of this, as there is still a huge sample remaining.

Now if you need reasons to suspect the results, here are some concerns I have. When looking at the aging data and JUCO to NCAA translations, I compared how the players did by listed grade, not by age. Players can take six years to play out their college careers if they redshirt a season and get an extra season of eligibility for medical reasons. I wasn’t able to link each player back to see if he redshirted at any time, and obviously someone who did redshirt may be older than the others in his grade. One way I reduced this bias was by looking only at those who played in consecutive seasons.

One other issue that could come into play is the best players getting drafted after their junior season. In other words, the best talent is removed before aging from junior to senior season. While this is just a handful of players, the change could be measurable, especially in the major conferences.

A third issue is that the pitching data at times didn’t make sense (as seen with the Division I starting pitcher aging). Pitchers in college change roles regularly, and pinning down starters versus relievers is very time-consuming. In several areas of study, I just had to raise my hands in surrender for this particular article. I may end up writing an entire article on this specific subject.

Finally, I didn’t adjust any numbers based on the quality or run environment of the various leagues. I know that teams that play at higher elevations have more runs scored against them. Baby steps for now.

Hitter Regression Amount

Regression, or how far a player’s stats should be pulled toward the league average, is a major obstacle in understanding college stats. The biggest issue is that college players don’t play as many games as major leaguers. A Division I hitter who plays all four years will at the most have around 1,000 plate appearances, which is the equivalent of two full minor league seasons or less than two full major league seasons. In the majors, it takes thousands of at-bats for some stats to stabilize. College kids just don’t get this many chances.

As a result, I knew I might not find much, but there are definitely some good results at hand. First, here are the stabilization points for some common stats. The stabilization point doesn’t mean the player’s season (or career) stats are his true talent level. At the stabilization point, the best estimate of the player’s talent level is half his production and half the league average. For any value over 700 plate appearances for D1 and 250 plate appearances for JUCO, I found the stabilization point by projecting the value using a linear growth rate. This is far from an ideal method, and its flaws can be seen in the difference between the D1 and Junior College (JUCO) values.

REGRESSION AMOUNTS (NCAA DIV. 1)
Stat    Regression Amount (PA)    League Average
K%      225                       0.165
ISO     615                       0.115
OBP     950                       0.358
BB%     950                       0.093
wOBA    1000                      0.326
SLG     1200                      0.389
AVG     1625                      0.274
wOBA is a stat that attempts to weight each of a hitter’s offensive events by how much run value it provides. The formula I used is explained here.

No real surprises here, as the college stabilization points line up similarly with the MLB points. Power (ISO) and plate discipline (BB% and K%) stabilize before stats that depend on defenders and randomness, like batting average. The only two stats that may stabilize over a typical college player’s career are ISO and strikeout percentage.
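To make the regression amounts concrete, here is a minimal sketch of how they could be applied. It assumes the common regression-to-the-mean formula, in which the regression amount is treated as that many plate appearances of league-average performance added to the hitter's line. The function name and the example hitter are hypothetical; the table values are copied from the D1 table above.

```python
# Regression amounts (in plate appearances) and league averages,
# copied from the NCAA Div. I table above.
D1_REGRESSION = {
    "K%": (225, 0.165),
    "ISO": (615, 0.115),
    "OBP": (950, 0.358),
    "BB%": (950, 0.093),
    "wOBA": (1000, 0.326),
    "SLG": (1200, 0.389),
    "AVG": (1625, 0.274),
}


def regress_to_mean(observed, pa, stat, table=D1_REGRESSION):
    """Blend an observed rate with the league average.

    Assumed formula (not confirmed by the article): the regression amount
    acts as extra plate appearances of league-average production.
    """
    regression_pa, league_avg = table[stat]
    return (observed * pa + league_avg * regression_pa) / (pa + regression_pa)


# Hypothetical hitter: a .200 ISO over 615 plate appearances.
estimate = regress_to_mean(0.200, 615, "ISO")
print(round(estimate, 4))
```

With exactly 615 plate appearances, the ISO estimate lands halfway between the hitter's .200 and the league's .115, which is what the stabilization point means. With fewer plate appearances, the estimate sits closer to the league average.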

These results help to verify the findings of Chris Mitchell and his KATOH projection system. He also found strikeouts and doubles power to be the leading indicators of future major league performance. They are likely the best indicators because they have stabilized and give a good picture of the player’s talent.

Besides Division I players, I’m interested in the JUCO ranks and ran the numbers for NJCAA Division I hitters.

REGRESSION AMOUNTS (NJCAA DIV. 1)
Stat    Regression Amount (PA)    League Average
K%      200                       0.160
ISO     235                       0.121
SLG     245                       0.421
BB%     300                       0.101
wOBA    400                       0.358
OBP     450                       0.387
AVG     800                       0.299

Much the same, but with fewer plate appearances available to determine the regression amounts, the values look to stabilize sooner.

Aging Curves

For the aging curves, I lined up how D1 players performed from their freshman to sophomore, sophomore to junior, and junior to senior seasons. To qualify, they needed to be on the same team and play in consecutive seasons. If a player changed teams or missed a season due to injury or being redshirted, his numbers for those seasons were not used.

The aging curve was created with the delta method, weighting each player’s change by the harmonic mean of his plate appearances in the two seasons. With this method, there’s a small survivor bias, which was summarized in a previous THT article by Mitchel Lichtman:

I have also explained the problem of survivor bias, an inherent defect in the delta method, which is that the pool of players who see the light of day at the end of a season (and live to play another day the following year) tend to have gotten lucky in Year 1 and will see a “false” drop in Year 2 even if their true talent were to remain the same. This survivor bias will tend to push down the overall peak age and magnify the decrease in performance (or mitigate the increase) at all age intervals.
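As a rough illustration of the delta method described above, the sketch below computes one step of an aging curve (for example, freshman to sophomore) from paired seasons. The function names and sample numbers are hypothetical, and it assumes each player's change is weighted by the harmonic mean of his plate appearances in the two seasons. It makes no correction for the survivor bias discussed in the quote.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two plate-appearance totals (0 if both are 0)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def delta_method(paired_seasons):
    """Weighted average year-to-year change in a rate stat.

    paired_seasons: iterable of (stat_year1, pa_year1, stat_year2, pa_year2)
    for players who appeared in consecutive seasons.
    """
    weighted_change = 0.0
    total_weight = 0.0
    for stat1, pa1, stat2, pa2 in paired_seasons:
        weight = harmonic_mean(pa1, pa2)
        weighted_change += weight * (stat2 - stat1)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no plate appearances to weight by")
    return weighted_change / total_weight


# Hypothetical wOBA pairs: (freshman wOBA, PA, sophomore wOBA, PA).
sample = [
    (0.300, 200, 0.330, 200),
    (0.350, 100, 0.340, 300),
]
print(round(delta_method(sample), 4))
```

The harmonic mean keeps a player with one tiny season from counting as much as a player with two full seasons, since the weight is pulled toward the smaller of the two plate appearance totals.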

Hitters (D1)

[Figures: D1 hitter aging curves (bat_batting, bat_plate)]

The results look to follow the normal aging pattern of players initially getting significantly better, but the amount of change declines with time. The one data point that sticks out is the drop in ISO from the junior to senior season.

Pitchers (D1)

[Figures: D1 pitcher aging curves (pitch_run, pitch_plate, pitch_babip, pitch_hr9)]

Again, no real surprises, with pitchers generally getting better over the four seasons at a decreasing rate. The one exception is starting pitchers. They see a nice improvement from their freshman to sophomore season, but then the rate of improvement drops off from their sophomore to junior season. Finally, there is a huge jump from junior to senior. The more I have thought about this anomaly, the more I think it is just that, an anomaly. I can’t come up with any reason for the change. If you have an idea or theory, please let me know in the comments.

Now here is a look at how top division junior college hitters age.

Hitters (JUCO)

NJCAA (D1) AGING FACTOR FROM FRESHMAN TO SOPHOMORE SEASON
Stat    Change from Freshman to Sophomore
wOBA    0.028
AVG     0.016
OBP     0.019
SLG     0.053
ISO     0.037
K%      -0.4%
BB%     0.6%

All the numbers are headed in the expected directions, with the ISO jump seeming to be the most significant. It seems like JUCO players are able to make some significant progress in the weight room once on campus.

JUCO to D1 Hitter Translations

This small study originated when I saw two local (Kansas) players, Garrett Benge of Cowley College and Brylie Ware of Neosho County Community College, put up some video game-like numbers the last couple of seasons. In 2015, Benge (who now plays for Oklahoma State University) posted a .502/.601/.946 slash line, while Ware — who is now committed to the University of Oklahoma — hit .560/.660/1.128 this season. Both players won JUCO player of the year honors, but what I was curious about is how they should perform when making the jump to D1.

I wanted to find the average amount of decline for a hitter as he moves from the JUCO ranks to D1. Finding these players is sort of a scavenger hunt and not a simple query. I kept the search to batters making the transition in my area of the country, looking at JUCO regulars who recently made the move to the Big 12 or Missouri Valley Conference. I looked at their last season of JUCO stats and the stats they posted in their first season after they transferred. Here are the average and median values for players making the jump.

TRANSLATIONS FROM JUCO TO NCAA D1
Stat    Average Change    Median Change
AVG     -0.063            -0.063
OBP     -0.058            -0.057
SLG     -0.096            -0.067
ISO     -0.032            -0.007
wOBA    -0.053            -0.056
K%      4.4%              4.5%
BB%     -0.7%             -0.6%

As the data show, hitters should expect a sizable drop in their stats as they make the jump. The biggest driver is the drop in batting average, which also accounts for most of the slugging decline, since ISO falls far less. Some of the drop comes from an increase of about 4.5 percentage points in their strikeout rate. Aside from the strikeout rate, I am pretty sure improved defenses help turn more batted balls into outs. Finally, there is probably some selection bias going on, with hitters who had some BABIP luck in JUCO seeing their average drop back down to their talent level once at a D1 school.
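As a back-of-the-envelope illustration, the average changes in the table can be added to a JUCO slash line to get a naive D1 expectation. This is only a sketch. The function name is hypothetical, the changes come from a small regional sample, and nothing here regresses the extreme JUCO lines toward the league average first, so it will likely overstate what the two hitters mentioned earlier should do.

```python
# Average changes from the JUCO-to-D1 translation table above.
AVERAGE_CHANGE = {"AVG": -0.063, "OBP": -0.058, "SLG": -0.096}


def translate_juco_line(juco_line, changes=AVERAGE_CHANGE):
    """Add the average JUCO-to-D1 change to each stat in a slash line."""
    return {stat: round(value + changes[stat], 3)
            for stat, value in juco_line.items()}


# Slash lines quoted in the article.
benge_2015 = {"AVG": 0.502, "OBP": 0.601, "SLG": 0.946}
ware_this_season = {"AVG": 0.560, "OBP": 0.660, "SLG": 1.128}

for name, line in [("Benge", benge_2015), ("Ware", ware_this_season)]:
    d1 = translate_juco_line(line)
    print(name, d1["AVG"], d1["OBP"], d1["SLG"])
```

Swapping in the median changes is a one-line edit to the dictionary, and it matters most for slugging, where the average and median differ by nearly 30 points.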

I stayed away from the pitchers, as most of those who made the jump went from JUCO starter to D1 reliever. The many dynamics of college pitchers is something I hope to explore in the future.

Conference Talent Level Using Draft Slot

For this study, I looked at the major league draft from 2013 to 2015 and grouped the college players drafted by their D1 conference. If the player wasn’t from a D1 conference, I grouped him with the other players from his level instead. The separate levels are NCAA DII and DIII, JUCO I, II, and III, CCCAA, NW JUCO, NAIA and California Christian. To start with, here is the total number of players drafted by conference and level for the three years.

[Figure: draft_totalV3, total players drafted by conference and level, 2013 to 2015]

The first item that really stands out is that JUCO and Division II players make up the two largest groups. These top two levels don’t have a ton of top shelf talent, as we will see next, but they are heavily scouted for possible diamonds in the rough. The list shows the expected top D1 conference rankings, with the SEC, ACC, and Pac-12 taking up the top spots.

While DII and JUCO have the most players, how about the picks’ quality? I first measured this by looking at the average number of picks from the draft’s first 11 rounds. Recently, most of the top talent has been taken in these rounds because of how the slotting bonuses work. Most of the players drafted after the 11th round are likely to have less ability.

[Figure: draft_11RoundsV3, picks from the first 11 rounds by conference and level]

JUCO and DII (and all the other smaller divisions) dropped, with the SEC, ACC, and Pac-12 jumping to the top. Besides the number of picks in the first 11 rounds, I thought it would also be instructive to see how many draft picks it took for each conference to have its 10th player selected.

[Figure: draft_10thPlayerV3, picks needed for each conference’s 10th player to be selected]

As you can see, this graph is smaller, as not every conference has 10 players drafted in a single season. The conferences and levels that fell short are: AEC, America East, CAA, CCCAA, Horizon, Ivy, JIII, MAAC, MEAC, NCCAA, NEC, NWCC, Patriot, Summit, Sunshine and SWAC.

Of those conferences on the graph, we can see once again that the traditional baseball conferences top the list.
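The three draft measures used in this section can be sketched in a few lines. This is a hypothetical illustration for a single draft year, not the code behind the graphs. It assumes each pick is recorded as an (overall pick number, round, conference or level) tuple, and the sample picks in the usage example are made up.

```python
from collections import Counter, defaultdict


def draft_summary(picks, max_round=11, nth=10):
    """Summarize one draft year by conference or level.

    picks: iterable of (overall_pick, round_number, conference) tuples.
    Returns (total picks, picks within max_round, overall pick number at
    which each conference had its nth player taken, or None if it never did).
    """
    totals = Counter()
    early = Counter()
    pick_numbers = defaultdict(list)
    for overall_pick, round_number, conference in picks:
        totals[conference] += 1
        if round_number <= max_round:
            early[conference] += 1
        pick_numbers[conference].append(overall_pick)
    nth_pick = {
        conference: sorted(numbers)[nth - 1] if len(numbers) >= nth else None
        for conference, numbers in pick_numbers.items()
    }
    return totals, early, nth_pick


# Made-up picks, using nth=2 so the tiny sample produces a result.
sample_picks = [
    (1, 1, "SEC"), (5, 1, "ACC"), (40, 2, "SEC"),
    (30, 1, "JUCO"), (400, 13, "JUCO"), (410, 14, "JUCO"),
]
totals, early, second_pick = draft_summary(sample_picks, nth=2)
print(totals["JUCO"], early["JUCO"], second_pick["SEC"], second_pick["ACC"])
```

In this toy sample the JUCO group has the most total picks but only one inside the first 11 rounds, which is the same pattern the graphs show at full scale.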

I have thought of several ways these tables can be useful. First, they can help to establish each conference’s relative strength in future discussions. They can also help a major league team decide how to distribute its scouting resources to find the most talent.

Future Work

There is plenty more that can be explored in further research. I already mentioned the challenges of pitching roles and correctly interpreting the data. I would also like to build aging curves for specific conferences (the SEC) and teams (Stanford hitters come to mind). How does the aging curve compare for NCAA D2 players and JUCO Division 2? Do JUCO stars have a harder transition to some D1 conferences than others? Another major factor to include would be league and park factors to help account for different scoring environments. The list of possibilities for further research is as long as this article.

Hopefully, the preceding tables and graphs provide enough new information to chew on for a while. From aging curves to conference rankings, the college game can be better understood. Enjoy the end of the college baseball season!

Jeff, one of the authors of the fantasy baseball guide, The Process, writes for RotoGraphs, The Hardball Times, Rotowire, Baseball America, and BaseballHQ. He has been nominated for two SABR Analytics Research Awards for Contemporary Analysis and won one in 2013 in tandem with Bill Petti. He has won three FSWA Awards, including one for his MASH series. In his first two seasons in Tout Wars, he's won the H2H league and mixed auction league. Follow him on Twitter @jeffwzimmerman.
Rodney Strong

Has somebody out there tackled the issues Jeff mentioned with a clean amateur data set? If not, can someone start this as a collaborative public project?

Peter Jensen

Several people have created voluminous records of amateur data, but all that I know of are proprietary. So go ahead and get started on creating a clean data set of amateur data for public use. Everyone will appreciate it and some may even help you. If you need or want the data you get to be the “someone” who starts the project instead of expecting someone else to do it for you.

Klubot3000

One way to increase the data set is to include summer stats for the players. They’d probably have to be regressed to league averages because of the switch to wood bats and the huge variation in league quality (think the Cape v. Jayhawk leagues) but still could provide an interesting addition to the values.