Tool: Basically Every Hitting Stat Correlation
By popular demand (OK, one guy asked for it), my first offering in the re-envisioned THT is this batting version of my pitching statistic correlation tool (newer version here). This tool will allow you to see, both graphically and in terms of a correlation figure, how any two of FanGraphs’ batting statistics collectively relate to each other. With it, you’ll even be able to compare a group of players’ stats in one year to their stats in a different year. Comparing a stat in Year 0 to a stat in Year 1, for example, is a good way to gauge how predictive the first stat could possibly be of the other (just remember, correlation does not necessarily imply causation).
Without further ado, here’s the tool:
The white cells you see in the tool are the ones that you should be playing with. The Statistic and Year cells can be changed either by drop-down lists or by typing the name of the statistic directly (in the web app, it should help you narrow your choices when you start typing). Data should be entered directly into the other white cells.
As for the filters, the default setting considers a batter’s season only if they have 300 or more plate appearances in that season; you can set that as low as 100 PA, or higher if you’d like. The default year range is 2007-2013, which can also be changed. Keep in mind, though, that these years define the range of Year 0s, so you should have Stat 1 set to Year 0, or else you’ll be excluding some data you probably didn’t mean to. “Year 0” means the present season, “Year 1” the next season, and “Year -1” the previous season. The three filter categories at the bottom each have drop-down lists, allowing you to simultaneously filter by three extra statistics of your choosing.
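To make the Year 0 / Year 1 idea concrete, here is a minimal Python sketch of how player-seasons could be paired up before correlating them. The row layout, the `pair_seasons` name, and the choice to apply the PA minimum to both seasons are all assumptions for illustration; the tool's actual spreadsheet logic may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical row layout: (player, season, plate appearances, stat value).
Row = Tuple[str, int, int, float]


def pair_seasons(rows: List[Row], offset: int = 1, min_pa: int = 300,
                 first_year: int = 2007, last_year: int = 2013
                 ) -> List[Tuple[float, float]]:
    """Pair each qualifying Year 0 value with the same player's value
    `offset` seasons later (or earlier, if offset is negative)."""
    # Assumption: the PA minimum applies to both seasons in a pair.
    qualified: Dict[Tuple[str, int], float] = {
        (player, season): value
        for player, season, pa, value in rows
        if pa >= min_pa
    }
    pairs = []
    for (player, season), value in qualified.items():
        # The year range restricts Year 0 only, as described above.
        if not first_year <= season <= last_year:
            continue
        other = qualified.get((player, season + offset))
        if other is not None:
            pairs.append((value, other))
    return pairs


# Made-up sample data, not real player seasons.
sample_rows = [
    ("A", 2012, 500, 0.300),
    ("A", 2013, 450, 0.320),
    ("B", 2012, 120, 0.250),  # below the PA minimum, so never paired
    ("B", 2013, 600, 0.310),
]
next_year_pairs = pair_seasons(sample_rows, offset=1)
prev_year_pairs = pair_seasons(sample_rows, offset=-1)
```

Only player A's 2012 and 2013 seasons survive as a pair here, which is the kind of silent data loss the filters can cause.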
A quick refresher on correlations: they range between -1 and 1. A correlation of 1 means that when one stat goes up, so does the other, in a straight line on a graph like the type you see above. Correlate a stat to itself in the same year and you’ll see a correlation of 1; for something more useful, try to correlate same-year OPS and wOBA – it should be 0.993, and pretty dang close to a straight line.
A correlation of -1 should also appear as a straight line, except sloping downward instead of upward; the two stats move in opposite directions. You’d get this if you correlated a stat to the negative of itself, for some strange reason. For something more practical, try K% vs. Contact% in the same year, which should come in at a very strong -0.888.
A correlation of 0 suggests that there’s probably no relationship between the two stats, although it is possible for there to be an interesting relationship that escapes the correlation calculation. The graph will be harder to fool, however, so you may want to keep an eye out for strange patterns you see on it.
The Confidence Level box can also be changed. By default, it’s set to provide the estimated boundaries within which the true correlation is 95% likely to lie. You’ll see this below it.
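For anyone who wants to see the arithmetic, the sketch below computes a Pearson correlation and an approximate confidence interval for it. The interval uses the Fisher z-transformation, which is a common textbook choice and an assumption here; the article doesn't say which method the tool uses. The sample size of 1,000 is also made up.

```python
import math
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_interval(r, n, level=0.95):
    """Approximate confidence interval for a correlation r from n pairs,
    via the Fisher z-transformation (assumed method, not the tool's)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


up = pearson_r([1, 2, 3], [2, 4, 6])        # perfectly rising line
down = pearson_r([1, 2, 3], [3, 2, 1])      # perfectly falling line
# A clear U-shaped relationship that the correlation misses entirely:
u_shape = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
low, high = fisher_interval(-0.386, n=1000)  # n is a hypothetical sample size
```

The U-shaped example is the sort of pattern the graph will show you even when the correlation figure comes out at zero.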
An Exercise in Batted Ball and BABIP Correlational Analysis
By default, you’ll see a comparison on the tool between batters’ PU% in one year and their BABIP in the next. PU%, if you’re confused, is Pop-Up percentage, my unofficial name for infield fly balls per batted ball (batted ball being defined as FB+LD+GB), as opposed to the official stat IFFB%, which is infield fly balls per fly ball. What you’ll notice is that PU% does indeed appear to be fairly predictive of BABIP, in that batters who pop the ball up a lot in one year will tend to have a low BABIP in the next (the correlation is -0.386 in the default sample). Makes sense, right? Of course, it helps a lot that PU% is a fairly predictable stat, with a year-to-year correlation around 0.638, as you can see. For comparison, LD%—line drives per batted ball—has only a 0.366 YTY correlation, while BABIP’s is 0.370. To summarize:
Correlation with BABIP in Year:

Statistic | 0 (Same Year) | 1 (Next Year) | YTY Correlation (with itself) |
---|---|---|---|
PU% | -0.468 | -0.386 | 0.638 |
OFFB% | -0.262 | -0.213 | 0.754 |
LD% | 0.418 | 0.187 | 0.366 |
GB% | 0.192 | 0.226 | 0.788 |
FB% | -0.356 | -0.288 | 0.789 |
IFFB% | -0.416 | -0.350 | 0.555 |
BABIP | 1 | 0.370 | 0.370 |
So, although LD% is a significant factor in same-season BABIP, its relative unpredictability makes it a much less reliable indicator of true-talent BABIP skills than PU%. This is also the case with pitchers, whose BABIPs are of course even less predictable.
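Since PU% and IFFB% are easy to mix up, here is a small sketch of the definitions given above: PU% divides infield flies by all batted balls (FB+LD+GB), while IFFB% divides them by fly balls only. The function name and the counts are made up for illustration.

```python
def batted_ball_rates(gb: int, ld: int, fb: int, iffb: int) -> dict:
    """Batted ball rates, where `fb` already includes the infield flies."""
    batted_balls = gb + ld + fb
    return {
        "GB%": gb / batted_balls,
        "LD%": ld / batted_balls,
        "FB%": fb / batted_balls,
        "PU%": iffb / batted_balls,  # pop-ups per batted ball
        "IFFB%": iffb / fb,          # pop-ups per fly ball
    }


# Hypothetical season: 500 batted balls, 18 of them infield flies.
rates = batted_ball_rates(gb=220, ld=100, fb=180, iffb=18)
```

One consequence of the two definitions is that PU% equals IFFB% times FB%, so a batter with few fly balls can have a high IFFB% and still a low PU%.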
If you’re curious, here are 2013’s relevant facts for each basic type of batted ball, straight from the league splits on FanGraphs:
Batted Ball Statistics, 2013

Type | BABIP | AVG | SLG | ISO | wOBA |
---|---|---|---|---|---|
Line Drives | 0.683 | 0.685 | 0.878 | 0.193 | 0.681 |
Ground Balls | 0.232 | 0.232 | 0.250 | 0.018 | 0.213 |
Fly Balls | 0.124 | 0.213 | 0.616 | 0.403 | 0.346 |
The low BABIP of fly balls in general might lead you to believe they are less desirable for a hitter than ground balls. Don’t forget, though, that home runs are excluded from consideration in BABIP, meaning the batting average of a power-hitting fly ball hitter probably isn’t going to suffer as much as you might think. Clearly line drives get the best results, being low-risk with very high rewards. Meanwhile, ground balls are medium-risk, low-reward, and fly balls are high-risk, high-reward; on average, though, FBs are preferable to GBs, as wOBA demonstrates. That’s not even taking into account the increased risk of double plays that comes with ground balls.
As a little bonus, here’s something I queried off of FanGraphs’ top-secret database: a more in-depth breakdown that uses more distinct batted ball types:
Batted Ball Statistics, 2013 (Detailed Batted Ball Types)

Type | BABIP | AVG | SLG | ISO | wOBA | 1B% | 2B% | 3B% | HR% |
---|---|---|---|---|---|---|---|---|---|
IFFB | 0.004 | 0.004 | 0.005 | 0.001 | 0.004 | 0.3% | 0.1% | 0.0% | 0.0% |
OFFB | 0.049 | 0.155 | 0.531 | 0.376 | 0.288 | 0.8% | 2.8% | 0.7% | 11.1% |
FlinerF | 0.280 | 0.362 | 0.889 | 0.528 | 0.530 | 7.7% | 15.5% | 1.5% | 11.4% |
FlinerL | 0.627 | 0.631 | 0.870 | 0.240 | 0.652 | 42.9% | 17.5% | 1.6% | 1.1% |
LD | 0.746 | 0.746 | 0.883 | 0.138 | 0.715 | 61.4% | 12.6% | 0.6% | 0.0% |
GB | 0.232 | 0.232 | 0.250 | 0.018 | 0.213 | 21.5% | 1.6% | 0.1% | 0.0% |
In this classification system, the two types of “Fliners” are somewhere between fly balls and line drives, and there’s no overlap between the classifications. Relating these to what you see on FanGraphs: IFFB, OFFB, and FlinerF are all counted towards FB, while FlinerL and LD are counted towards LD.
Here, OFFBs are the really high outfield flies which—if they don’t clear the fences—are going to be caught 95.1% of the time. But home runs do occur on 11.1% of these high outfield flies, so you can’t discount them. Remember that these numbers are just averages; for a powerless batter, OFFBs are likely going to be a really bad thing; for a power hitter, they might actually be good. And try not to be confused—in this article’s correlation tool, FlinerFs are included as part of “OFFB.” I’m just not sure if it’s alright for me to let the details of this system out of the bag, unfortunately.
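As a sanity check on how AVG, BABIP, and HR% hang together in the detailed table, the sketch below backs a home run rate out of AVG and BABIP alone. It assumes there are no strikeouts among batted balls and ignores sacrifice flies, so that BABIP = (H - HR)/(AB - HR); solving for HR/AB gives (AVG - BABIP)/(1 - BABIP). That is a simplification of the real BABIP formula, and the table's HR% is presumably per batted ball rather than per at-bat, so expect agreement only to within rounding.

```python
def implied_hr_rate(avg: float, babip: float) -> float:
    """Home runs per at-bat implied by AVG and BABIP, assuming no
    strikeouts and no sacrifice flies (a simplifying assumption)."""
    return (avg - babip) / (1.0 - babip)


# AVG and BABIP values copied from the 2013 detailed table above.
offb_hr = implied_hr_rate(avg=0.155, babip=0.049)     # table HR%: 11.1%
fliner_f_hr = implied_hr_rate(avg=0.362, babip=0.280)  # table HR%: 11.4%
fliner_l_hr = implied_hr_rate(avg=0.631, babip=0.627)  # table HR%: 1.1%

# Share of in-park OFFBs that are caught: 1 minus their BABIP.
offb_caught = 1.0 - 0.049
```

The implied rates land within a few tenths of a percentage point of the listed HR% values, and the caught share matches the 95.1% figure.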
OK, now forget I mentioned all that stuff about fliners, because I’m going to be referring to the standard FanGraphs batted ball classifications from now on.
Back to BABIP: the main point of it is not to directly value a player, but to be an indicator of how lucky the player was. Skill does come into play, however, especially in the case of batters. But let’s take a look at how batted ball types correlate with a bonus stat I added into the correlation tool: Hits/Batted Ball (let’s call it H/BatBall for short), which is hits divided by the sum of fly balls, line drives, and ground balls.
H/BatBall Correlations

Statistic | H/BatBall, 0 (Same Year) | H/BatBall, 1 (Next Year) | YTY Correlation (with itself) | BABIP, 0 (Same Year) | BABIP, 1 (Next Year) |
---|---|---|---|---|---|
PU% | -0.343 | -0.265 | 0.638 | -0.468 | -0.386 |
OFFB% | 0.006 | 0.030 | 0.754 | -0.262 | -0.213 |
LD% | 0.289 | 0.104 | 0.366 | 0.418 | 0.187 |
GB% | -0.034 | 0.004 | 0.788 | 0.192 | 0.226 |
FB% | -0.089 | -0.046 | 0.789 | -0.356 | -0.288 |
IFFB% | -0.370 | -0.296 | 0.555 | -0.416 | -0.350 |
BABIP | 0.894 | 0.315 | 0.370 | 1.000 | 0.370 |
H/BatBall | 1.000 | 0.420 | 0.420 | 0.894 | 0.315 |
HR/FB | 0.466 | 0.323 | 0.706 | 0.075 | 0.038 |
So, with home runs back in the equation, most of the predictiveness of the batted ball types (when it comes to the chance of getting a hit on a batted ball) completely disappears. Except for popups and maybe line drives (a little bit), that is. Also notice that HR/FB, while apparently useless for BABIP, is an important predictor of next-year H/BatBall. Not surprisingly, HR/FB is also a good predictor of next-year wOBA (0.444 correlation).
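The difference between the two stats comes down to what happens with home runs. Here is a quick sketch using the standard BABIP formula, (H - HR)/(AB - K - HR + SF), next to H/BatBall as defined above. The season line is invented for illustration.

```python
def babip(h: int, hr: int, ab: int, k: int, sf: int) -> float:
    """Standard BABIP: home runs are removed from both hits and chances."""
    return (h - hr) / (ab - k - hr + sf)


def hits_per_batted_ball(h: int, fb: int, ld: int, gb: int) -> float:
    """H/BatBall: all hits, home runs included, per FB+LD+GB."""
    return h / (fb + ld + gb)


# Hypothetical power hitter's season (made-up numbers).
season = dict(ab=550, h=160, hr=30, k=110, sf=5, fb=180, ld=95, gb=170)

season_babip = babip(season["h"], season["hr"], season["ab"],
                     season["k"], season["sf"])
season_h_batball = hits_per_batted_ball(season["h"], season["fb"],
                                        season["ld"], season["gb"])
```

For this made-up hitter, the 30 home runs lift H/BatBall well above BABIP, which is why a stat like HR/FB can matter for one and not the other.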
There are some interesting interactions here that take a multiple regression to weed out, though. Remember how I just said HR/FB is apparently useless for BABIP? Regression begs to differ; it outputs a formula for expected next-year BABIP of:
xBABIP = 0.083*HR/FB + 0.1*LD% – 0.55*PU% – 0.013*OFFB% + 0.007*Spd*GB% + 0.283
This formula has a 0.437 correlation with next-season BABIP, and 0.573 with same-season BABIP. More details on the factors:
Predictive Factors of BABIP

Statistic | Coefficient | Std Error | t Stat | P-value | Lower 50% | Upper 50% | Lower 95% | Upper 95% |
---|---|---|---|---|---|---|---|---|
Intercept | 0.283 | 0.011 | 24.758 | 6.50E-110 | 0.275 | 0.290 | 0.260 | 0.305 |
LD% | 0.100 | 0.034 | 2.932 | 0.003432 | 0.077 | 0.123 | 0.033 | 0.167 |
PU% | -0.546 | 0.053 | -10.325 | 5.14E-24 | -0.582 | -0.510 | -0.650 | -0.442 |
Spd*GB% | 0.007 | 0.001 | 6.373 | 2.63E-10 | 0.007 | 0.008 | 0.005 | 0.010 |
OFFB% | -0.013 | 0.019 | -0.666 | 0.505428 | -0.025 | 0.000 | -0.050 | 0.025 |
HR/FB | 0.083 | 0.017 | 4.866 | 1.29E-06 | 0.072 | 0.095 | 0.050 | 0.117 |
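If the columns in that table look cryptic, this sketch shows roughly where they come from: the t stat is the coefficient divided by its standard error, and each interval is the coefficient plus or minus a critical value times the standard error. It uses a normal critical value as an approximation (an assumption; regression software typically uses a t distribution, which is nearly identical for a large sample), and it works from the rounded values printed in the table, so results match only to about the third decimal.

```python
from statistics import NormalDist


def coefficient_interval(coef: float, std_err: float, level: float):
    """Approximate confidence interval: coef +/- z * std_err."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef - z * std_err, coef + z * std_err


# Rounded values copied from the LD% and PU% rows of the table above.
ld_t_stat = 0.100 / 0.034
ld_50 = coefficient_interval(0.100, 0.034, level=0.50)
ld_95 = coefficient_interval(0.100, 0.034, level=0.95)
pu_95 = coefficient_interval(-0.546, 0.053, level=0.95)
```

An interval that straddles zero, like the 95% one for OFFB% in the table, is the signal that the factor may not matter at all.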
Translation: OFFB% probably doesn’t matter, but the other factors pretty certainly do, especially PU%, followed by Spd*GB% (well, Spd itself works almost as well, leaving GB% out entirely), then HR/FB, then LD%. So, you can cut out OFFB% to make:
xBABIP = 0.08*HR/FB + 0.1*LD% – 0.56*PU% + 0.008*Spd*GB% + 0.278
…which is practically equally good, with a 0.436 correlation to next-year BABIP.
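Both formulas are simple enough to write as functions. In the sketch below, the coefficients are copied from the two equations above, while the function names and the input units are assumptions: the rate stats are entered as fractions (0.21 rather than 21), all per batted ball except HR/FB, and Spd is the raw FanGraphs speed score. The article doesn't spell out the units, so check them against your own data. The example batter is invented.

```python
def xbabip_full(hr_fb, ld, pu, offb, spd, gb):
    """Expected next-year BABIP, full version including OFFB%."""
    return (0.083 * hr_fb + 0.1 * ld - 0.55 * pu
            - 0.013 * offb + 0.007 * spd * gb + 0.283)


def xbabip_short(hr_fb, ld, pu, spd, gb):
    """Expected next-year BABIP, simplified version without OFFB%."""
    return 0.08 * hr_fb + 0.1 * ld - 0.56 * pu + 0.008 * spd * gb + 0.278


# Hypothetical batter: rates as fractions (assumed units), Spd of 4.5.
batter = dict(hr_fb=0.10, ld=0.21, pu=0.03, offb=0.32, spd=4.5, gb=0.44)

full_estimate = xbabip_full(**batter)
short_estimate = xbabip_short(batter["hr_fb"], batter["ld"], batter["pu"],
                              batter["spd"], batter["gb"])
```

For this batter the two versions differ by less than a point of BABIP, in line with their nearly identical correlations.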
It might also be a good idea to add current BABIP itself to the equation, to possibly help capture that certain je ne sais quoi about a batter’s BABIP, if simply predicting the next year is the goal. Handedness is likely significant as well. But I’ll save that for another time.
Well, hopefully I’ve given you all enough to play with and to think about for today. Tell us in the comments if you find out something interesting from your experiments!
Good God.
Sorry and/or thank you?
“Good God” = what a non-rocket scientist says after hearing a rocket scientist speak. LOL
I’m really happy I stumbled upon this. The examples you provided are EXACTLY what I was gathering info on to figure out the past couple of days. I expected I’d have to do the work myself, but you did a lot of it for me and gave me a tool to do the rest. It made me say Good God too, but I’ll get all of what you said there figured out eventually. Thanks! 🙂
Ha, OK, thank you. Glad you enjoy.
Plotting OBP vs Age, same year, I don’t see any evidence that performance declines with age. The highest OBP on the chart is someone over 40. What stat should I be looking at to see the alleged decline with age that people seem to always talk about?
That point you’re referring to is Barry Bonds in 2007, FYI — 42, with a 0.480 OBP. So, a bit of an outlier there.
Yes, you’re right: there appears to be little to no overall correlation between age and OBP. But you have to keep survivorship bias in mind; the players still getting a significant number of PAs in their late 30s and beyond tend to be doing so because they’re still pretty good. All the players who had to retire or got too few PAs to qualify aren’t accounted for here.
What is probably a more appropriate basis for that sort of analysis with this tool is to look at OBP in year 0 vs. OBP in all the surrounding years, while using age as a filter; that way, you’re looking at the performance of individual batters over time, rather than the overall characteristics of all the players in a given year. It’s kind of subtle, but if you set the age filter to, say, 30-50, then you’ll see the slope of the bottom left regression equation on the graph will be less than one for future years (meaning future OBP will be lower) and greater than one for previous years (meaning OBP was greater in previous years).
Better yet would be to download the spreadsheet, insert a column next to OBP in the ‘Data’ sheet, and divide the player’s OBP by the league OBP in that season (by creating a table that contains the league OBP in each year and doing a VLOOKUP on it). OBP relative to league average would make a better basis for the comparisons than OBP itself, as OBP has been on the decline since 2007 (probably due to increasing K rates, mainly).
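Outside the spreadsheet, the same league-relative adjustment is just a lookup and a division. In this sketch the league OBP figures are round placeholders, not official values, and the function name is made up; swap in the real league numbers for each season.

```python
# Placeholder league OBP by season (illustrative values, not official).
LEAGUE_OBP = {2007: 0.330, 2013: 0.320}


def relative_obp(obp: float, season: int, league_obp=LEAGUE_OBP) -> float:
    """Player OBP divided by league OBP for that season (1.0 = average).
    The dictionary lookup plays the role of the VLOOKUP."""
    return obp / league_obp[season]


# The 0.480 OBP from 2007 mentioned above, against the placeholder average.
outlier = relative_obp(0.480, 2007)
average_2013 = relative_obp(0.320, 2013)
```

The relative figures can then be correlated year to year in place of raw OBP.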
This is gold. Absolute gold. After building financial models and regression tools for my firm for the last 2 years I’ve wanted so bad to have one at my disposal for baseball but didn’t have the dataset/willpower after 5pm to get it done. THANK YOU for doing this. I’m already overturning my previous-thought assumptions.
You’re very welcome! The data all comes from fangraphs’ custom leaderboards, by the way.
Steve,
Phenomenal work. A clarification, please: is HR/FB based on all FBs or just OFFB (so, excluding IFFB)? Also, are total season speed scores published anywhere for all players?
Thanks.
Thank you!
HR/FB is a standard batted ball stat on FanGraphs, and it is based on all fly balls (though HR/OFFB might make more sense): http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=2&season=2013&month=0&season1=2013&ind=0&team=0&rost=0&age=0&filter=&players=0
Spd is published under ‘Advanced’: http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=1&season=2013&month=0&season1=2013&ind=0&team=0&rost=0&age=0&filter=&players=0
Thanks for the clarification. I agree – it should be HR/OFFB. Using total FB probably double counts infield flies since their negative impact should already be accounted for in the PU category. OK, so a new homework for you ??
OK, I just now stuck both OFFB% and HR/OFFB stats into the Data sheet. Please let me know if you notice that I broke something in doing so.
Where’s the data sheet ?
See the tab at the bottom — between ‘Main’ and ‘Calcs’? That’s where you can add whatever stats you want, and they’ll then show up as options in the drop-down lists on the ‘Main’ tab.
Is there any way to include Jeff Zimmerman’s fly ball distance (from baseballheatmaps.com) into these correlation tools (both pitcher and hitter)?
Great idea! I stuck ‘FB Distance’ and ‘FB Angle’ in this one just now, with what I could gather from Jeff’s site. There’s some missing data there, but Jeff is going to send me his data when he gets the chance, and I’ll update it.
OK, Jeff was kind enough to send me his official fly ball distance data for batters today. I’ve added the following:
FB Distance: the average distance of the batter’s fly balls and home runs
FB Dist +1.5 stdev: 1.5 of the batter’s standard deviations above their mean fly ball distance, a.k.a. the 93.3rd percentile of their fly ball distance (assuming a normal distribution). In other words, this is the theoretical borderline past which 6.7% of their fly balls should travel. It’s kind of arbitrary, but is an indicator of how far the ball might go when they really hit it well.
FB Angle: the average angle of the batter’s fly balls, with -45 being the left field line, 0 being dead center, and 45 being the right field line.
FB Angle (abs): the average of the absolute values of a batter’s FB Angle. A batter low in this stat tends to hit fly balls more towards center field.
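Here is a rough sketch of how those four stats could be computed from a batter's individual fly balls. The function names and sample data are made up, and using the sample standard deviation is an assumption (the data source may use the population version).

```python
from statistics import NormalDist, mean, stdev


def fb_distance_stats(distances):
    """Return (FB Distance, FB Dist +1.5 stdev) for a list of distances."""
    avg = mean(distances)
    return avg, avg + 1.5 * stdev(distances)  # sample stdev (assumed)


def fb_angle_stats(angles):
    """Return (FB Angle, FB Angle (abs)); -45 is the left field line,
    0 is dead center, and 45 is the right field line."""
    return mean(angles), mean([abs(a) for a in angles])


# Hypothetical fly balls: distances in feet, angles in degrees.
avg_dist, far_dist = fb_distance_stats([270, 290, 310, 330])
avg_angle, avg_abs_angle = fb_angle_stats([-30, -10, 10, 30])

# Share of a normal distribution below +1.5 standard deviations.
share_below = NormalDist().cdf(1.5)
```

Note how the made-up batter sprays the ball evenly to both sides, so FB Angle averages out to dead center while FB Angle (abs) shows the spread.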
http://www.baseballheatmaps.com/graph/battedballdist.php
http://www.baseballheatmaps.com/graph/leaderboard.php
Interesting, but at age 88 I do not comprehend it. The most important stat to me is the Game-Winning RBI, the one that put the team ahead to stay. Or it could be broken down to “RBIs That Put His Team in a Tie or Ahead.”
–Stay tuned.
Hey Steve, I just want to throw in my two cents and say this is terrific. Thanks for making available to everyone (and for being so responsive to comments).
Thanks Dave!
Nice article! Stat question, though: when you created these formulas at the end, did you do any kind of forward or backward selection? Meaning, were all these predictor variables reasonably independent of the other predictors? I was curious why HR/FB would be included in the model when it showed very little correlation earlier. Would it be correct to assume that while the correlation was low, it accounted for a unique part of the BABIP variance? Thanks!
Thank you! Great question. Well, my decision to try HR/FB out in the regression was largely based on the intuition that power would probably make a difference to BABIP, with HR/FB being one of the better proxies for power that are available (though I might want to try one of the newly-added fly ball distance stats instead).
I believe the explanation for why HR/FB apparently is predictive of BABIP in the multiple regression despite no direct correlation is this:
HR/FB’s correlation vs. next year’s…
GB%: -0.32
Spd: -0.29
LD%: -0.15
FB%: 0.36
So, it’s the fact that the big home run hitters tend to be slower and hit more low-BABIP-type batted balls (more FB, fewer GB and LD) that hides the fact that power in and of itself is actually beneficial to BABIP. If you can combine speed and power (e.g. Mike Trout), your BABIP will likely be pretty high.
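This kind of masking can be shown with a tiny toy data set. The sketch below is not the real data: the numbers are constructed (in arbitrary standardized units) so that "power" helps the outcome, "speed" helps too, and power and speed are negatively related. The plain correlation between power and the outcome comes out at exactly zero, yet a two-predictor least-squares fit recovers a positive power coefficient. All names and values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_two_predictors(x1, x2, y):
    """Least-squares slopes for y ~ x1 + x2 (with intercept), solved
    from the 2x2 normal equations on mean-centered data."""
    n = len(y)
    c1 = [v - sum(x1) / n for v in x1]
    c2 = [v - sum(x2) / n for v in x2]
    cy = [v - sum(y) / n for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


# Toy data built so that outcome = 1*power + 2*speed exactly.
power = [-1.0, -1.0, 1.0, 1.0]
speed = [-0.5, 1.5, -1.5, 0.5]
outcome = [p + 2 * s for p, s in zip(power, speed)]

direct = pearson_r(power, outcome)            # power looks useless alone
power_vs_speed = pearson_r(power, speed)      # sluggers are slower here
b_power, b_speed = fit_two_predictors(power, speed, outcome)
```

Once speed is accounted for, the power coefficient is clearly positive, which is the same pattern as HR/FB in the BABIP regression.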
Stephen,
I ran your xBABIPs on last year’s data and they all seem very low. For example, Miggy was at .300 even though his BABIP last year was .356. The old xBABIP formula (listed below) is much closer at .344. If I add .330 at the end of your equation (vs. .283/.278, depending on which equation in your post), it outputs .347, which is much closer to his actual BABIP and old xBABIP (as well as everyone else’s).
Am I missing something, or did you come up with the same results?
Thanks!
old xBABIP formula:
xBABIP = 0.392 + (LD% x 0.287709436) + ((GB% – (GB% * IFH%)) x -0.152 ) + ((FB% – (FB% x HR/FB%) – (FB% x IFFB%)) x -0.188) + ((IFFB% * FB%) x -0.835) + ((IFH% * GB%) x 0.500)
I still love this. A lot.
Well, I’m not sure if I am happy or not that I stumbled across this site. If staying on a site for more than three hours at a time is good, then I’m guilty.
Just a question I hope someone can answer: In terms of all the stats presented here and on fangraphs, which 10 or 15 (from first to 15th) are the most accurate predictors on how a hitter will do from day to day? I am doing research and have come across various answers. Some say success against a certain pitcher is very important, others say it’s too small a sample size (even with >50 at bats) to rely on.
What I plan to do, once I figure out the most to least important stats for determining the “success” of a batter, is to multiply the stats by weighted factors. The most influential stats will weigh more heavily in the final factor, and the lower rungs will have less input.
Any response would be greatly appreciated. Thanks for a great resource!
J
Bump to my previous comment…
Also, is there a way to correlate batters’ stats vs. PITCHERS’ stats? The correlation of the various batters’ stats with each other is fascinating, but one wonders which pitchers’ stats correlate best with, or best predict, the various hitters’ stats.
Thank you for a fabulous resource.
J
This doesn’t seem to be working anymore. I love the tool, could someone find a way around the 5Mb limit?