Tool: Basically Every Hitting Stat Correlation

With this tool, you have the ability to compare any hitting stat in the field of play (via Joel Kramer).

By popular demand (OK, one guy asked for it), my first offering in the re-envisioned THT is this batting version of my pitching statistic correlation tool (newer version here). This tool will allow you to see, both graphically and in terms of a correlation figure, how any two of FanGraphs’ batting statistics collectively relate to each other. With it, you’ll even be able to compare a group of players’ stats in one year to their stats in a different year. Comparing a stat in Year 0 to a stat in Year 1, for example, is a good way to gauge how predictive the first stat could possibly be of the other (just remember, correlation does not necessarily imply causation).

Without further ado, here’s the tool:

The white cells you see in the tool are the ones that you should be playing with. The Statistic and Year cells can be changed either by drop-down lists or by typing the name of the statistic directly (in the web app, it should help you narrow your choices when you start typing). Data should be entered directly into the other white cells.

As for the filters, the default setting considers a batter’s season only if he had 300 or more plate appearances in that season; you can set that as low as 100 PA, or higher if you’d like. The default year range is 2007-2013, which can also be changed. Keep in mind that these years define the range of Year 0s, so you should have Stat 1 set to Year 0, or else you’ll be excluding some data you probably didn’t mean to. “Year 0” means the present season, “Year 1” the next season, and “Year -1” the previous season. The three filter categories at the bottom each have drop-down lists, allowing you to simultaneously filter by three extra statistics of your choosing.
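For anyone who wants to reproduce the Year 0 / Year 1 pairing outside the tool, here is a minimal Python sketch. The row layout, the function name `pair_seasons`, and the choice to apply the PA minimum to both seasons are my assumptions, not a description of how the tool is built; the sample rows are made up purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical row layout: (player, season, plate appearances, stat value)
Row = Tuple[str, int, int, float]

def pair_seasons(rows: List[Row], offset: int = 1, min_pa: int = 300,
                 first_year: int = 2007, last_year: int = 2013) -> List[Tuple[float, float]]:
    """Pair each qualifying Year 0 value with the same player's Year 0+offset value."""
    # Keep only seasons that meet the PA minimum (assumed to apply to both years).
    stat_by_season: Dict[Tuple[str, int], float] = {
        (player, season): value
        for player, season, pa, value in rows
        if pa >= min_pa
    }
    pairs = []
    for (player, season), value in stat_by_season.items():
        # The year range restricts Year 0 only; Year 1 may fall outside it.
        if not first_year <= season <= last_year:
            continue
        other = stat_by_season.get((player, season + offset))
        if other is not None:
            pairs.append((value, other))
    return pairs

# Made-up rows for two fictional players, for illustration only.
sample_rows = [
    ("A", 2012, 600, 0.300),
    ("A", 2013, 580, 0.310),
    ("A", 2014, 610, 0.290),
    ("B", 2013, 150, 0.280),  # dropped: under 300 PA
    ("B", 2014, 500, 0.320),
]
sample_pairs = pair_seasons(sample_rows)
```

With these rows, player A contributes two pairs (2012 to 2013 and 2013 to 2014), while player B contributes none because his only in-range Year 0 falls short of the PA minimum.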

A quick refresher on correlations: they range between -1 and 1. A correlation of 1 means that when one stat goes up, so does the other, in a straight line on a graph like the type you see above. Correlate a stat to itself in the same year and you’ll see a correlation of 1; for something more useful, try to correlate same-year OPS and wOBA – it should be 0.993, and pretty dang close to a straight line.

A correlation of -1 should also appear as a straight line, except ninety degrees off from a correlation of 1; the two stats move in opposite directions. You’d get this if you correlated a stat to the negative of itself, for some strange reason. For something more practical, try K% vs. Contact% in the same year, which should come in at a very strong -0.888.

A correlation of 0 suggests that there’s probably no linear relationship between the two stats, although it is possible for there to be an interesting relationship that escapes the correlation calculation. The graph will be harder to fool, however, so you may want to keep an eye out for strange patterns you see on it.
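The three cases above can be sketched in a few lines of Python. This is a plain Pearson correlation written out by hand (the function name `pearson` is mine), run on toy numbers rather than real batting data; the last case shows a perfect curved relationship that the correlation figure misses entirely.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]
same = pearson(xs, xs)                    # a stat against itself
negated = pearson(xs, [-x for x in xs])   # a stat against its negative
curved = pearson(xs, [x * x for x in xs]) # y = x^2: related, but not linearly
```

Here `same` comes out at 1, `negated` at -1, and `curved` at 0, even though the third set of points follows an exact formula. That is the kind of pattern the graph will show you and the correlation number will not.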

The Confidence Level box can also be changed. By default, it’s set to provide the estimated boundaries between which the true correlation is 95% likely to lie. You’ll see these boundaries below it.
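One common way to get boundaries like these is the Fisher z-transformation. I am offering it as an assumption about a reasonable method, not as a statement of what the tool computes internally. The sketch below takes a correlation `r`, a sample size `n`, and a confidence level; the sample size in the example is hypothetical, since the size of the default sample isn't stated here.

```python
import math
from statistics import NormalDist

def correlation_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation via Fisher's z-transform."""
    z = math.atanh(r)                  # transform r to an approximately normal scale
    std_err = 1.0 / math.sqrt(n - 3)   # standard error on the transformed scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Transform the endpoints back to the correlation scale.
    return math.tanh(z - crit * std_err), math.tanh(z + crit * std_err)

# r is the PU%-to-next-year-BABIP figure from the article; n = 1000 is a made-up sample size.
low_95, high_95 = correlation_interval(-0.386, 1000)
low_50, high_50 = correlation_interval(-0.386, 1000, confidence=0.50)
```

A higher confidence level always gives a wider interval, and a bigger sample always gives a narrower one, which is why the boundaries tighten when you lower the PA minimum and let more player seasons in.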

An Exercise in Batted Ball and BABIP Correlational Analysis
By default, you’ll see a comparison on the tool between batters’ PU% in one year and their BABIP in the next. PU%, if you’re confused, is Pop-Up percentage, my unofficial name for infield fly balls per batted ball (batted ball being defined as FB+LD+GB), as opposed to the official stat IFFB%, which is infield fly balls per fly ball. What you’ll notice is that PU% does indeed appear to be fairly predictive of BABIP, in that batters who pop the ball up a lot in one year will tend to have a low BABIP in the next (the correlation is -0.386 in the default sample). Makes sense, right? Of course, it helps a lot that PU% is a fairly predictable stat, with a year-to-year correlation around 0.638, as you can see. For comparison, LD%—line drives per batted ball—has only a 0.366 YTY correlation, while BABIP’s is 0.370. To summarize:

Correlation with BABIP in Year:

| Statistic | 0 (Same Year) | 1 (Next Year) | YTY Correlation (with itself) |
| --- | --- | --- | --- |
| PU% | -0.468 | -0.386 | 0.638 |
| OFFB% | -0.262 | -0.213 | 0.754 |
| LD% | 0.418 | 0.187 | 0.366 |
| GB% | 0.192 | 0.226 | 0.788 |
| FB% | -0.356 | -0.288 | 0.789 |
| IFFB% | -0.416 | -0.350 | 0.555 |
| BABIP | 1.000 | 0.370 | 0.370 |

So, although LD% is a significant factor in same-season BABIP, its relative unpredictability makes it a much less reliable indicator of true-talent BABIP skills than PU%. This is also the case with pitchers, whose BABIPs are of course even less predictable.
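Since PU% and IFFB% are easy to mix up, here is a small Python sketch of the definitions given above, computed from raw batted ball counts. The function name and the counts are hypothetical. I am also assuming that infield flies are already included in the FB count (as on FanGraphs) and that OFFB% is simply the remaining fly balls per batted ball.

```python
def batted_ball_rates(gb, ld, fb, iffb):
    """Batted ball rates from counts; fb is assumed to include infield flies."""
    batted = gb + ld + fb  # batted balls, defined as FB + LD + GB
    return {
        "GB%": gb / batted,
        "LD%": ld / batted,
        "FB%": fb / batted,
        "PU%": iffb / batted,            # infield flies per batted ball
        "IFFB%": iffb / fb,              # infield flies per fly ball
        "OFFB%": (fb - iffb) / batted,   # assumed: non-infield flies per batted ball
    }

# Made-up counts for a fictional batter.
rates = batted_ball_rates(gb=200, ld=100, fb=150, iffb=15)
```

One consequence of the definitions is that PU% equals IFFB% times FB%. So a batter can carry a scary IFFB% and still have a modest PU% if he rarely hits the ball in the air at all.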

If you’re curious, here are 2013’s relevant facts for each basic type of batted ball, straight from the league splits on FanGraphs:

Batted Ball Statistics, 2013

| Type | BABIP | AVG | SLG | ISO | wOBA |
| --- | --- | --- | --- | --- | --- |
| Line Drives | 0.683 | 0.685 | 0.878 | 0.193 | 0.681 |
| Ground Balls | 0.232 | 0.232 | 0.250 | 0.018 | 0.213 |
| Fly Balls | 0.124 | 0.213 | 0.616 | 0.403 | 0.346 |

The low BABIP of fly balls in general might lead you to believe they are less desirable for a hitter than a ground ball. Don’t forget, though, that home runs are excluded from consideration in BABIP, meaning the batting average of a power-hitting fly ball hitter probably isn’t going to suffer as much as you might think. Clearly line drives get the best results, being low-risk with very high-rewards. Meanwhile, ground balls are medium-risk, low reward, and fly balls are high-risk, high reward; on average, though, FBs are preferable to GBs, as wOBA demonstrates. That’s not even taking into account the increased risk of double plays that comes with ground balls.

As a little bonus, here’s something I queried off of FanGraphs’ top-secret database: a more in-depth breakdown that uses more distinct batted ball types:

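Batted Ball Statistics, 2013

| Type | BABIP | AVG | SLG | ISO | wOBA | 1B% | 2B% | 3B% | HR% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IFFB | 0.004 | 0.004 | 0.005 | 0.001 | 0.004 | 0.3% | 0.1% | 0.0% | 0.0% |
| OFFB | 0.049 | 0.155 | 0.531 | 0.376 | 0.288 | 0.8% | 2.8% | 0.7% | 11.1% |
| FlinerF | 0.280 | 0.362 | 0.889 | 0.528 | 0.530 | 7.7% | 15.5% | 1.5% | 11.4% |
| FlinerL | 0.627 | 0.631 | 0.870 | 0.240 | 0.652 | 42.9% | 17.5% | 1.6% | 1.1% |
| LD | 0.746 | 0.746 | 0.883 | 0.138 | 0.715 | 61.4% | 12.6% | 0.6% | 0.0% |
| GB | 0.232 | 0.232 | 0.250 | 0.018 | 0.213 | 21.5% | 1.6% | 0.1% | 0.0% |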

In this classification system, the two types of “Fliners” are somewhere between fly balls and line drives, and there’s no overlap between the classifications. Relating these to what you see on FanGraphs: IFFB, OFFB, and FlinerF are all counted towards FB, while FlinerL and LD are counted towards LD.

Here, OFFBs are the really high outfield flies which—if they don’t clear the fences—are going to be caught 95.1% of the time. But home runs do occur on 11.1% of these high outfield flies, so you can’t discount them. Remember that these numbers are just averages; for a powerless batter, OFFBs are likely going to be a really bad thing; for a power hitter, they might actually be good. And try not to be confused—in this article’s correlation tool, FlinerFs are included as part of “OFFB.” I’m just not sure if it’s alright for me to let the details of this system out of the bag, unfortunately.

OK, now forget I mentioned all that stuff about fliners, because I’m going to be referring to the standard FanGraphs batted ball classifications from now on.

Back to BABIP: the main point of it is not to directly value a player, but to be an indicator of how lucky the player was. Skill does come into play, however, especially in the case of batters. But let’s take a look at how batted ball types correlate with a bonus stat I added into the correlation tool: Hits/Batted Ball (let’s call it H/BatBall for short), which is hits divided by the sum of fly balls, line drives, and ground balls.

H/BatBall Correlations

| Statistic | H/BatBall, 0 (Same Year) | H/BatBall, 1 (Next Year) | YTY Correlation (with itself) | BABIP, 0 (Same Year) | BABIP, 1 (Next Year) |
| --- | --- | --- | --- | --- | --- |
| PU% | -0.343 | -0.265 | 0.638 | -0.468 | -0.386 |
| OFFB% | 0.006 | 0.030 | 0.754 | -0.262 | -0.213 |
| LD% | 0.289 | 0.104 | 0.366 | 0.418 | 0.187 |
| GB% | -0.034 | 0.004 | 0.788 | 0.192 | 0.226 |
| FB% | -0.089 | -0.046 | 0.789 | -0.356 | -0.288 |
| IFFB% | -0.370 | -0.296 | 0.555 | -0.416 | -0.350 |
| BABIP | 0.894 | 0.315 | 0.370 | 1.000 | 0.370 |
| H/BatBall | 1.000 | 0.420 | 0.420 | 0.894 | 0.315 |
| HR/FB | 0.466 | 0.323 | 0.706 | 0.075 | 0.038 |
HR/FB 0.466 0.323 0.706 0.075 0.038

So, with home runs back in the equation, most of the predictiveness of the batted ball types (when it comes to the chance of getting a hit on a batted ball) completely disappears. Except for popups and maybe line drives (a little bit), that is. Also notice that HR/FB, while apparently useless for BABIP, is an important predictor of next-year H/BatBall. Not surprisingly, HR/FB is also a good predictor of wOBA (0.444 correlation with next-year wOBA).
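To make the difference between the two stats concrete, here is a short Python sketch. H/BatBall follows the definition given earlier in this article; the BABIP function uses the standard formula, (H - HR) / (AB - K - HR + SF). The function names and the season line are made up for illustration.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Standard BABIP: home runs are removed from both hits and opportunities."""
    return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)

def hits_per_batted_ball(hits, fly_balls, line_drives, ground_balls):
    """H/BatBall: all hits, home runs included, per batted ball (FB + LD + GB)."""
    return hits / (fly_balls + line_drives + ground_balls)

# Made-up season line for a fictional power hitter.
example_babip = babip(hits=160, home_runs=30, at_bats=550, strikeouts=120, sac_flies=5)
example_h_per_bb = hits_per_batted_ball(hits=160, fly_balls=170, line_drives=90, ground_balls=175)
```

For this invented line, BABIP works out to about .321 and H/BatBall to about .368. The gap between the two comes entirely from the 30 home runs, which is why HR/FB matters so much more for H/BatBall than it appears to for BABIP.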

There are some interesting interactions here that take a multiple regression to weed out, though. Remember how I just said HR/FB is apparently useless for BABIP? Regression begs to differ; it outputs a formula for expected next-year BABIP of:

xBABIP = 0.083*HR/FB + 0.1*LD% – 0.55*PU% – 0.013*OFFB% + 0.007*Spd*GB% + 0.283

This formula has a 0.437 correlation with next-season BABIP, and 0.573 with same-season BABIP. More details on the factors:

Predictive Factors Of BABIP

| Statistic | Coefficient | Std Error | t Stat | P-value | Lower 50% | Upper 50% | Lower 95% | Upper 95% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | 0.283 | 0.011 | 24.758 | 6.50E-110 | 0.275 | 0.290 | 0.260 | 0.305 |
| LD% | 0.100 | 0.034 | 2.932 | 0.003432 | 0.077 | 0.123 | 0.033 | 0.167 |
| PU% | -0.546 | 0.053 | -10.325 | 5.14E-24 | -0.582 | -0.510 | -0.650 | -0.442 |
| Spd*GB% | 0.007 | 0.001 | 6.373 | 2.63E-10 | 0.007 | 0.008 | 0.005 | 0.010 |
| OFFB% | -0.013 | 0.019 | -0.666 | 0.505428 | -0.025 | 0.000 | -0.050 | 0.025 |
| HR/FB | 0.083 | 0.017 | 4.866 | 1.29E-06 | 0.072 | 0.095 | 0.050 | 0.117 |

Translation: OFFB% probably doesn’t matter, but the other factors pretty certainly do, especially PU%, followed by Spd*GB% (well, Spd itself works almost as well, leaving GB% out entirely), then HR/FB, then LD%. So, you can cut out OFFB% to make:
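If you want to see where the Lower and Upper values come from, the Python sketch below rebuilds them from a coefficient and its standard error. It uses a normal approximation, which is my simplification: regression software normally uses a t distribution, though with a sample this large the two should be nearly identical. The function name is mine; the inputs are the rounded PU% figures from the table.

```python
from statistics import NormalDist

def normal_interval(coefficient, std_error, level):
    """Approximate interval: coefficient plus or minus a normal critical value times the std error."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return coefficient - crit * std_error, coefficient + crit * std_error

# PU% row from the table: coefficient -0.546, standard error 0.053.
pu_low_95, pu_high_95 = normal_interval(-0.546, 0.053, 0.95)
pu_low_50, pu_high_50 = normal_interval(-0.546, 0.053, 0.50)
pu_t_stat = -0.546 / 0.053  # t Stat is just the coefficient divided by its standard error
```

For PU%, this lands within rounding of the table's 95% range (-0.650 to -0.442) and 50% range (-0.582 to -0.510). The t Stat comes out near -10.3 instead of exactly -10.325 only because the inputs here are rounded to three decimals. OFFB% is the one factor whose 95% range straddles zero, which is the formal version of "probably doesn't matter."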

xBABIP = 0.08*HR/FB + 0.1*LD% – 0.56*PU% + 0.008*Spd*GB% + 0.278

…which is practically equally good, with a 0.436 correlation to next-year BABIP.
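Both versions of the formula are easy to wrap in a function. In the Python sketch below I am assuming the rate stats go in as fractions (0.20 for a 20% line drive rate) and that Spd is the raw FanGraphs Speed Score; the article doesn't spell out the units, so treat that as my reading. The function names and the example batter are hypothetical.

```python
def xbabip_full(hr_fb, ld, pu, offb, spd, gb):
    """First formula above, including the OFFB% term. Rates assumed to be fractions."""
    return (0.083 * hr_fb + 0.1 * ld - 0.55 * pu
            - 0.013 * offb + 0.007 * spd * gb + 0.283)

def xbabip_short(hr_fb, ld, pu, spd, gb):
    """Second formula above, with OFFB% dropped. Rates assumed to be fractions."""
    return 0.08 * hr_fb + 0.1 * ld - 0.56 * pu + 0.008 * spd * gb + 0.278

# Made-up batter: 10% HR/FB, 20% LD, 4% PU, 31% OFFB, Speed Score 4.0, 45% GB.
full_estimate = xbabip_full(hr_fb=0.10, ld=0.20, pu=0.04, offb=0.31, spd=4.0, gb=0.45)
short_estimate = xbabip_short(hr_fb=0.10, ld=0.20, pu=0.04, spd=4.0, gb=0.45)
```

For this invented batter, both versions come out at about .298, which fits the point that dropping OFFB% costs next to nothing.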

It might also be a good idea to add current BABIP itself to the equation, to possibly help capture that certain je ne sais quoi about a batter’s BABIP, if simply predicting the next year is the goal. Handedness is likely significant as well. But I’ll save that for another time.

Well, hopefully I’ve given you all enough to play with and to think about for today. Tell us in the comments if you find out something interesting from your experiments!


Steve is a robot created for the purpose of writing about baseball statistics. One day, he may become self-aware, and...attempt to make money or something?
28 Comments
jim S.
10 years ago

Good God.

Vincent Jones
10 years ago

“Good God” = what a non-rocket scientist says after hearing a rocket scientist speak. LOL

I’m really happy I stumbled upon this. The examples you provided are EXACTLY what I was gathering info on to figure out the past couple of days. I expected I’d have to do the work myself, but you did a lot of it for me and gave me a tool to do the rest. It made me say Good God too, but I’ll get all of what you said there figured out eventually. Thanks! 🙂

bob
10 years ago

Plotting OBP vs Age, same year, I don’t see any evidence that performance declines with age. The highest OBP on the chart is someone over 40. What stat should I be looking at to see the alleged decline with age that people seem to always talk about?

Ben Denissen
10 years ago

This is gold. Absolute gold. After building financial models and regression tools for my firm for the last 2 years I’ve wanted so bad to have one at my disposal for baseball but didn’t have the dataset/willpower after 5pm to get it done. THANK YOU for doing this. I’m already overturning my previous-thought assumptions.

birdwatcher
10 years ago

Steve,

Phenomenal work – a clarification please. Is HR/FB based on all FBs or just OFFB (so, excluding IFFB). Also, are total season speed scores published anywhere for all players ?
Thanks.

birdwatcher
10 years ago

Thanks for the clarification. I agree – it should be HR/OFFB. Using total FB probably double counts infield flies since their negative impact should already be accounted for in the PU category. OK, so a new homework for you ??

birdwatcher
10 years ago

Where’s the data sheet ?

Daniel Brim
10 years ago

Is there any way to include Jeff Zimmerman’s fly ball distance (from baseballheatmaps.com) into these correlation tools (both pitcher and hitter)?

Grandpa Boog
10 years ago

Interesting, but at age 88 I do not comprehend it. The most important stat to me is the Game-Winning RBI, the one that put the team ahead to stay. Or it could be broken down to “RBI’s That Put His Team in a Tie or Ahead.”

–Stay tuned.

Dave Studeman
10 years ago

Hey Steve, I just want to throw in my two cents and say this is terrific. Thanks for making it available to everyone (and for being so responsive to comments).

Chris B
10 years ago

Nice article! Stat question though ; when you created these formulas at the end did you do any kind of forward or backwards selection? Meaning were all these predictor variables reasonably independent of the other predictors? I was curious why HR/FB would be included in the model if before it showed a very little correlation. Would it be correct to assume that while the correlation was low, it accounted for a unique part of the xBABIP variance? Thanks!

Dan Schwartz
10 years ago

Stephen,
I ran your xBABIP’s on last year’s data and they all seem very low. For example, Miggy was at .300 even though his BABIP last year was .356. The old xBABIP formula (listed below) is much closer at .344. If I add .330 at the end of your equation (vs. .283/.278 depending on which equation in your post) it outputs .347 which is much closer to his actual BABIP and old xBABIP (as well as everyone else’s).

Am I missing something or did you come up with the same results?

Thanks!

old xBABIP formula:
xBABIP = 0.392 + (LD% x 0.287709436) + ((GB% – (GB% * IFH%)) x -0.152 ) + ((FB% – (FB% x HR/FB%) – (FB% x IFFB%)) x -0.188) + ((IFFB% * FB%) x -0.835) + ((IFH% * GB%) x 0.500)

dang
10 years ago

I still love this. A lot.

Jesse
9 years ago

Well, I’m not sure if I am happy or not that I stumbled across this site. If staying on a site for more than three hours at a time is good, then I’m guilty.

Just a question I hope someone can answer: In terms of all the stats presented here and on fangraphs, which 10 or 15 (from first to 15th) are the most accurate predictors on how a hitter will do from day to day? I am doing research and have come across various answers. Some say success against a certain pitcher is very important, others say it’s too small a sample size (even with >50 at bats) to rely on.

What I plan to do is once I figure out the most to least important stats for determining “success” of a batter, is to multiply the stats by weighted factors. The top influential stats will bear more heavily into the final factor, and the lower rungs have less input.

Any response would be greatly appreciated. Thanks for a great resource!
J

Jesse Blattstein
9 years ago

Bump to my previous comment…

Also, is there a way to correlate batters’ stats vs. PITCHERS’ stats? The correlation between the various batters’ stats to themselves is very fascinating, but one wonders which pitchers’s stats give best correlations/predictors to various hitters’ stats.

Thank you for a fabulous resource.
J

Steve
8 years ago

This doesn’t seem to be working anymore. I love the tool, could someone find a way around the 5Mb limit?