# Controlling the strike zone and batting average

Peripheral statistics are brought up a good deal in the discussion of pitchers. Strikeouts, walks and home runs, as well as other statistics, can be converted into rate stats that give a decent picture of a pitcher’s underlying performance. Peripheral statistics are not unique to pitchers though; they are also useful when evaluating hitters.

A hitter’s walk-to-strikeout ratio, or some derivative of those components, is used to evaluate his **approach**, and it often comes up in discussions of how well a hitter “controls the strike zone.”

This idea comes up in the discussion of prospects. A prospect can have great results, but if he is striking out way more than he’s walking, that usually becomes a concern.

Will Middlebrooks, a Boston Red Sox third base prospect, was called up to the big leagues in early May after veteran third baseman Kevin Youkilis sustained an injury. At the time of Middlebrooks’ call-up, Kevin Goldstein wrote this for Baseball Prospectus about Middlebrooks’ ability to compete right away:

> The biggest challenge for Middlebrooks will be his approach. He sees far too many pitches as hittable and can expand his strike zone at times, which is a trait big league pitchers will surely exploit. The power should play immediately, but he could struggle in the batting average category.

In Middlebrooks’ first three months in the bigs (May through July), he hit for a very high batting average (.301), despite a dreadful walk-to-strikeout ratio (0.16; the major league average is 0.40). His batting average on balls in play (BABIP) was also very high (.357).

Given Goldstein’s profile and Middlebrooks’ peripherals, that batting average hardly seemed sustainable.

In 10 games in August, Middlebrooks hit at a .194 clip with a .190 BABIP, but we were unable to see if this small-sample downward trend would continue, since he broke his wrist and would finish the season on the disabled list.

Middlebrooks found surprising success, but it will be interesting to see if that success will continue with the same approach during the 2013 season.

This idea not only applies to prospects, but can in some cases be relevant when discussing hitters with a ton of major league experience.

This offseason, the top free agent hitter is outfielder Josh Hamilton, who swings as much as anyone else in baseball. In August, I wrote about how Hamilton’s plate discipline may have been the reason for his inconsistent play in 2012. I’ve heard other analysts bring up Hamilton’s plate discipline as a cause for concern in giving him a long-term deal. However, it seems to some that Hamilton may be one of the few examples of a hitter succeeding with little regard for taking pitches and controlling the strike zone. Most players cannot sustain the success that Hamilton has had with this type of approach.

If approach seems to not matter for Hamilton, how much does it matter?

Here’s what I mean by approach. A big league hitter has to go to the plate with a plan in mind. For Hamilton it may be to swing at anything near the zone, while for Youkilis it may be to take as many pitches as possible in order to draw a walk. What I’m using to define approach for this piece is simple: If a hitter has a solid idea of what he wants to do to succeed at the plate, we should see him striking out only about as often as he walks, instead of four times as often.

The variability in BABIP is a major reason we tend to see a fairly weak correlation in year-to-year batting average. I decided to find out whether walks minus strikeouts, as a simple measure of a hitter’s approach or ability to control the strike zone, does a better job of predicting batting average in the next season than batting average itself.

### The study

I took a sample of hitters (n=889) who had at least 350 plate appearances in Year X and at least 350 plate appearances in Year X+1, for the years 2007 to 2012.
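For readers who want to build a similar sample, here is a minimal sketch of the pairing step. It assumes each consecutive pair of qualifying seasons counts as one observation, which the text above does not spell out. The field names (`player`, `year`, `pa`, `avg`) and all of the rows are made up for illustration.

```python
from collections import defaultdict

MIN_PA = 350  # plate appearance threshold stated in the study


def pair_seasons(rows, min_pa=MIN_PA):
    """Return (year_x, year_x_plus_1) row pairs for consecutive qualifying seasons.

    `rows` is an iterable of dicts with hypothetical keys: player, year, pa.
    """
    by_player = defaultdict(dict)
    for row in rows:
        if row["pa"] >= min_pa:  # drop seasons below the threshold
            by_player[row["player"]][row["year"]] = row

    pairs = []
    for seasons in by_player.values():
        for year in sorted(seasons):
            if year + 1 in seasons:  # both Year X and Year X+1 qualified
                pairs.append((seasons[year], seasons[year + 1]))
    return pairs


# Made-up rows, not real player data.
rows = [
    {"player": "A", "year": 2010, "pa": 600, "avg": 0.280},
    {"player": "A", "year": 2011, "pa": 410, "avg": 0.265},
    {"player": "A", "year": 2012, "pa": 200, "avg": 0.240},  # below threshold
    {"player": "B", "year": 2011, "pa": 500, "avg": 0.300},
    {"player": "B", "year": 2012, "pa": 520, "avg": 0.290},
]
pairs = pair_seasons(rows)
print(len(pairs), "season pairs")
```

Under this reading, a hitter with several qualifying seasons in a row contributes more than one pair to the sample.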

First, I ran a simple linear regression for the correlation between batting average in Year X and batting average in Year X+1.

I found an r-squared of 18.1 percent for this sample, which means that batting average in one year explains 18.1 percent of the variation in batting average in the subsequent season. This correlation is fairly strong, and about where I expected it to be.
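As a rough illustration of what that number measures, the r-squared of a simple linear regression is just the squared Pearson correlation between the two columns. The sketch below computes it in plain Python. The batting averages are invented, so the printed value says nothing about the real sample.

```python
def r_squared(x, y):
    """R-squared of a simple least-squares regression of y on x.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Made-up batting averages for five hitters in Year X and Year X+1.
avg_year_x = [0.250, 0.300, 0.270, 0.280, 0.260]
avg_year_x1 = [0.262, 0.281, 0.275, 0.266, 0.258]
print(f"r-squared: {r_squared(avg_year_x, avg_year_x1):.3f}")
```

Read the other way, an r-squared of 18.1 percent corresponds to a correlation of about 0.43, since the square root of 0.181 is roughly 0.425.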

Then I ran the same linear regression, but this time I used walks minus strikeouts, divided by plate appearances, or (BB-K)/PA, as the predictor of future batting average. The correlation wasn’t nearly as strong: A batter’s quasi-approach explained only 9.52 percent of the variance in future batting average.

A multiple regression usually fits better than a simple linear regression, because each predictor gets its own weight. Thus, I first tried to improve the BB-K predictor by running a multiple regression with two predictors (BB/PA and K/PA); this helped re-weight the (BB-K)/PA formula into a stronger predictor, as the r-squared improved to 11.53 percent.
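The re-weighting idea can be sketched as follows. (BB-K)/PA forces the walk and strikeout rates to carry weights of +1 and -1, while a two-predictor regression lets the data choose the weights, so its in-sample r-squared can never be lower. This is a minimal plain-Python least-squares fit, not the software used for the study, and the rates and batting averages are made up.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(predictors, y):
    """Least-squares fit of y on the predictor columns plus an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    rows = [[1.0] + list(p) for p in zip(*predictors)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return coefs, 1 - ss_res / ss_tot


# Made-up walk rates, strikeout rates and next-season batting averages.
bb_rate = [0.08, 0.12, 0.05, 0.10, 0.07, 0.15]
k_rate = [0.20, 0.15, 0.25, 0.12, 0.18, 0.22]
next_avg = [0.265, 0.280, 0.245, 0.295, 0.270, 0.255]

net_rate = [b - k for b, k in zip(bb_rate, k_rate)]  # (BB-K)/PA
_, r2_one = fit_ols([net_rate], next_avg)
coefs, r2_two = fit_ols([bb_rate, k_rate], next_avg)
print(f"(BB-K)/PA alone: {r2_one:.3f}; BB/PA and K/PA separately: {r2_two:.3f}")
```

A real analysis would more likely use a statistics package, which also reports standard errors and p-values for each coefficient.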

That number still wasn’t nearly as strong as simply using batting average, but I figured there was a possibility that combining approach and batting average in a multiple regression would improve the results.

I found that when walks, strikeouts and batting average were combined as three separate predictors, the overall r-squared improved from the 18.1 percent for batting average alone to 21.4 percent.

Interestingly, walk percentage was not a significant predictor once batting average was included in the multiple regression equation; thus, walks were factored out and strikeouts and batting average were tested by themselves, which left the r-squared nearly unchanged at 21.3 percent.

Finally, I decided to see if a regressed version of BABIP would weed out some of the variability in batting average in the predictor season and improve the model.

To go about doing this, I took each batter’s BABIP in Year X and used just 25 percent of that number, while regressing the other 75 percent back to the league average, because it takes around two to two and a half seasons for BABIP to stabilize for hitters. I then combined this regressed BABIP with strikeouts and batting average.
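The 25/75 blend described above can be written as a one-line function. The .357 input is Middlebrooks’ BABIP cited earlier; the .300 league average is an assumed round number for illustration, not the figure used in the study.

```python
def regress_babip(babip, league_babip, weight=0.25):
    """Blend an observed BABIP with the league average.

    `weight` is the share given to the hitter's own BABIP
    (25 percent in the study); the rest goes to the league average.
    """
    return weight * babip + (1 - weight) * league_babip


# .300 is an assumed league-average BABIP, used only for illustration.
regressed = regress_babip(0.357, 0.300)
print(f"regressed BABIP: {regressed:.3f}")
```

With those inputs, a .357 BABIP is pulled most of the way back toward the league mark, to about .314.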

All three predictors were significant, but this model, with regressed BABIP added, had a slightly lower overall r-squared (20.8 percent).

Thus, the most predictive model that I found, given the few statistics tested, was simply a combination of strikeout percentage and batting average.

### Issues with the study

This study was nowhere near perfect, and I actually have a few issues with my own test.

First, I’m not sure that using walk percentage, or even walks minus strikeouts, to predict batting average is the best idea.

A plate appearance that results in a strikeout is considered an at-bat; thus, it goes in the denominator of batting average, while a plate appearance that results in a walk does not.
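A small worked example of that denominator effect, using a made-up hitter. At-bats are simplified here to plate appearances minus walks; real at-bats also exclude hit-by-pitches and sacrifices, which this sketch ignores.

```python
def simplified_at_bats(pa, walks):
    """At-bats under a simplifying assumption: PA minus walks only."""
    return pa - walks


def batting_average(hits, at_bats):
    return hits / at_bats


# Made-up season line.
pa, hits, walks, strikeouts = 600, 150, 60, 100

base_avg = batting_average(hits, simplified_at_bats(pa, walks))

# Turn 20 of the strikeouts into walks: same hits, same PA, fewer at-bats.
traded_avg = batting_average(hits, simplified_at_bats(pa, walks + 20))

hits_per_pa = hits / pa  # unaffected by the trade

print(f"{base_avg:.3f} -> {traded_avg:.3f}, hits per PA stays {hits_per_pa:.3f}")
```

In this example the batting average rises from about .278 to about .288 even though the hitter collected exactly the same number of hits, while hits per plate appearance does not move.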

I think it’s possible that if I had tested the results against hits per plate appearance instead of hits per at-bat, walks may have become significant. I’m not sure that hits per PA is the best thing to test, though. A hitter who does a good job of controlling the strike zone—for instance Ian Kinsler—tends to not have a great batting average (or hits per PA), but more than makes up for that by having a high on-base percentage.

The minimum plate appearance threshold that I set also could have caused problems. My criteria required a hitter to have at least 350 plate appearances in consecutive seasons. Abstractly, it makes sense that if a batter is allowed to have more than 700 plate appearances across two big league seasons, he probably has the ability (or his individual approach is good enough) to hit at the major league level.

This factor brought me to a possible idea for a follow-up article. As I mentioned earlier, Middlebrooks’ approach led scouts to be concerned about his batting average in the bigs. It’s possible that if a similar test were done for just rookies, walks and strikeouts might become more predictive.

It makes sense to me that it’s more important to project how well a rookie’s approach will translate into success than how well a veteran’s will, given that we’ve already seen the veteran hit in the bigs.

Strikeouts are rising in baseball. The number of hitters who walk more than they strike out is dwindling. The number of qualified hitters who walk more than they strike out has not broken double digits since 2009 (just four in 2012).

There’s a possibility that a batter’s ability to not strike out is becoming more important, in terms of hitting, than walks, given this rise. That, of course, is simply speculation, but there’s a chance it is having some effect on these results and is something to consider.

If a conclusion to this piece is necessary, I’d say that these results indicate a hitter’s ability to either make contact or not chase pitches (in other words, to not strike out) is more important than the ability to take a walk. This makes some intuitive sense, as more contact and fewer strikeouts should lead to more hits, and walks should be more highly correlated with on-base percentage.

However, I’m surprised that walks weren’t a more important factor in predicting future batting average.

One final thing to consider: We all should know by now that projecting weighted on-base average (wOBA) or on-base percentage is more valuable than simply looking at batting average.

**References & Resources**

All data comes courtesy of FanGraphs

I would recommend holding off on using regressions for baseball analysis until you take Econometrics. An r-squared of 18% also means that 72% (aka a majority) of the variation in batting average has nothing to do with previous year’s batting average. It would also be helpful to know the coefficients, standard deviations, and confidence intervals for your regressions.

Nice analysis as usual, good job.

I’m curious why you didn’t use the BB/K ratio in any of your regressions, particularly since you even mentioned it early on. Shandler’s Baseball Forecaster uses BB/K as one of the keys to understanding how good a hitter is, also their contact rate, and then finally by contact rate and walk rate.

Furthermore, I’ve never seen anyone on any of the baseball analysis sites use walks minus strikeouts divided by PA, is that something new you tried, or is that a new way people are using to analyze hitters?

Yes, it is very surprising that walks aren’t a more important factor in predicting future batting average, but perhaps to my point above, maybe you should be checking BB/K.

Lastly, about your final thing, I understand the need to point out that there are more valuable metrics to look at when evaluating a hitter, but what I think gets really lost by most novices coming into sabermetrics is that while on-base percentage is a better indicator than batting average, it does not mean that there is no value in batting average.

That’s my pet peeve about saber discussions today, too many people focus solely on walk rate and on-base percentage, ignoring the batting average. The higher the batting average, the higher the hitter’s OBP and SLG, OBP is only part of the equation in evaluating a hitter.

Yet few people seem to understand that, and I’m realizing that this is because it is drilled into them that “OBP>BA” so look at OBP for a better understanding of the batter’s abilities, batting average is old baseball.

Batting average is still important. It is just not as indicative of a batter’s ability as OBP. But you know what, it is not an either/or situation, we can still look at a hitter’s BA, as well as his OBP and SLG, it is not like we can’t look at BA anymore, that it has no value.

For it is hits that drive in runs the vast majority of the time; walks do not.

I assume that is a typo on Rowdy’s part: that should be 82%

I’ve been wondering about that myself, thanks Rowdy for pointing that out. It’s been 30 years since my last stats class, so it is very rusty, but I see that there would be these low r-squared but that they are “significant”, and I think I understand that, but it would be helpful to get a layman’s view of what that means exactly, once in a while. Not all of us are stats or econometric experts, but many of us know enough to pick up things or understand reminders.

Like for this, the 18% means that the link between the batting average in year 1 explains 18% of the batting average in year 2, which is a very low correlation from what I recall. However, it is significant because there is enough data backing it (to some level of significance, I think usually 95% is the goal, though some go with 90% and other 99%). I think it would be interesting, to Rowdy’s point, to also show what the significance level is for each of your regression tries.

I agree that you don’t want to look only at BA. There may well be a significant correlation between BB% and ISO.

@Rowdy

1.) I have taken econometrics

2.) I think we’re saying the same thing. I said “we tend to see a fairly weak correlation in year-to-year batting average” and then you said “An r-squared of 18% also means that 72% (aka a majority) of the variation in batting average has nothing to do with previous year’s batting average.”

There’s a lot of variability in batting average in year-to-year samples, and this r-squared reflects that.

3.) I can email you the ANOVA table for each test if you’d like. I try to keep those things you asked for (CI, SD, etc.) out of the actual article, as not all the readers are stat-savvy and I’m not trying to scare people away.

I try to get the relevant, important numbers to illustrate the point.

@OBC

I tested BB/K, but left the results out of this piece because the r-squared was only 2.8 percent.

Also, I don’t think what I was saying is batting average is not important. OBP is essentially batting average plus walk rate, so hits are still very important in saber-discussion, especially when we consider something like wOBA.

Also, all of these tests were at 95% significance level. That is the norm for all of the regressions that I run.

Thus, a sample of almost 900 and an r over 0.4 will be very significant at a 95 percent confidence level.

@Glenn

2) I was referring to the point in your 3rd “Study” paragraph which asserted that the 18.1 r-squared “correlation was fairly strong, and about where I expected it to be”. Could you explain what you meant by that?

3) I appreciate the offer, but all I was really suggesting was that you provide a general indication of what your findings would imply. A variable can be statistically significant but not relevant because the coefficient is too small. If a 1 point increase in previous year batting average leads to a 1 point increase in subsequent year BA, then it doesn’t really matter if it’s statistically significant.

2) What I meant by that has to do with the issue of sample size. Looking at any year to year correlation in baseball isn’t going to return a ridiculous r-squared of .90 or something. One season in baseball is a small sample size for most statistics, including batting average.

Batting average has the variability of BABIP working against it, which is why I said we tend to see a fairly weak correlation, and then that is what I saw “aka what I expected it to be”. Explaining 18 percent of the variability in next season batting average is not bad at all, given the random variation in BABIP, park factors, etc.

3) The slope for batting average was ~.45. This does not mean that a one point increase in previous year batting average leads to a one point increase in the next season; all it says is that we’re using about 45 percent of batting average as the true talent of the hitter and then regressing the other 55 percent back to the mean through a constant, which makes sense given the correlation for the sample was about .43.
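To make that concrete, here is a minimal sketch of the projection a slope of about .45 implies. It assumes the sample mean is the same in both seasons, so the constant works out to 55 percent of that mean. The .265 mean is an assumed round number, not a value from the regression output.

```python
def project_avg(avg, mean_avg, slope=0.45):
    """Project next-season batting average by regressing toward the mean.

    Keeps `slope` of the hitter's distance from the mean and gives the
    rest back, assuming the mean is the same in both seasons.
    """
    return mean_avg + slope * (avg - mean_avg)


# .265 is an assumed sample mean, used only for illustration.
projection = project_avg(0.300, 0.265)
print(f"projected average: {projection:.3f}")
```

Under those assumptions, a .300 hitter projects to about .281 the next year. The slope and the correlation land close together because, in a one-predictor regression, the slope equals the correlation times the ratio of the two seasons’ standard deviations, and that ratio should be near one here.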