Workload and Durability (Part 1)
Much has been written on the subject of pitch counts. In some quarters, the notion that high pitch counts are dangerous to a pitcher’s health is an article of faith; the idea makes intuitive sense. There is just one problem — a lack of evidence in its favor.
Earlier this year, Rob Neyer and Bill James published the exceptional Neyer/James Guide to Pitchers — an encyclopedia of the pitching repertoire of nearly every significant pitcher in Major League and Negro League history. In one essay, titled “Abuse and Durability” (pp. 449-463), James runs a series of matched-pair studies, identifying the most similar non-abused pitchers to pitchers listed as “abused” in various editions of Baseball Prospectus based on the Pitcher Abuse Points (PAP) system devised by Rany Jazayerli and Keith Woolner. The results skew in one direction: the “abused” pitchers keep more of their value (on average) than comparable “non-abused” pitchers. That’s right — keep their value.
James concludes his essay speculating about what is behind the phenomenon:
Most injuries to pitchers are not the result of chronic overuse; some are, particularly to young pitchers, but most are not. They’re catastrophic events, just like a heart attack or a torn muscle. They happen suddenly, and they happen when a pitcher goes outside the envelope of his previous conditioning.
Backing away from the pitcher’s limits too far doesn’t make a pitcher less vulnerable; it makes him more vulnerable. And pushing the envelope, while it may lead to a catastrophic event, is more likely to enhance the pitcher’s durability than to destroy it.
And yet, questions linger. James himself notes that since power pitchers last longer (and tend to throw more pitches per inning) than finesse types, controlling for quality of pitcher isn’t sufficient to isolate the effect of high pitch counts. In addressing the issue of pitch count we must be sensitive to differences of pitcher type.
The quality of a matched-pair study depends on how similar your comparison groups are in all respects save for the one under study. On the other hand, pegging the similarity standard too high may lead to too few matches to tell us anything useful. A balance must be struck between sample size and degree of similarity.
Matched-Pair Workload Study #1
Starting with a large pool of players from which to match leads to more good matches. To that end, I settled on a pool of starting pitchers born after 1945 and before 1970. This 24-year period encompasses the baby boom and immediate post-boom generations. All but a handful of pitchers born before 1970 are either retired or no longer starting in the majors, so we don’t need to worry very much about incomplete data.
To start, we need to define heavy and moderate workloads for starting pitchers. A heavy workload was defined as exceeding 3,800 estimated pitches(1) in a given year; a moderate workload was defined as 3,000 to 3,600 estimated pitches. Because of the influence of pitch counts and the pervasiveness of the five-man rotation, very few pitchers have exceeded 3,800 pitches in recent years (a pitcher making 34 starts would need to average almost 112 pitches a start).
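As a rough illustration of these thresholds, the sketch below classifies a season by its estimated pitch total and reproduces the pitches-per-start arithmetic. The function names are hypothetical, and treating the 3,000 and 3,600 bounds as inclusive is an assumption; seasons that fall outside both bands are simply left unclassified.

```python
HEAVY_THRESHOLD = 3800          # a heavy season exceeds this many estimated pitches
MODERATE_BOUNDS = (3000, 3600)  # assumed inclusive on both ends

def classify_workload(estimated_pitches):
    """Return 'heavy', 'moderate', or None for a single season."""
    if estimated_pitches > HEAVY_THRESHOLD:
        return "heavy"
    low, high = MODERATE_BOUNDS
    if low <= estimated_pitches <= high:
        return "moderate"
    return None  # neither band applies (for example, 3,700 pitches)

def pitches_per_start(total_pitches, starts):
    """Average pitches per start needed to reach a season total."""
    return total_pitches / starts

print(classify_workload(4012))                  # a season above the heavy threshold
print(round(pitches_per_start(3800, 34), 1))    # the "almost 112" figure
```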
Group A pitchers were those who had at least one heavy workload season before age 28. Group B pitchers were those who never exceeded 3,600 estimated pitches in a year before age 28. Matches were based on highest similarity score, using single season to single season comparisons, and taking into account the following characteristics:
(1) Strikeouts per Opportunity [K/(BF-IW)] |
(2) Non-Intentional Walks per Opportunity [(W-IW)/(BF-HBP-IW)] |
(3) Earned Run Average [ER/IP*9] |
(4) Year of Birth |
(5) Age on July 1st(2) |
(6) The matched pitchers must throw with the same hand |
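The three rate statistics in the list above can be sketched as small Python functions. The function and parameter names are illustrative, the season line is invented, and innings pitched are assumed to be a true decimal (216.667 rather than the box-score shorthand 216.2).

```python
def strikeout_rate(k, bf, iw):
    """Strikeouts per opportunity: K / (BF - IW)."""
    return k / (bf - iw)

def walk_rate(w, iw, bf, hbp):
    """Non-intentional walks per opportunity: (W - IW) / (BF - HBP - IW)."""
    return (w - iw) / (bf - hbp - iw)

def earned_run_average(er, ip):
    """Earned runs per nine innings: ER / IP * 9."""
    return er / ip * 9

# Hypothetical season line, made up for illustration only.
season = {"k": 150, "w": 70, "iw": 5, "hbp": 4, "bf": 905, "er": 80, "ip": 216.0}

print(round(strikeout_rate(season["k"], season["bf"], season["iw"]), 3))
print(round(walk_rate(season["w"], season["iw"], season["bf"], season["hbp"]), 3))
print(round(earned_run_average(season["er"], season["ip"]), 2))
```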
Here’s a hypothetical example of how similarity scores work in this study. Imagine two pitchers with identical ERAs, strikeout rates and walk rates. These pitchers are the same age (to the day) and are born in the same year. The Group A pitcher, however, throws 750 estimated pitches more than the Group B pitcher. The method considers this a perfect match — earning 1,000 points. In actual cases, the differences in each category result in points deducted from 1,000; the higher the final similarity score, the greater the (statistical) similarity between the two pitchers.
The final requirement was that no Group B pitcher could be matched with more than one Group A pitcher; the match with the higher similarity score was given priority. Each matched season was designated Year Zero for that particular pitcher. A more detailed description of the comparison method(3) can be found in the footnotes.
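One way to enforce the one-to-one rule is a greedy pass over all candidate pairs, taken from the highest similarity score down. This is a sketch under an assumption: the article does not say how a Group A pitcher who lost his first-choice partner was re-matched, so here he simply falls through to his next-best available Group B pitcher. The names and scores in the example are hypothetical.

```python
def greedy_match(candidates):
    """candidates: iterable of (group_a, group_b, similarity_score) tuples.

    Returns {group_a: (group_b, score)} with each Group B pitcher used at most once.
    """
    matched = {}
    used_b = set()
    # Visit the most similar pairs first so they get priority.
    for a, b, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if a in matched or b in used_b:
            continue  # one side of this pair is already taken
        matched[a] = (b, score)
        used_b.add(b)
    return matched

# Hypothetical candidates: A1 and A2 both prefer B1.
candidates = [
    ("A1", "B1", 960), ("A1", "B2", 950),
    ("A2", "B1", 970), ("A2", "B2", 940),
]
print(greedy_match(candidates))
```

In this example A2 keeps B1 because that pair has the higher score, and A1 falls back to B2.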
Quality Control
Before we turn to the matched pairs, let’s consider what James calls “quality leakage.” James noted that in matched-pair studies, very good pitchers tend to be matched with lesser pitchers because the best pitchers have few close statistical peers. James’ solution was to select pitchers for his “Group B” that were of slightly higher quality (more Win Shares) than his “Group A” pitchers so as to offset the leakage. I took a different approach: I discarded the least similar third of the matched pairs, as ranked by similarity score.
Of the 69 matched pairs, the 23 least similar pairs were removed from consideration. I believe this is sufficient to alleviate the worst effects of the quality leakage problem, while maintaining a sufficiently large sample. To illustrate, the worst “match” among the original 69 pairs was Nolan Ryan/David Cone. Ryan is nearly a generation older than Cone and walked and struck out batters at a greater rate as a young pitcher. Because they are so dissimilar, there is no reason to think that the Ryan/Cone match tells us anything about durability.
Vida Blue (’71) | Ted Higuera (’86) | John Montefusco (’75) |
Bert Blyleven (’73) | Catfish Hunter (’72) | Mike Mussina (’96) |
Jim Clancy (’80) | Randy Jones (’76) | Gary Nolan (’70) |
Joe Coleman (’74) | Clay Kirby (’71) | J.R. Richard (’76) |
Ron Darling (’85) | Mark Langston (’87) | Nolan Ryan (’74) |
Larry Dierker (’69) | Bill Lee (’73) | Frank Tanana (’76) |
Dwight Gooden (’85) | Dennis Leonard (’77) | Fernando Valenzuela (’82) |
Ron Guidry (’78) | Jon Matlack (’74) |
The “cast-offs” were pooled to create a new group (Group C); I’ll consider them in Part 2 of this series. A few Hall of Fame-type pitchers from Group A made it into the study, most notably Roger Clemens and Greg Maddux. Should we exclude them as well? Removing “special arms” by hand may seem like a sensible approach, but it creates its own problems (which I will also consider in Part 2). Hand-picking which pairs stayed and which went was not a path I wanted to go down.
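The mechanical cut described in this section (rank the pairs by similarity score, set aside the bottom third) can be sketched as follows. The function name and the synthetic scores are hypothetical; only the 69, 23 and 46 counts come from the text.

```python
def split_by_similarity(pairs):
    """pairs: list of (group_a, group_b, similarity_score) tuples.

    Returns (kept, cast_offs), where cast_offs is the least similar third.
    """
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    n_drop = len(ranked) // 3            # 69 pairs -> 23 dropped
    n_keep = len(ranked) - n_drop        # 69 pairs -> 46 kept
    return ranked[:n_keep], ranked[n_keep:]

# Synthetic data: 69 pairs with made-up scores from 900 to 968.
pairs = [(f"A{i}", f"B{i}", 900 + i) for i in range(69)]
kept, cast_offs = split_by_similarity(pairs)
print(len(kept), len(cast_offs), 2 * len(kept))
```

Because the rule depends only on the ranking, no judgment call about individual pitchers enters the selection.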
Without further ado, the 92 subjects of Study #1 are:
Group A Pitcher | Sim. | Group B Pitcher | Group A Pitcher | Sim. | Group B Pitcher | |
Len Barker(’80) | 929 | Jose Guzman(’88) | D.Lemanczyk(’77) | 959 | Bart Johnson(’76) | |
Bill Bonham(’74) | 936 | Ken Forsch(’73) | Greg Maddux(’91) | 959 | Andy Benes(’92) | |
Oil Can Boyd(’85) | 954 | John Burkett(’90) | Dennis Martinez(’79) | 955 | Bill Gullickson(’83) | |
Tom Bradley(’71) | 927 | Reggie Cleveland(’72) | Jack McDowell(’92) | 957 | S.Bankhead(’89) | |
Kevin Brown(’92) | 955 | Pedro Astacio(’96) | Doc Medich(’74) | 967 | Bob Moose(’73) | |
Tom Browning(’85) | 972 | Jamie Moyer(’88) | Mike Moore(’86) | 984 | Andy Hawkins(’86) | |
Ron Bryant(’73) | 956 | John Curtis(’73) | Jack Morris(’82) | 966 | Eric Show(’83) | |
Steve Busby(’74) | 933 | Gary Gentry(’69) | Mike Norris(’80) | 925 | Orel Hershiser(’85) | |
Roger Clemens(’87) | 966 | Erik Hanson(’90) | Melido Perez(’92) | 962 | Pete Harnisch(’93) | |
Jim Colborn(’73) | 928 | Dave Frost (’79) | Dan Petry(’83) | 953 | Jay Tibbs (’85) | |
Joe Decker(’74) | 938 | Buzz Capra(’74) | Rick Reuschel(’74) | 949 | Rick Langford(’77) | |
D.Eckersley(’78) | 971 | Scott Sanderson(’80) | Jerry Reuss(’73) | 944 | Bob Shirley(’77) | |
Cal Eldred(’93) | 953 | Ben McDonald(’92) | Steve Rogers(’77) | 942 | Burt Hooton(’77) | |
R.Erickson(’78) | 950 | Mark Lemongello(’78) | Bret Saberhagen(’88) | 929 | Frank Castillo(’92) | |
Alex Fernandez(’96) | 956 | Tommy Greene(’93) | Jim Slaton(’76) | 953 | Bob Forsch(’75) | |
Ed Figueroa(’76) | 935 | Alan Foster(’73) | John Smoltz(’93) | 936 | Kevin Appier(’95) | |
Mike Flanagan(’78) | 942 | Bob Ojeda(’84) | Mario Soto(’83) | 953 | Tim Belcher(’89) | |
W.Garland(’77) | 962 | Doyle Alexander(’77) | Paul Splittorff(’73) | 941 | John Candelaria(’80) | |
Ross Grimsley(’74) | 935 | Ken Brett (’73) | Dave Stieb(’83) | 956 | Charlie Lea (’83) | |
Mark Gubicza(’88) | 967 | Ken Hill(’92) | Rick Sutcliffe(’83) | 953 | Dave Stewart(’84) | |
Ed Halicki(’77) | 953 | Pete Vuckovich(’79) | Dick Tidrow(’73) | 946 | Glenn Abbott(’77) | |
Pat Hentgen(’96) | 964 | Ramon Martinez(’95) | Frank Viola(’86) | 948 | Britt Burns(’85) | |
Jim Hughes(’75) | 938 | Dave Freisleben(’74) | Mike Witt(’86) | 936 | Jose Rijo(’91) | |
The weighted average performance of the Group A pitchers was 17 wins, 13 losses, 3.52 ERA, 15.0% strikeout rate, 7.3% walk rate, 268.0 IP, and 4,038 estimated pitches.
The weighted average performance of the Group B pitchers was 13 wins, 11 losses, 3.54 ERA, 15.0% strikeout rate, 7.5% walk rate, 216.7 IP, and 3,268 estimated pitches.
The only significant statistical differences between the two groups in Year Zero are those related to workload. Aha, you might say — that’s only one season. Could the Group B pitchers be (in truth) inferior and their Year Zero performance merely a result of a preponderance of career years? Could there be differences in performance in the years leading up to the seasons in question? The numbers for the average Group A and Group B pitcher for the three years up to and including Year Zero …
Group | IP | Pitches | ERA | K rate (%) | W rate (%) | Wins | Losses |
Group A average | 594.7 | 9008 | 3.54 | 15.3 | 7.6 | 36 | 30 |
Group B average | 452.7 | 6833 | 3.56 | 15.2 | 7.4 | 27 | 24 |
… tell the same tale. Apart from workload indicators, the two groups appear to be a very good match.
Suppose you are the general manager of a baseball team and are considering acquiring one of two pitchers: a 25-year-old pitcher who threw 3,900 pitches in 2004 and a very similar pitcher who threw only 3,300. Your scouts don’t turn up any major differences between the two and their overall performance over the last three years has also been very similar. The one difference is that the first pitcher has been subjected to a significantly greater workload than the second pitcher. Who would you choose and why?
Is surviving the heavy workload a marker of greater durability, or instead does the greater “mileage” mean you’d be better off acquiring the “underused” pitcher? The answer … next week.
References & Resources
(1) Pitches thrown were estimated using the Extended Pitch Count Estimator developed by Tangotiger.
(2) Age was calculated using exact date of birth as of July 1st of the year in question.
(3) Similarity scores start at 1,000, and points are deducted in each category according to its assigned weight and its standard error, the latter based on the population of 3,000+ pitch seasons in the pool. The weights were as follows: strikeout rate = 40 points; ERA = 40 points; age = 30 points; birth year = 30 points; walk rate = 20 points; estimated pitches = 20 points; total = 180. For all categories except estimated pitches, the absolute difference between the two pitchers was multiplied by the assigned weight and divided by the standard error. For estimated pitches, the amount by which the gap between the two pitchers differed from 750 pitches was multiplied by the assigned weight and divided by the standard error.
Sample Calculation (the divisor in each formula is the standard error for that category)
Pat Hentgen (1996), born 1968: 16.1% K rate, 8.3% W rate, 3.22 ERA, 27.63 age, 4,012 estimated pitches
Ramon Martinez (1995), born 1968: 16.2% K rate, 9.0% W rate, 3.66 ERA, 27.28 age, 3,150 estimated pitches
Strikeout Points: abs(.161-.162)*40/.0400 = 1.00
Walk Points: abs(.083-.090)*20/.0211 = 6.64
ERA Points: abs(3.22-3.66)*40/1.026 = 17.15
Age Points: abs(27.63-27.28)*30/1.814 = 5.79
Year of Birth Points: abs(1968-1968)*30/6.99 = 0.00
Estimated Pitches Points: abs(750-abs(4012-3150))*20/354.6 = 6.32
Sum of Deductions: 1.00 + 6.64 + 17.15 + 5.79 + 0.00 + 6.32 = 36.90
Similarity Score = 1000 – 36.90 = 963.10 (rounded off to 963**)
** The inputs shown above are rounded, which is why this worked example gives 963; the correct similarity score is 964, as listed in the main text.
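For readers who want to reproduce the sample calculation, here is a minimal Python sketch of the scoring formula. The weights, standard errors and pitcher figures are the ones quoted in this footnote; the dictionary keys and function name are illustrative, and this is not the author's original code.

```python
WEIGHTS = {"k_rate": 40, "w_rate": 20, "era": 40,
           "age": 30, "birth_year": 30, "pitches": 20}
STD_ERR = {"k_rate": 0.0400, "w_rate": 0.0211, "era": 1.026,
           "age": 1.814, "birth_year": 6.99, "pitches": 354.6}
TARGET_PITCH_GAP = 750  # a 750-pitch workload gap costs no points

def similarity_score(a, b):
    """Start from 1,000 and subtract weighted, standardized differences."""
    deductions = 0.0
    for key in ("k_rate", "w_rate", "era", "age", "birth_year"):
        deductions += abs(a[key] - b[key]) * WEIGHTS[key] / STD_ERR[key]
    # Pitches are scored on how far the gap is from 750, not on the gap itself.
    gap = abs(a["pitches"] - b["pitches"])
    deductions += abs(TARGET_PITCH_GAP - gap) * WEIGHTS["pitches"] / STD_ERR["pitches"]
    return 1000 - deductions

hentgen_1996 = {"k_rate": 0.161, "w_rate": 0.083, "era": 3.22,
                "age": 27.63, "birth_year": 1968, "pitches": 4012}
martinez_1995 = {"k_rate": 0.162, "w_rate": 0.090, "era": 3.66,
                 "age": 27.28, "birth_year": 1968, "pitches": 3150}

print(round(similarity_score(hentgen_1996, martinez_1995)))
```

With the rounded inputs above, the script lands on about 963.1, matching the hand calculation rather than the 964 obtained from unrounded data.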