Building Better Statistical Incentives

by Neil Weinberg

Don Sutton is the all-time leader in quality starts with 483. (via Adam Fagen)

Statistics help us make sense of the world. We face a constant barrage of stimuli. Some of it is important, some of it isn’t. Even if we could record it all and play it back on demand, we would still struggle to glean anything terribly useful.

Baseball is no different. If you had the ability to watch all 2,430 games every year, you certainly would develop a sense of which players were truly exceptional and which teams were best crafted, but without processes for capturing and organizing data, you’d be at a loss when confronted with even the most straightforward questions.

To understand the game, we need to document it. The games are too numerous and their events too complex for us to hold more than a few in our heads in much detail. But if we collect the data in a regimented way and organize it properly, those hundreds of thousands of plate appearances each year start to become comprehensible.

A list of the outcomes of a player’s 700 plate appearances is almost impossible to assess, but if you categorize them into singles, doubles, triples, home runs, etc., it becomes much easier. If you summarize the rate at which the player reached base or collected extra bases, it becomes even easier to comprehend. Suddenly a mess of information becomes knowledge.
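As a rough illustration of that step from raw outcomes to summary rates, here is a minimal Python sketch. The outcome labels, the `summarize` helper, and the ten-appearance sample are all made up for this example, and the two rates are simplified per-plate-appearance figures rather than official on-base or slugging percentages.

```python
from collections import Counter

# Hypothetical outcome labels; a real play-by-play feed would use its own codes.
ON_BASE = {"single", "double", "triple", "home_run", "walk", "hit_by_pitch"}
TOTAL_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def summarize(outcomes):
    """Collapse a list of plate-appearance outcomes into counts and two rates."""
    counts = Counter(outcomes)
    pa = len(outcomes)
    if pa == 0:
        return {"counts": counts, "pa": 0, "on_base_rate": 0.0, "bases_per_pa": 0.0}
    # Share of plate appearances in which the batter reached base.
    reached = sum(n for outcome, n in counts.items() if outcome in ON_BASE)
    # Total bases on hits, spread over all plate appearances.
    bases = sum(TOTAL_BASES.get(outcome, 0) * n for outcome, n in counts.items())
    return {
        "counts": counts,
        "pa": pa,
        "on_base_rate": reached / pa,
        "bases_per_pa": bases / pa,
    }


# Made-up sample of ten plate appearances, for illustration only.
sample = ["out", "single", "walk", "out", "home_run",
          "out", "double", "out", "out", "strikeout"]
summary = summarize(sample)
print(summary["counts"])
print(summary["on_base_rate"], summary["bases_per_pa"])
```

The same function works on 700 outcomes as easily as on ten, which is the point: the list gets longer, but the summary stays the same size.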

You can start with 700 plate appearances spread across six months and boil them down to a single value that compares the offensive contribution of a player to the rest of his league (e.g., Weighted Runs Created Plus). That single number tells us something meaningful that we can’t see if we’re looking directly at those 700 individual observations. Combining and summarizing the data also provides value because it strips away the idiosyncrasies of the individual observations. More data mean less opportunity for outlier events to shape our perception.

This is a long way of saying that building complex baseball statistics is a worthwhile endeavor. The game is complicated, and our efforts to summarize it should reflect that. If you’re asking a complicated question, your attempt to answer it should be sophisticated.

Perhaps the most complicated question in all of baseball analysis is the one of pitcher performance. The act of pitching is complex, and the results depend on many factors outside the pitcher’s control. Cutting through that complexity is challenging.

In the earliest days, we focused on wins and losses, then we shifted to earned run average. In recent decades, we’ve tried to take a defense-independent approach with statistics such as fielding independent pitching (FIP), expected fielding independent pitching (xFIP), and deserved run average (DRA).

For years, we’ve been chasing better and better ways to measure pitcher performance. Better data and better analytical tools have opened new doors, but as our brightest minds chase the holy grail of the Single Best Metric, it is important to stop and consider the purpose of baseball statistics–they help us make sense of the game.

A lot of the attention of our best analysts is spent trying to break new ground. They chase after solutions to the game’s most complicated and difficult questions. But there remain areas of less novelty where we haven’t quite finished our work.

There is absolutely nothing wrong with the growing suite of complex pitching statistics. They help us answer all sorts of important questions, but in our haste to build bigger and better metrics, I worry we may have lost sight of the fact that we also need statistics that are simple because sometimes we have simple questions.

We also don’t see much effort spent trying to replicate existing work from scratch to verify its quality, because most replication efforts don’t end with anything exciting to publish. Beyond that, we rarely see straightforward, plain English explanations accompanying new statistics to help interested but non-expert readers follow along. All these components are critical for our understanding of the game, but they don’t get the kind of attention they deserve.

I spent some time this fall studying quality starts because I believed it was one of those simple statistics that had the right idea but sort of failed to satisfy its audience. We’ve long understood that wins don’t provide a good single-game assessment of pitcher performance, but our main attempts to improve upon wins–game score and quality starts–have failed to replace wins in common parlance for a variety of reasons. Among statheads, we’ve generally ignored the desire for a good, single-game assessment and moved on to full-season metrics, but a single-game stat is clearly something the public desires.

What I learned when I was trying to refine the quality start was that its creator, John Lowe, actually did a pretty good job. He drew the line between quality performances and non-quality performances given the underlying success rate of teams when the starter goes six innings and allows three or fewer earned runs versus when he does not.
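The comparison described here can be sketched in a few lines of Python. This is only an illustration of the idea, not Lowe's method or the analysis behind this piece: the function names are hypothetical, the six starts are invented, and innings are expressed as outs recorded (18 outs equals six innings) to sidestep the 6.1/6.2 innings notation.

```python
def is_quality_start(outs_recorded, earned_runs):
    """At least six innings (18 outs) and three or fewer earned runs."""
    return outs_recorded >= 18 and earned_runs <= 3


def win_rate_by_group(starts):
    """Return the team win rate for quality and non-quality starts.

    Each start is a tuple: (outs_recorded, earned_runs, team_won).
    """
    tallies = {True: [0, 0], False: [0, 0]}  # group -> [wins, games]
    for outs, earned_runs, team_won in starts:
        group = is_quality_start(outs, earned_runs)
        tallies[group][0] += int(team_won)
        tallies[group][1] += 1
    return {
        ("quality" if group else "non_quality"): (wins / games if games else None)
        for group, (wins, games) in tallies.items()
    }


# Invented starts for illustration only; these are not real results.
sample_starts = [
    (21, 2, True),   # 7 IP, 2 ER: quality start, team won
    (18, 3, False),  # 6 IP, 3 ER: quality start, team lost
    (17, 1, True),   # 5 2/3 IP, 1 ER: one out short of a quality start
    (24, 4, False),  # 8 IP, 4 ER: too many earned runs
    (19, 0, True),   # 6 1/3 IP, 0 ER: quality start, team won
    (12, 5, False),  # 4 IP, 5 ER: not a quality start
]
rates = win_rate_by_group(sample_starts)
print(rates)
```

Run against real game logs instead of the invented sample, the gap between the two win rates is the kind of evidence that tells you whether the six-inning, three-run line sits in a sensible place.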

However, trying to work through the development of a simple but useful statistic got me thinking about the way we approach the creation and refinement of baseball statistics more broadly. (It also gave me new respect for the quality start, and frankly for Lowe’s ability to see patterns without the help of Baseball-Reference.)

Particularly, I was left wondering why there isn’t more energy devoted to refining the simple statistics, my ill-fated attempts to improve the quality start notwithstanding. After all, we build statistics to help us understand the game, and the simple stats are a key component of that knowledge base. I think the same applies to replication and plain English documentation efforts.

There are a couple of trends I think explain this gap in our approach. The first is that Statcast (and previously PITCHf/x) has created an environment in which research is guided by what data exist rather than what questions we really want to answer. Statcast has the potential to be a revolutionary tool, but the analytical community is at the mercy of MLBAM in terms of what data are collected and released. One problem with that is that people who want to use and display their impressive analytical skills are shoehorned into working on only the kinds of problems Statcast allows them to study.

The world is thinking in Big Data at the moment, and Statcast is baseball’s Big Data. But there are small data problems that need the attention of smart people. Solving those problems requires the creation of data rather than harnessing data that already exist. We’ve allowed our research agenda to be disproportionately influenced by the availability of data rather than the value of a particular question.

The second trend is the degree to which intellectual discovery and commercial/professional success are beginning to run together in the baseball world. In the early days of sabermetrics, everyone was a hobbyist, and the main focus was merely knowledge for knowledge’s sake. Today, as stat sites have become businesses and internet analysts have developed a pipeline into major league front offices, the incentive structure is changing.

This isn’t to say anyone is acting inappropriately, but there is a lot of pressure to develop new and attention-grabbing concepts. There is much less incentive to replicate and test previous work, or to document new work in ways that make it accessible. It’s unlikely that a slightly better quality start would generate much web traffic, and it probably wouldn’t get anyone an interview with a baseball operations department, but it would fill a need for baseball fans that has existed for a long time.

Neither of these trends is fatal, but they reflect a community that is drifting toward building things that are professionally advantageous rather than things that serve the audience. Public sabermetrics is full of talented and hard-working people, but the competitive and secretive bent within the community is growing.

We would all be better off if we talked more openly about what we’re working on prior to publication. We would all be better off if we looked around at the ongoing work and identified where the gaps are. We need processes in place to push people to work on worthwhile problems even when the professional and commercial incentives don’t guide them to that work.

It’s not surprising that our best and brightest are responding to the incentive structure we have in place, but as is the case in lots of industries, the natural incentive structure hasn’t yielded ideal outcomes. The way to change this is for the community to make this kind of work socially desirable.

We need more replication, more plain English documentation, and more refined simple statistics that help people understand the game. Those aren’t necessarily profitable avenues of research, but we’ll all be better off going forward if we collectively incentivize that kind of work.

Some simple ideas include public demands for plain English explanations and updates to those explanations after receiving questions from readers. Major sites should also offer space to anyone in the community who wants to replicate any previously published work. We could host competitions for high school and college students to see who can come up with the best refinements to well-known statistics.

Statistics help us understand the world, and their development should be based upon the needs of the audience rather than the incentive structures surrounding the creators. It wouldn’t take much to push us in a better direction, but it will take agreement from influential people that such a push is worthwhile.
