How Sabermetrics saved my dissertation

I spent the summer of 2007 chasing after three pixels. At that point, I was in the midst of performing the statistical analyses on what would soon become my dissertation. My degree is in clinical psychology, and I was working with a database concerning the longitudinal development of mental health concerns, like depression and anxiety, among a sample of 400 Chicago teenagers. I was armed with a few hypotheses that I wanted to test.

Now, it’s common sense that stressful events in one’s life might lead to mental health problems, particularly depression. This has been shown in countless studies using multiple methods. I figured that replicating that finding in my dissertation study was a mere formality, and I listed it in my proposal as only a “preliminary analysis.”

At first, I didn’t catch the negative sign, but those three pixels ruined my summer. I ran a regression predicting year 2 symptoms from year 1 stress, controlling for year 1 symptoms. And then, something weird happened. The coefficient on year 1 stress was … negative. In essence, the regression was saying that the more stress you had in your life in year 1, the better off you were psychologically in year 2.

Understand, reader, that the stressors we were measuring included such major events as being assaulted or having a family member die. Here, the regression was telling me that if I wanted to make these kids better off, I should assault them and murder their family members. This made no sense.

The summer of 2007 was also when I started getting heavy into Sabermetrics. At the time, I had just begun researching and writing for the blog Statistically Speaking. I have to admit, one of the draws to Sabermetrics was that I could open up SPSS (a common statistical analysis program used by social scientists) and have Retrosheet files open instead of my dissertation data set. If anyone asked, I could say that it was my dissertation data and no one would know. Little did I know that Sabermetrics would actually save my dissertation.

My adviser, Dr. Grant, wanted to know what was going on with that negative sign. I tried a few sets of numerical gymnastics routines, and gave her a few drafts explaining how I had managed to rule out this explanation or that, but I never quite figured out what had happened to begin with. I was starting to feel hopeless (another risk factor for depression). Dr. Grant wouldn’t declare me ready to defend my dissertation until I had cleared up why the negative sign showed up in the first place. Those are words that every doctoral student dreads hearing. I needed a flash of inspiration.

Sure enough, it came, but in the oddest place. I was reading and participating in a discussion of regression to the mean as it related to something or other. We know that all sorts of baseball stats regress to the mean, particularly if they aren’t very reliable stats to begin with. What if the same thing were happening to my stress variables? Many of the events that I was dealing with (assault, suicide of a friend) are low-frequency events (although, sadly, not low enough).

Low-frequency events are usually unstable on an individual level. A quick check of year 1 stress levels against year 2 stress levels showed a correlation in the low .30s. If you had five major events happen to you in year 1, that doesn’t tell us very much about what will happen in year 2. Due to the low correlation, my adolescents were very likely to regress to the league… er, neighborhood… mean. So, the kids who had high stress levels in year 1 were likely to see their stress levels drop in year 2.
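To make the regression-to-the-mean point concrete, here is a minimal Python sketch. It uses simulated numbers, not the study data. The assumptions are that stress scores are standardized and roughly normal, that the year-to-year correlation is 0.3 (standing in for "the low .30s"), and that the sample is much larger than the real 400 teens so the pattern is easy to see. The function name `simulate_stress` is made up for this illustration.

```python
import math
import random


def simulate_stress(n, r, seed=0):
    """Return n (year1, year2) pairs of standardized scores.

    Assumption: both years are standard normal and correlate at about r.
    This is synthetic data for illustration, not the dissertation data set.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        year1 = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Mixing year1 with independent noise gives correlation r
        # while keeping the variance of year2 equal to 1.
        year2 = r * year1 + math.sqrt(1.0 - r * r) * noise
        pairs.append((year1, year2))
    return pairs


def mean(values):
    return sum(values) / len(values)


pairs = simulate_stress(n=20000, r=0.3)

# Pick the 10% of simulated kids with the highest year 1 stress.
pairs.sort(key=lambda p: p[0], reverse=True)
top = pairs[: len(pairs) // 10]

top_year1 = mean([p[0] for p in top])
top_year2 = mean([p[1] for p in top])

print(f"Top decile, year 1 mean stress: {top_year1:.2f} SD above average")
print(f"Same kids, year 2 mean stress:  {top_year2:.2f} SD above average")
```

Under these assumptions, the most-stressed group in year 1 should still sit above average in year 2, but only by about 0.3 times as much as before. Nothing happened to them; the drop is what a weak year-to-year correlation looks like.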

That still isn’t enough to explain why more stress would lead to fewer symptoms. Stress levels might be going down, comparatively, but the kids in the sample were drawn from an area of Chicago with a lot of stress. How are they not depressed? It turns out that the answer also came from baseball.

One of the most seductive fallacies that fans buy into is the idea of the streaky hitter or the hot-handed pitcher. If a player has had a good couple of days in a row, fans (and managers!) seem to believe that this streak will keep going forever. In the playoffs, you will often see managers turn to a reliever who was awful in the regular season, except for the last two weeks, to get the team out of a tight spot in the eighth inning. (Paging Chan Ho Park. Chan Ho Park, please report to the bullpen.) I reasoned that perhaps the teens in my study figured that since life was getting better (stress levels were regressing downward to the mean), there was no point in being depressed.

Sure enough, a simple difference in stress levels from year 1 to year 2 proved to be a fantastic correlate of symptom levels, better than any of the other variables. The adolescents in the study weren’t responding to the overall levels of stress present. Instead, they were responding to the trend line, and the trend line was a mere product of regression to the mean.
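One way to see how year 1 stress could pick up a negative sign is to look at how the year 1 level relates to the year-over-year change. The Python sketch below is an illustration on simulated data, not a reanalysis of the study. It assumes standardized, normally distributed stress scores with equal variance in both years and a year-to-year correlation of 0.3. Under those assumptions, the correlation between the year 1 level and the change score (year 2 minus year 1) works out to -sqrt((1 - r) / 2), which is about -0.59.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


R = 0.3        # assumed year-to-year correlation ("low .30s" in the text)
N = 20000      # assumed sample size, far larger than the real 400 teens

rng = random.Random(1)
year1 = [rng.gauss(0.0, 1.0) for _ in range(N)]
year2 = [R * x + math.sqrt(1.0 - R * R) * rng.gauss(0.0, 1.0) for x in year1]

# The "trend line": how much each simulated kid's stress changed.
change = [b - a for a, b in zip(year1, year2)]

observed = pearson(year1, change)
expected = -math.sqrt((1.0 - R) / 2.0)

print(f"corr(year 1 stress, change in stress): {observed:.3f}")
print(f"value predicted by the formula:        {expected:.3f}")
```

If that holds for real data too, a high year 1 stress score is partly a stand-in for "things are about to get better," which is consistent with the change score being the variable that tracked symptoms.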

At my defense, I actually brought up the baseball analogy. (One of my committee members is a huge White Sox fan.) Baseball really is a microcosm of life. In the case of these teenagers, who were drawn from a very stressed area in Chicago, the tendency to interpret regression to the mean as a trend line, and a trend line as something permanent, served a protective function. The kids were less depressed when they did it. Baseball fans (and managers) certainly do the same thing, probably for the same reason.

The problem is that there’s a difference between feeling good about the situation and it actually being a good situation. As a clinical psychologist and as a Sabermetrician, it’s often my job to tell people the difference between the two… and often to have them not believe me.

In any case, I’d like to end with a thank you. I owe a small chunk of my degree to the Sabermetric community.

13 years ago

Very cool.  Psychology theses ended up getting me into sabermetrics.  I was always a fan of my psych stats classes as an undergrad, and the professor who taught those classes happened to be a big baseball fan.  So I ended up doing my psychology thesis on the home field advantage in professional sports.

Then Moneyball came out, right when I was in a master’s program for counseling psychology, with a focus on sport psychology.  Again, I had flexibility on what I could focus my thesis topic on, so I chose to do a literature review of sabermetrics up until that point (late 2003/early 2004).

Bill C
13 years ago

Terrific piece, which I am sharing with staff.  (I work in the health informatics field.)

13 years ago

Very interesting.  I can attest that sitting through an econometrics class, my own way of understanding the monotonous linear algebra on the board is to reference it to baseball or other sports examples which the techniques could be applied to.  It definitely helps to be interested in how data works together.

I do know of an NSF (or maybe it’s NIH) grant currently funding research in teaching statistics at the undergraduate level using sports, and especially baseball, as its base for discussion.  I don’t think it’s so much sabermetrics, but using interesting topics to apply dull ones that most undergrads aren’t particularly receptive to.

Mike Silver
13 years ago

Great story.

Dan Novick
13 years ago

“I owe a small chunk of my <strike>degree</strike> salary to the Sabermetric community.”

There, fixed.

Pizza Cutter
13 years ago

The check is in the mail, Dan.

13 years ago


Cancel payment on that check. Instead pay us back by discovering the psychological basis for working on sabermetric research instead of one’s dissertation. That’s a peculiarly powerful problem. The world will be a much better place for grad students when a cure for that ailment can be found.  There should be a Nobel in there somewhere.

… now, to go back to work on probabilistic models of hippocampal function or pitch sequencing?  …