Similarity Scores: a very beta feature

Last night during the heart-stopping Syracuse-Wisconsin game, Harry and I were talking (okay, Harry was mostly watching his team win by the skin of its teeth) about ways to improve the Brooks Baseball player card system. We exchanged some data and are presenting the first of our “data-driven” search tools—pitcher similarity.

This feature is incredibly beta and likely to change over the next few weeks, but right now when you search for a player (let’s pick Josh Beckett), you will get a table listing other players in the “Josh Beckett Family,” along with the “distance” to each player.

The scores are generated by comparing a vector of pitch speed, frequency, release, spin angle, and spin rate using MATLAB’s knnsearch algorithm to identify neighbors. Currently, we’re presenting the top five neighbors for each pitcher.
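
As a rough illustration of the neighbor search described above, here is a minimal Python sketch. It is not the MATLAB code behind the site: it assumes plain Euclidean distance on unweighted vectors, and the pitcher names, feature values, and the `nearest_neighbors` helper are all made up for the example.

```python
import math

def nearest_neighbors(features, query_name, k=5):
    """Return the k pitchers closest to query_name as (name, distance) pairs."""
    query = features[query_name]
    distances = [
        (name, math.dist(query, vector))  # Euclidean distance (an assumption)
        for name, vector in features.items()
        if name != query_name             # a pitcher is not his own neighbor
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]

# Hypothetical vectors:
# (speed mph, usage %, release height ft, spin angle deg, spin rate rpm)
pitchers = {
    "Pitcher A": (93.0, 60.0, 6.1, 210.0, 2300.0),
    "Pitcher B": (92.5, 58.0, 6.0, 205.0, 2250.0),
    "Pitcher C": (88.0, 45.0, 5.5, 180.0, 1900.0),
    "Pitcher D": (75.0, 80.0, 6.3, 120.0, 1200.0),
}

# The "Pitcher A Family", limited to the two closest neighbors.
family = nearest_neighbors(pitchers, "Pitcher A", k=2)
for name, distance in family:
    print(f"{name}: {distance:.0f}")
```

A brute-force scan like this is fine for a few hundred pitchers; a tree-based search only starts to matter for much larger tables.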

These are not perfect right now. We haven’t weighted the scores yet (that’s another conversation over basketball), so while we do a good job representing pitch mix and style, we’re not doing a very good job capturing pitch speed yet.
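
To see why unweighted inputs can bury pitch speed, consider the sketch below. It is only one possible approach, not something the site does today: it z-scores each feature and then applies per-feature weights. The numbers, the `standardize` and `weighted_distance` helpers, and the choice of weights are all hypothetical.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column so no feature dominates because of its units."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 if a column is constant, to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

def weighted_distance(a, b, weights):
    """Euclidean distance with a separate weight on each squared difference."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical rows: (speed mph, spin rate rpm)
raw = [(95.0, 2200.0), (85.0, 2210.0), (94.5, 2300.0)]

# On raw values the 100 rpm gap swamps the 10 mph gap.
raw_to_slow = math.dist(raw[0], raw[1])    # 10 mph slower, similar spin
raw_to_spinny = math.dist(raw[0], raw[2])  # similar speed, 100 rpm more spin

scaled = standardize(raw)
weights = (2.0, 1.0)  # assumption: count speed twice as heavily as spin
fixed_to_slow = weighted_distance(scaled[0], scaled[1], weights)
fixed_to_spinny = weighted_distance(scaled[0], scaled[2], weights)
```

On the raw numbers the pitcher throwing 10 mph slower looks like the closer match; after scaling and weighting, the pitcher with similar velocity does.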

There are also a few pitchers with hardly any comparables. Matt (@HouseOfTheBB) noted that neither Roy Halladay nor Mariano Rivera has a comparable pitcher at all!

Punch in a few pitchers, and let us know how our system is doing. Let us know over Twitter if we’ve really missed on someone. I’m @brooksbaseball, and Harry is @harrypav.

10 years ago

Had an auction-keeper draft last week. 10 team, AL only, 12 fielders, 8 pitchers (field gets thin fast). My last pick was Vargas at $1. I plugged him into your system and his closest match is Lester(283). Is this a good sign? How could this be used for a fantasy edge?

Harry Pavlidis
10 years ago

Interesting idea, but none of this is directly tied to performance. Naturally speed and movement are correlated with it, but they are not the end of the story. Adding in pitch mix, height and release point is intended to bring the matches closer than stuff alone would. It’s also based on a single year and a work in progress.

So, that said …. 283 is pretty good, but the closest I’ve been finding are 120-140. 400 is pretty far apart IMO, but the 200s are a common area. We’ll see how things develop.

Nick Steiner
10 years ago

Have you guys read this?

How are you calculating the sim scores?

Harry Pavlidis
10 years ago

Yep, good to be reminded: pitch location is not a variable, but it should be, split by batter hand. Oh, Dan…

CJ in Austin
10 years ago

The closest I’ve found is Roy Oswalt and Zack Greinke at 97.  One of the oddest: most similar for Jordan Lyles is Livan Hernandez…the young and the old.

Harry Pavlidis
10 years ago
Max Marchi
10 years ago

Having Mo with no comps seems to me a hit for the algorithm.

Dan Brooks
10 years ago

The algorithm is the Matlab KNN Function with some Minkowski metric as the distance function. So, really, the fact that Mo has no comps is more a win for the quality of the data than anything else. =)
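
For readers unfamiliar with the term, a Minkowski metric is a family of distances controlled by an exponent p. The sketch below is a plain Python stand-in, not the MATLAB call itself; the comment above does not say which exponent is used, so the values of p shown are only examples, and the `minkowski` helper is made up for illustration.

```python
def minkowski(a, b, p=2.0):
    """Minkowski distance between two equal-length vectors.

    p=1 gives city-block distance and p=2 gives ordinary Euclidean distance.
    """
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a = (0.0, 0.0)
b = (3.0, 4.0)

city_block = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2)
```

Larger exponents put more emphasis on the single feature where two pitchers differ most.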

10 years ago

Have you considered putting all these data into an ordination like principal components analysis or nonmetric multidimensional scaling? For what it’s worth, I generated an NMDS ordination based on pitch usage data only.

10 years ago

Pretty cool, but it fails the knuckleball test:


Wake:

Pitcher            Distance
Michael Ekstrom    1964
Craig Kimbrel      1964
Evan Meek          1964
Anthony Slama      1966
Chris Scholl       1969

Dickey:

Pitcher            Distance
Parker Frazier     1863
Matt Daley         1864
Erik Hamren        1869
Donn Roach         1992
Kyle Waldrop       2015

When I saw Wake’s clustered in the 1960s I wondered if that was near a max distance, but Dickey blows him out of the water.

Harry Pavlidis
10 years ago

Wake and Dickey throw knucklers at different speeds (Wake also has more than one speed but rarely throws the super slow … excuse me, rarely threw the super slow). Wake also threw slower fastballs than Dickey’s sinker, and Dickey throws changes.
That said, the weightings are going to be adjusted and we may add inputs (I already gave Dan all the data split by batter hand and threw in some pitch location stuff just to play with).

10 years ago

I would be interested to see what kind of year-to-year numbers this would spit out. It would give an idea of how much an individual pitcher changed from year to year. An even-odd comparison, to see how similar a pitcher is to himself, would also show how good this metric is.