Introducing the Rate Similarity Score

Bill James introduced the concept of similarity scores by comparing career totals in games played, at-bats, runs, hits, and other offensive stats. Most of the stats used are counting stats, and therefore players can be similar only if they are similar hitters and have similar numbers of plate appearances.

This is fine as far as it goes, but it doesn’t identify similar batters with very different career lengths. So what if you made a similarity score based purely on rate stats? It should identify similar offensive players, no matter how many games they played.

Enter the Rate Similarity Score (RSS), which compares the rates of singles, doubles, triples, home runs, walks, hit by pitches, and strikeouts per plate appearance, and the number of stolen bases divided by the number of singles plus walks plus hit by pitches. This last ratio is meant to crudely estimate how often a player attempted to steal a base when given the opportunity, without using play-by-play data, which is not available for every player.
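As a rough illustration, the eight rates described above could be computed from career totals along these lines. This is only a sketch: the dictionary keys (PA, H, 2B, and so on) are a naming scheme assumed here, and the career totals in the example are made up, not those of any real player.

```python
def rate_profile(totals):
    """Return the eight RSS rate stats from a dict of career totals.

    Assumed keys (hypothetical naming, not from the article):
    PA, H, 2B, 3B, HR, BB, HBP, SO, SB.
    """
    pa = totals["PA"]
    # Singles are the hits that were not doubles, triples, or home runs.
    singles = totals["H"] - totals["2B"] - totals["3B"] - totals["HR"]
    # Crude count of how often the player ended up on first base.
    on_first = singles + totals["BB"] + totals["HBP"]
    return {
        "1B": singles / pa,
        "2B": totals["2B"] / pa,
        "3B": totals["3B"] / pa,
        "HR": totals["HR"] / pa,
        "BB": totals["BB"] / pa,
        "HBP": totals["HBP"] / pa,
        "SO": totals["SO"] / pa,
        # Stolen bases per time on first, the stand-in for steal attempts.
        "SB": totals["SB"] / on_first if on_first else 0.0,
    }


# Made-up career totals, for illustration only.
example = {"PA": 6000, "H": 1500, "2B": 300, "3B": 50, "HR": 150,
           "BB": 600, "HBP": 40, "SO": 900, "SB": 164}
profile = rate_profile(example)
```

Note that the stolen base rate uses a different denominator from the other seven, which is why it is handled separately.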

The RSS is determined by first finding the differences between the rates of two players in the above categories, each divided by the typical maximum rate difference for that stat. Then a root mean square difference (RMSD) is found over the eight scaled differences; the RSS is 1 minus the RMSD. Two identical players would have an RMSD of 0 and therefore an RSS of 1, while two extremely dissimilar players would have an RSS near 0, although in practice the minimum RSS is not too far below 0.400. The RMSD can be thought of as the distance between players in an eight-dimensional space of rate stats.
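Written out, the definition looks roughly like this. The symbols are notation assumed here for illustration: r_{A,i} and r_{B,i} are the rates of players A and B in category i, and M_i is the typical maximum rate difference for that category.

```latex
d_i = \frac{r_{A,i} - r_{B,i}}{M_i}, \qquad
\mathrm{RMSD}(A,B) = \sqrt{\frac{1}{8}\sum_{i=1}^{8} d_i^{2}}, \qquad
\mathrm{RSS}(A,B) = 1 - \mathrm{RMSD}(A,B)
```

If every scaled difference d_i were at its maximum magnitude of 1, the RMSD would be 1 and the RSS would be 0, which matches the range described above.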

Rates have been used to determine the similarity of players in The Hardball Times articles by Chris Jaffe on a season-by-season basis and by Josh Kalk for pitchers using PITCHf/x data, and by Baseball Prospectus as part of its PECOTA rating system. I like RSS because it is relatively simple, uses career data, and provides a nice complement to the traditional similarity score.

The maximum rate differences used to normalize the rate differences between two players are: .160 for singles, .053 for doubles, .031 for triples, .076 for home runs, .180 for walks, .051 for HBP, and .330 for strikeouts (all using the rate per PA), and .460 for stolen bases divided by singles plus walks plus HBP. These values were taken from the maximum rate differences for all batters with at least 3000 PA, rounded to two significant figures.
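Putting the normalizing values to work, a minimal sketch of the calculation might look like the following. The maximum differences are the ones listed above; the category labels, the function name rss, and the two example players are assumptions made up for illustration.

```python
import math

# Maximum rate differences from the article (per PA, except SB,
# which is per single plus walk plus HBP).
MAX_DIFF = {"1B": 0.160, "2B": 0.053, "3B": 0.031, "HR": 0.076,
            "BB": 0.180, "HBP": 0.051, "SO": 0.330, "SB": 0.460}


def rss(a, b):
    """Rate Similarity Score between two rate profiles (dicts keyed like MAX_DIFF)."""
    # Scale each rate difference by the maximum difference for that stat.
    scaled = [(a[k] - b[k]) / MAX_DIFF[k] for k in MAX_DIFF]
    # Root mean square of the eight scaled differences.
    rmsd = math.sqrt(sum(d * d for d in scaled) / len(scaled))
    return 1.0 - rmsd


# Two hypothetical players who differ only in home run and strikeout rates.
player_a = {"1B": 0.170, "2B": 0.045, "3B": 0.008, "HR": 0.020,
            "BB": 0.080, "HBP": 0.005, "SO": 0.100, "SB": 0.060}
player_b = dict(player_a, HR=0.035, SO=0.150)
score = rss(player_a, player_b)
```

Because the differences are squared, the score is symmetric: rss(a, b) and rss(b, a) give the same value.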

Using this definition of RSS, the most similar players among those with 5,000 PA or more through the 2011 season (915 total players) are (# represents the similarity rank):

# Player Best match RSS
1/2 MiltStock/DaveCash DaveCash/MiltStock 0.97897
3/4 OssieBluege/PhilRizzuto PhilRizzuto/OssieBluege 0.97615
5/6 BuddyMyer/GeorgeGore GeorgeGore/BuddyMyer 0.97376
7/8 JimmyWilliams/GeorgeWood GeorgeWood/JimmyWilliams 0.97311
9/10 ChrisSpeier/JimSundberg JimSundberg/ChrisSpeier 0.97125
11 EdKonetchy JimmyWilliams 0.9712
12/13 ChrisChambliss/WillieMontanez WillieMontanez/ChrisChambliss 0.97019
14/15 LeoDurocher/HodFord HodFord/LeoDurocher 0.96965
16 KeithMoreland WillieMontanez 0.96952
17 SparkyAdams MiltStock 0.96924
18/19 TommyCorcoran/GeorgeCutshaw GeorgeCutshaw/TommyCorcoran 0.96899
20/21 RobertoAlomar/BarryLarkin BarryLarkin/RobertoAlomar 0.96851
22/23 RoyMcMillan/ChicoCarrasquel ChicoCarrasquel/RoyMcMillan 0.96837
24/25 LukeAppling/StanHack StanHack/LukeAppling 0.96825
26/27 BobbyBonilla/FredLynn FredLynn/BobbyBonilla 0.96795
28/29 LarryBowa/BonesEly BonesEly/LarryBowa 0.96745
30/31 DaveWinfield/ReggieSmith ReggieSmith/DaveWinfield 0.96727
32/33 RondellWhite/BrianJordan BrianJordan/RondellWhite 0.9666
34/35 RonSanto/VicWertz VicWertz/RonSanto 0.96647

Double entries occur when a player is the best match for the person who is their best match.
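Finding those double entries amounts to checking whether a best match is reciprocated. Here is a small sketch, assuming the scores are already held in a nested dict (player to other player to RSS); the four players and their scores are invented for the example.

```python
def best_matches(scores):
    """Map each player to the other player with the highest RSS."""
    return {p: max(others, key=others.get) for p, others in scores.items()}


def mutual_pairs(scores):
    """Return sorted pairs of players who are each other's best match."""
    best = best_matches(scores)
    return sorted({tuple(sorted((p, q))) for p, q in best.items() if best[q] == p})


# Invented, symmetric scores for four hypothetical players.
scores = {
    "A": {"B": 0.97, "C": 0.90, "D": 0.80},
    "B": {"A": 0.97, "C": 0.95, "D": 0.85},
    "C": {"A": 0.90, "B": 0.95, "D": 0.88},
    "D": {"A": 0.80, "B": 0.85, "C": 0.88},
}
best = best_matches(scores)
pairs = mutual_pairs(scores)
```

In this toy data A and B form a double entry, while C gets a single entry: its best match is B, but B's best match is A. That is the same situation as Ed Konetchy in the table above.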

Note that RSS doesn’t account for length of career, different eras (although in principle it could if you used neutralized stats), defensive position, or defensive ability. For example, Phil Rizzuto actually had fewer PA than Ossie Bluege, but amassed more WAR (30.8 versus 24.7 according to baseball-reference.com) in that shorter time, played in a less offensive era, and was also MVP one year; Rizzuto, of course, ended up in the Hall of Fame. The Alomar/Larkin pairing is interesting, since they are both middle infielders and both were inducted into the Hall of Fame. Their main differences are that Alomar had 1,343 more PA, and, as a compensating factor, Larkin played a more demanding position (shortstop versus second base).

Bobby Bonilla and Fred Lynn had very similar career lengths and similar slash stats (AVG/OBP/SLG): .279/.358/.472 and .283/.360/.484, respectively. Lynn was Bonilla’s most similar using the standard similarity score—baseball-reference.com version—and Bonilla was Lynn’s second most similar. If there were no positional points for the standard similarity score, Bonilla very likely would have been Lynn’s most similar. You could just use slash stats to determine a rate similarity, but RSS has more detail since it considers doubles, triples and home runs separately (rather than simply total bases), differentiates between walks and hit by pitches, and also includes stolen bases and strikeouts.

The above table has a number of other interesting comparisons. Most similar by RSS, Winfield and Smith were actually most similar by the standard similarity test at ages 27, 29 and 30, but Winfield went on to a much longer career (12358 PA versus 8051). Otherwise Smith compares favorably with Winfield, including being a much better defender. Stan Hack was most similar to Luke Appling, except that Appling had about 20 percent more PA. Recent Hall of Fame inductee Santo had 1958 more PA than Wertz, and played in an era with less offense.

A highly unusual player according to his rate stats (among the 915 players with at least 5000 PA) is Billy Hamilton; he had the lowest RSS for a most similar player (Eddie Collins, .856683). Others whose most similar player had a low RSS were Otis Nixon (Dave Collins, .87250), Barry Bonds (Babe Ruth, .87476), Ted Williams (Babe Ruth, .87730), Hughie Jennings (Tommy Tucker, .87831), Rickey Henderson (Joe Morgan, .88203), and Mark McGwire (Harmon Killebrew, .88644). For all but Jennings, Stovey and Greenberg, the player with the highest RSS for these players also shows up in their Top 10 most similar list, according to the standard similarity score.

Other interesting pairs who were most similar to each other using RSS:

# Players RSS
46/47 HaroldBaines/RichieZisk 0.96592
52/53 AndruwJones/JeromyBurnitz 0.96552
82/83 BrooksRobinson/BuddyBell 0.96262
91/92 LouWhitaker/FrankieHayes 0.96214
96/97 JohnnyDamon/AmosOtis 0.96202
113/114 RobinYount/AlCowens 0.96087
116/117 AdrianBeltre/VernonWells 0.96085
132/133 IvanRodriguez/GarretAnderson 0.9598
162/163 EddieMurray/KentHrbek 0.95898
166/167 KenBoyer/BillWhite 0.95881
176/177 BernieWilliams/WillClark 0.9581
184/185 RonCey/RicoPetrocelli 0.95771
209/210 FelipeAlou/TonyOliva 0.95707
230/231 GaryMatthews/BobbyMurcer 0.95584
255/256 JoeTorre/BobWatson 0.95445
263/264 FredMcGriff/DavidJustice 0.95397
276/277 RaulIbanez/AubreyHuff 0.95349
278/279 JimRice/BillSkowron 0.95347
287/288 JohnnyBench/TinoMartinez 0.95316
304/305 CurtFlood/ThurmanMunson 0.95257
309/310 CarlYastrzemski/AlvinDavis 0.9524
314/315 TedSimmons/SmokyBurgess 0.95228
323/324 BobbyRichardson/GlennBeckert 0.95198
327/328 MoisesAlou/MagglioOrdonez 0.95181
332/333 ReggieJackson/GregVaughn 0.95167
337/338 DwightEvans/J.D.Drew 0.95156
358/359 JeffBagwell/LanceBerkman 0.95086
366/367 EddieMathews/WillieMcCovey 0.95056
380/381 WillieRandolph/JimGilliam 0.94979
385/386 PatBurrell/TroyGlaus 0.94976
388/389 JasonVaritek/CaseyBlake 0.94964
416/417 RichieSexson/TonyClark 0.94885
420/421 MickeyRivers/RalphGarr 0.94865
445/446 JimEdmonds/DannyTartabull 0.94771
484/485 BobbyGrich/DarrellPorter 0.94691
506/507 RockyColavito/HankSauer 0.94621
534/535 ToddHelton/EdgarMartinez 0.94483
574/575 YogiBerra/TedKluszewski 0.94266
644/645 CapAnson/TonyGwynn 0.9381
651/652 CarlosDelgado/MarkTeixeira 0.93792
657/658 HonusWagner/EdDelahanty 0.93772
672/673 LouBrock/MookieWilson 0.93695
692/693 DonMattingly/FrankMcCormick 0.93613
701/702 EllisBurks/CarlosBeltran 0.9356
744/745 TimRaines/KenLofton 0.93305
764/765 FrankRobinson/GarySheffield 0.93042
864/865 DanBrouthers/JoeJackson 0.91766
888/889 VinceColeman/OmarMoreno 0.90851

In some of the cases shown here a most similar player is in the Hall of Fame, or often discussed as a possible candidate, while the other is not. Usually it is a difference in career length (number of PA) or fielding position that accounts for our different assessments of two players who were actually very similar offensively, as measured by their RSS. Although RSS does not include a positional factor (unlike the traditional similarity score), in many cases two most similar players did play the same or similar positions, as can be seen in the above tables. This should not be too surprising since a given field position is often manned by similar players. Also interesting is that in a few cases the most similar player was a teammate: Boyer/White, McGriff/Justice and Bagwell/Berkman.

The two most dissimilar players (lowest RSS) with 5000 or more PA are McGwire and Jennings, with an RSS of .37948. The next two smallest similarities also involve McGwire with Willie Keeler (.38546) and Joe Jackson (.38624)—not a lot of home runs or strikeouts and a lot of triples and stolen bases are the major distinguishing characteristics when comparing to McGwire.

The players who were the least similar match for the largest number of other players are:

Player              # most dissimilar to
MarkMcGwire                 526
HughieJennings              293
VinceColeman                 63
WillieKeeler                 23
BuckEwing                    8
BillyHamilton                1
JoeJackson                   1

Clearly McGwire had a very unusual batting profile. Adam Dunn was the second most dissimilar to 395 players, and he was the third most similar to McGwire.

The player whose least-similar RSS was the highest was Brian Jordan, whose smallest RSS (.63576) was with Vince Coleman. Jordan might be considered the most average player, not too far from anyone else. A close second was Cy Williams, with a smallest RSS of .63390, also compared to Vince Coleman.
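Both the dissimilarity counts and the "most average" player come from each player's lowest RSS. A sketch of that bookkeeping, again assuming a nested dict of scores and using four invented players rather than real data:

```python
from collections import Counter


def least_similar(scores):
    """Map each player to (least similar player, RSS with that player)."""
    return {p: min(others.items(), key=lambda kv: kv[1])
            for p, others in scores.items()}


def dissimilar_counts(scores):
    """Count how many players have each player as their least similar match."""
    return Counter(name for name, _ in least_similar(scores).values())


def most_average(scores):
    """Player whose lowest RSS is the highest, i.e. not too far from anyone."""
    worst = least_similar(scores)
    return max(worst, key=lambda p: worst[p][1])


# Invented, symmetric scores for four hypothetical players.
scores = {
    "A": {"B": 0.90, "C": 0.60, "D": 0.40},
    "B": {"A": 0.90, "C": 0.70, "D": 0.50},
    "C": {"A": 0.60, "B": 0.70, "D": 0.65},
    "D": {"A": 0.40, "B": 0.50, "C": 0.65},
}
counts = dissimilar_counts(scores)
average_player = most_average(scores)
```

Here A and D play the McGwire and Jennings roles, each being the least similar match for two players, while C plays the Brian Jordan role because its worst score is still the highest of anyone's.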

There is a lot more that hasn’t been shown here. If anyone wants to look at an Excel spreadsheet with the five most similar and five most dissimilar players for every hitter with at least 5000 PA through the 2011 season, the file may be found here. The spreadsheet also has the top five similars and dissimilars for all players with 3000 or more plate appearances.

You could include other stats such as runs and RBI in RSS, but those are more situational and are more affected by a player’s teammates. The number of PA/G could also be used, which would measure the extent to which a player was used as a starter or substitute, but in developing RSS I wanted something that compared players’ offensive performance on the field. I have calculated RSS values including runs, RBI and games, and in many cases the same names show up as most similar, but sometimes not.

Of course you can define an RSS for pitchers as well, which will be the topic of a future article. The spreadsheet linked above also contains pitching similars and dissimilars. Another possibility I have been looking at is to neutralize stats for era before calculating RSS.


5 Comments
Michael
12 years ago

Agree, add position, also would be nice to add a defense component.

David
12 years ago

Pretty cool.  How about adding park effects and era effects?

Kerry
12 years ago

As I mentioned, I have thought about era effects; they are relatively easy to account for if you were to use neutralized stats. However, those are not so easy to get from my source (baseball-reference.com)—I got all the stats I needed for my analysis in one list using the Play Index, but you can’t do that for the neutralized stats. Given the data, I would be glad to do that—it would be interesting to see how much changes.

Regarding a variable for position, I purposefully avoided that, as my intent was to find similar offensive players, regardless of position. It was interesting how often similar offensive players also had similar positions, without specifically introducing a variable for it.

But if you wanted to add a position component, I don’t really like the way the standard similarity score handles it. In bbref’s version, they assign a number to each position: C (240), SS (168), 2B (132), 3B (84), OF (48), 1B (16), DH (0). Now the relative numbers are roughly aligned with the usual rankings of how important the defensive positions are, but are hardly indicative of how similar players are defensively. After all, catchers almost never also play middle infield (Craig Biggio being a notable exception), but they often do play outfield or corner infield or DH. But C is as far as you can get from 1B and DH on the position similarity ratings.

For me it would be better to make a study of how likely it is for players to play one position if they also played another, and somehow incorporate that into the RSS framework (i.e., two players would get a position similarity score between 0 and 1). That would make more sense to me—what do you think?

kds
12 years ago

Fun stuff.  Adrian (Anson) is rolling in his grave.
I think it would be useful to include position as a 9th variable in your vector space.

Bob B
12 years ago

That’s pretty awesome.
I think my favorite was the bit at the end with most frequent least-like list. I’m not sure why, but I totally think that kicks ass!