Databases for sabermetricians, Part One by Colin Wyers November 24, 2009 Stop me if this sounds familiar. You have a question about baseball, the sort of thing that should be answerable with stats, you think. So you go looking for the stat, but you can’t find anyone who computes it the way you want. So you find a site that (maybe) has the raw data you need, copy and paste it into Excel, and bang out a few formulas until you get a result. And all along the way, you’ve thought: There has to be a better way. I’m here to tell you that in fact there is—the relational database. And I’m here to help. First, a warning: I am not a professional database administrator. I am a hobbyist. I may not follow “best practices.” I will not be explaining a lot of relational database theory or lecturing you on the importance of the normal forms. Frankly, I don’t know too much about that myself. And if you have your own database setup and workflow already, your setup may differ from mine. I hope you can still derive some value from this. Also, I run Windows, so this tutorial will be written for Windows users. If you use a Mac, the people at Beyond the Boxscore have some helpful hints. Most modern databases use something called the Structured Query Language, or SQL. It is, yes, a programming language. But please, don’t let that scare you. It’s simply a way for you to "talk" to a database and ask it questions. We’ll get into that more in a minute. First, we need to set up a database. Installing the software There are many databases to choose from. I use MySQL. It’s free, which is nice. And there’s a large user base, which means there are plenty of tools and tutorials for you to look at. First, you need to download and run the installer for MySQL 5.4 CE Essentials. Choose a “typical” install. You’ll typically have to wait a few minutes for files to copy. Then there will be some ads. Skip over them. Make sure that “Configure the MySQL server now” is checked. (I uncheck the box about registering the software with MySQL, but that’s up to you.) Chose “Detailed configuration.” Then you’ll go through a set of pages asking you questions. Here are the answers: Developer Machine Non-Transactional Database Only Decision Support (DSS)/OLAP Continue on, using the defaults. When prompted, either add a password or uncheck the box to alter security settings. (This will leave your root password blank.) Once you’re finished, you shouldn’t have to start MySQL at all—it should be ever vigilant, waiting for you to command it to do your sabermetric bidding. (If you get an error message that says "This installation package could not be started. Contact the application vendor to verify that this is a valid Windows Installer package," you need to download and install Windows Installer.) Again sticking with the theme of “absolutely free,” I use SQLyog. You want the “Windows-Binary” from the Community Edition. Installation is straightforward—just download it and click “Next” a few times. Then run SQLyog. From there, you’ll be presented with a screen asking you for the settings to connect to the database. Everything should be correct. If you selected a password, type it into the password blank. Then hit “Connect.” And… voila! You’ve been connected to, well, to an empty database. I suspect you want some actual data in your database. Putting the “data” in database There are obviously a lot of different things you could want to have in your database, depending on your area of interest. Let’s start off modestly, if not exactly small. The Baseball Databank is a record of pretty much every official statistic for every baseball player in the history of the majors. (Right now, it’s current through 2008; the latest update is expected soon after the awards voting is finished.) First, download the ZIP file. Extract the .SQL file from the ZIP, and remember when you put it. Now go into SQLyog. In SQLyog there will be three panes. The left-hand side should list your databases, the top is where you write queries, and the bottom shows the results. Let’s add a database. Right-click in the pane, and select “Create Database.” Call it “BDB,” leaving off the punctuation. Leave the rest as the defaults, and click okay.A Hardball Times Updateby RJ McDanielGoodbye for now. Now, let’s populate the database. Right-click on the database, and select “Restore from SQL Dump.” There should be an button with an elipsis (…) in it. Click that, and select the file we extracted from the ZIP. Select Execute. It’ll ask you if you want to execute the queries in database BDB. You do. This should take a little while. Once that’s done, go into your database and click on the folder that says Tables. If nothing is there, try hitting “F5” on the top row of your keyboard to refresh the browser. Go ahead and click on batting, and then slide on over to the right and click on “Table Data.” Scroll around for a bit. If you’ve ever used a spreadsheet or looked at a table before, the layout should feel familiar. Now it’s time to play around with the database a little. Asking the database questions The fundamental unit of the database is the table, which is further broken down into the row and the column. Each row is a “record” in the database; you can have as many records as you like. Every row has several fields which contain the data you’re interested in. A column is a group containing all of the fields of a certain type in the table. You won’t be entering formulas into cells, like you would a spreadsheet. You interact with the data in the spreadsheet by writing queries. Let’s start off by looking up the hitting stats of the 1990 Royals. Slide up into the query editor and type in this: SELECT * FROM batting WHERE yearID = "1990" AND teamID = "KCA" There is a green play button in the upper left-hand corner; press it. (Or just hit F9.) Most of the column headers should be pretty self-explanatory: H is hits, SF sacrifice flies, etc. And playerID is each player’s unique identifier, taken from the first five letters of the last name and the first two letters of the first name, plus a number. Some are easier to figure out than others. To explain: SELECT is your bread-and-butter SQL command; it tells the database to fetch records out of the table. * is a wildcard; it tells the database to fetch all columns. FROM tells the database what table to fetch from. WHERE tells the database to filter the information. AND means you want the database to match multiple columns. The semicolon tells the database that our instruction is complete. The equal sign is one way to compare two values in SQL. The quotation marks are important; if you don’t use quotation marks, MySQL will assume you’re referring to a column of that name. In some results you’ll see a value of (NULL); that means the field is empty. In the batting table, NULLs are generally used to indicate: Players who appeared in a game but had no official plate appearance (generally relief pitchers, pinch-runners and defensive replacements). Statistics that were not tracked that year (generally sacrifice flies, double plays, etc.). In this case, we really only want to see players who went to the plate. And let’s say we also want to know the “sabermetric triple crown” stats of batting average, on-base percentage and slugging percentage. (Tango’s sabermetric wiki is a great source of formulas to try out, incidentally.) We can add to our query a bit: SELECT *, H/AB AS AVG , (H+BB+HBP)/(AB+BB+HBP+SF) AS OBP , (H+2B+2*3B+3*HR)/AB AS SLG FROM batting WHERE yearID = "1990"AND teamID = "KCA"AND AB IS NOT NULL; You don’t have to break up the different lines; I do that for legibility. You do want to separate each thing you want to select by commas. I put the commas at the start of each individual line, but that’s a personal style preference. SQL simply ignores carriage returns (that is to say, line breaks), tabs and extra spaces, so feel free to “pretty up” your code in a way that makes it easy for you to read. SQL will do any simply arithmetic with the records you provide. (You can also do a lot more complex math with it if you really want to.) The AS command is very helpful—otherwise the column will end up being named “(H+BB+HBP)/(AB+BB+HBP+SF),” which simply isn’t very convenient. Let’s say we don’t want every column, we just want a few columns: SELECT playerID, AB, H, 2B, 3B, HR, BB , SO, IBB, HBP, SH, SF , H/AB AS AVG , (H+BB+HBP)/(AB+BB+HBP+SF) AS OBP , (H+2B+2*3B+3*HR)/AB AS SLG FROM batting WHERE yearID = "1990" AND teamID = "KCA" AND AB IS NOT NULL; Bored with the 1990 Royals yet? Let’s try something different. SELECT playerID, yearID, teamID, HR FROM batting ORDER BY HR DESC LIMIT 50; ORDER BY tells the database to sort the list; Desc tells it to go from large to small. (Asc is the corresponding command to sort from small to large.) LIMIT tells SQL to return only the first 50 records. So that’s the top 50 home run seasons. (Supply your own asterisks where you feel it necessary.) But what if we want career totals? SELECT playerID, SUM(HR) AS HR FROM batting GROUP BY playerID ORDER BY HR DESC LIMIT 50; The SUM command will sum everything in the column unless you stop it. GROUP BY tells the database to combine everything based upon player ID. Now what if want to look at the top 50 career batting averages? We obviously don’t want to see some pitcher who lucked his way into going 2-for-3 and never saw the majors again. So let’s say we want a minimum 3,000 career plate appearances. SELECT playerID , SUM(H)/SUM(AB) AS AVG , SUM(AB+BB+HBP+COALESCE(SF,0)) AS PA FROM batting GROUP BY playerID HAVING PA >= 3000 ORDER BY AVG DESC LIMIT 50; Because we’re using the SUM function, we have to use the HAVING function instead of the GROUP BY function. Note the >= – that’s greater than or equal to. So, what’s that COALESCE doing there? Remember, sacrifice flies weren’t officially recorded until 1954. For years prior, those values are NULL in the database. Trying to sum a NULL value chucks everything out. Using the COALESCE function in this way tells the database to substitute 0 in the place of a NULL. Otherwise, players like Ty Cobb would be excluded from the list (precisely the sorts of players we don’t want excluded!) Now let’s try writing a query to list the 50 lowest career earned run averages, minimum 1,000 innings pitched. The Databank doesn’t list IP for pitchers, simply outs. Remember to divide by three: SELECT playerID , SUM(ER)/SUM(IPOuts/3)*9 AS ERA , SUM(IPOuts/3) AS IP FROM pitching GROUP BY playerID HAVING IP >= 1000 ORDER BY ERA ASC LIMIT 50; If you run this query, you’ll have a lot harder time recognizing the top leaders than the top batting average leaders. A lot of these guys were deadball-era pitchers. So let’s find these guys’ names. Working with joins, a first glance Often, all the data you want to use won’t be stored in a single table—you’ll have some data in one table, and some data in another. That’s where joins come into play. A join lets you hook two (or more) tables together for querying. There are two ways to do a join: explicitly and implicitly. When I say an explicit join, I mean a join where you use the JOIN command. An implicit join is one where MySQL is just supposed to figure out that you want it to join. For now, we’ll start off with the implicit join. In this case, what we want are the first and last names out of the master table. In this case, what we’re really after is a player’s full name. So let’s put that together: SELECT CONCAT(nameFirst," ",nameLast) AS Name, playerID FROM MASTER; CONCAT is a fun command for string manipulation. It lets you take multiple strings and combine them. In this case, we took the first name, a space (surrounded by quotes) and a last name, separating them with commas. CONCAT does the rest. Instead of joining to the entire master table, we’re going to join to the results of that query using an alias. In SQL, you use the period separator to indicate what table you want to take data from, like so: table.column where table is the name of the table and column is the name of the column. You can also assign an alias to a query when you use it in the FROM command, like so: SELECT p.playerID , m.Name , SUM(p.ER)/SUM(p.IPOuts/3)*9 AS ERA , SUM(p.IPOuts/3) AS IP FROM pitching p , (SELECT CONCAT(nameFirst," ",nameLast) AS Name , playerID FROM MASTER) m WHERE p.playerID = m.playerID GROUP BY playerID HAVING IP >= 1000 ORDER BY ERA ASC LIMIT 50; So now, instead of seeing that walshed01 is the career ERA leader, we can see that it’s Ed Walsh. It’s an improvement, right? The key is in the WHERE clause, which matches up a column from one table with the other. The equivalent explicit join is: SELECT p.playerID, Name, ERA, IP FROM (SELECT playerID , SUM(ER)/SUM(IPOuts/3)*9 AS ERA , SUM(IPOuts/3) AS IP FROM pitching GROUP BY playerID HAVING IP >= 1000) p JOIN (SELECT CONCAT(nameFirst," ",nameLast) AS Name , playerID FROM MASTER) m ON p.playerID = m.playerID ORDER BY ERA ASC LIMIT 50; You want to be careful when writing the WHERE or ON clause of a join—if you aren’t careful, you’ll end up doing what’s called a Cartesian join, or essentially joining every row in one table to every row in another. For those of you keeping score at home, that would mean joining 39,016 rows from the pitching table to 17,264 rows from the master table, giving you… 673,572,224 rows as a result. You do not want that. We’re far from done with joins—they’re quite possibly the most powerful (and most dangerous!) tool in your SQL arsenal. But for right now let’s move along to some other things. We’ll be back. Dealing with data types I want you to see this at least once before it creeps up on you by accident. I do not, on the other hand, want to be the sort of tutorial writer who hands you a code sample and then, when it doesn’t work, shouts “Gotcha!”; So I’ll be clear with you: This doesn’t work right. Let’s say you want the top 50 player seasons with more runs scored than runs batted in. Try running this: SELECT playerID , yearID , teamID , R-RBI AS R_RBI FROM batting ORDER BY R_RBI DESC LIMIT 50; We are being lied to here. Nobody, but nobody, has 18,446,744,073,709,551,615 more runs than RBI in a single season. I don’t even know how to say that number. The problem is with data types. SQL has an awful lot of data types, but we’re only going to look at a few of the more common ones. Text is generally stored as “varchar,” or “Variable Character Field.” What that means is simply that you can store any length of text in there up to the size of the field. There are some other text formats, but nothing to really be worried about right now. There’s a much greater diversity of number formats, which can be broken down into three types: Integers, or whole numbers. Decimals, for precise fractional values. Floats, for approximate fractional values. These can be further subdivided: you’ll see values like “tinyint”; or “smallint”; thrown around, for example, as ways of describing the size of the number you can store in that column. There’s also the issue of signs, which becomes important here. For SQL, what we mean by signs is the plus/minus sign—whether a number is positive or negative. You can store numbers in MySQL either as “signed”; or “unsigned”;. Unsigned values cannot be negative. The Baseball Databank stores all numbers as unsigned. (Why? Because it saves storage space and time.) So in order for you to deal with unsigned numbers where values may be negative, you may have to change the datatype before doing any calculations. We do that with a CAST function: SELECT b.playerID , b.yearID , b.teamID , CAST(R-RBI AS SIGNED) AS R_RBI FROM batting b ORDER BY R_RBI DESC LIMIT 50; The heart of the CASE There are times when you want your query to be able to make some decisions on its own. Let’s say , for instance, that we want to look at pitchers between the ages of 25 and 29, post WWII. Remember, a player’s age is generally listed as his age come July 1. So, a simple little query here: SELECT p.playerID , m.Name , p.yearID , (CASE WHEN m.birthMonth < 7 THEN ( p.yearID - m.BirthYear ) ELSE ( p.yearID - m.BirthYear - 1 ) END) AS Age , p.ER/p.IPOuts/3*9 AS ERA , p.IPOuts/3 AS IP FROM pitching p , (SELECT CONCAT(nameFirst," ",nameLast) AS Name , playerID , birthMonth , birthYear FROM MASTER) m WHERE p.yearID > 1943 AND p.playerID = m.playerID GROUP BY playerID HAVING Age BETWEEN 25 AND 29 ORDER BY ERA ASC LIMIT 50; The CASE statement allows you to have different outputs based upon the underlying data. This could have easily been written as an IF statement as well: , IF(m.birthMonth < 7 ,( p.yearID - m.BirthYear ) ,( p.yearID - m.BirthYear - 1 )) AS Age I like CASE because it feels cleaner, personally. You can also use multiple WHEN statements inside a CASE. Saving your work Let’s say you’ve written an amazing, bad-ass query. You know you’re going to want to look at this one later. You have two options: CREATE VIEW view_name AS CREATE TABLE table_name AS Simply put either one of those at the front of your query. So what’s the difference? A view: Updates whenever the underlying data does. Preserves the query used to generate it. Creating a table: Does not update to incorporate changes in data. Does not run the query every time it’s called. A view is more flexible; a table is quicker. Also, I recommend creating a new database to store tables in—multiple databases if you have multiple projects. So let’s say we wanted to start a new database for a project that does Marcels projections. Right-click in the pane on the left and select “Create Database.” Call it Marcels and click on it to make it the active database. To run queries on tables in the Baseball Databank, simply prefix the names of the tables with bdb., like so: bdb.batting bdb.pitching Well after that little slip, it would hardly be fair to send you on your way without some Marcels projections, wouldn’t it? First off, I just want to thank Tangotiger, who came up with the Marcels, and Sal Baxamusa, whose spreadsheet versions of the Marcels were invaluable as learning aids for me. I tried my best to reimplement the Marcels as faithfully as possible, but there are (likely) differences between Tango’s work and mine. All credit goes to him and Sal, all blame for mistakes and errors goes to me. Here’s the code for hitter projections. Have fun with it, play around with it. Try to make it better, try to break it, try to take it apart and see how it works. That should be enough to get you started. If you have questions, don’t be afraid to ask! And most of all, have fun. References & Resources If all of this seems a little overwhelming, can I recommend the Baseball Reference Play Index? Thanks to the Simple-Talk Code Prettifier, which I used to format the code samples in the article. Some helpful resources: The official MySQL reference manual W3Schools SQL Reference Tizag’s MySQL tutorial FirstSQL’s SQL tutorial A great visual breakdown of the different types of joins possible. Links to several different MySQL cheat sheets Some helpful mailing lists: RetroSQL, which is focused on Retrosheet play-by-play data but also covers general baseball-related database questions, and the Baseball Databank mailing list. BaSQL is a website I set up to share code samples. There should be a lot of resources for you to try there. And if you come up with code you’d like to share, please feel free to post it. Ready to move up to a full play-by-play database? Try this on for size.