Baseball Coding with Rust – Intro

This new(ish) programming language is an alternative to other programming languages. (via Public Domain)

A big thank you to Carol (Nichols || Goulding) for providing feedback on the Baseball Metaphor section. Carol co-wrote the Rust Book and runs the yearly Rust Belt Rust conference.

From time to time, major league teams will post job offers on FanGraphs. Most of these postings, if not all of them, ask for a level of proficiency in Python or R. While these languages have built up tremendous ecosystems, especially for data science, they are limited in the amount of data they can handle.

This is not a flaw in either language, but rather a design choice. Without getting into the weeds too much about language theory, each language plants itself somewhere on the performance/ease-of-use spectrum. Nothing in today’s piece should be construed as a critique of Python or R. Quite the contrary. Python and R are the bedrock languages of the data science world.

Today, I would like to introduce you to Rust, a modern systems programming language that aims to be, in their words, “A language empowering everyone to build reliable and efficient software.” I can personally attest to this being the case.

I am not a programmer, let alone a systems programmer. The last time I coded something was creating an AI snake for nibbles.bas. If you know what that is, you’ll have a sense for how long ago that was. Late last year, I decided to teach myself a programming language that would allow me close access to the hardware and enable me to build my own personal baseball data infrastructure. It was critical that the chosen language would perform at the level of C/C++.

Why not just learn C, you ask? The challenge with C is that you need years of experience developing in it before you can write programs that don’t have serious errors. Everything I had read about C told me it was a bad idea for me to learn C. It figuratively had a neon sign saying, “Amateurs Need Not Apply.”

Along my winding research path, Rust piqued my interest. At this point, being a non-programmer, the concept of “memory safety” didn’t mean much to me. In fact, I didn’t really understand what exactly the issues with C/C++ were. Rust promised the performance of C without the problems of C. I wasn’t convinced until I watched a talk by Sergio Benitez entitled “A Case for Oxidization.”

Sergio’s talk convinced me Rust was the right language for me to learn. Sergio, from the bottom of my heart, thank you. Your talk inspired me to learn Rust, and learning it has been an absolute joy and has empowered me to build things I never thought I could.

A Baseball Metaphor for Rust vs C/C++

Before diving into code, let’s discuss what makes Rust unique, specifically Rust’s ability to manage memory through static analysis (analyzing your code at compile time, before it ever runs). We’ll use an abstract baseball play to compare the Rust model vs. the C/C++ model. You’ll hear the term mutable a lot. It simply means “can be changed,” as opposed to immutable, which means “can never be changed.”
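To make mutable vs. immutable concrete, here is a minimal sketch. The function name and the numbers are made up for illustration and are not from the original article.

```rust
/// Returns a spin value after a hypothetical adjustment (numbers are arbitrary).
fn adjusted_spin() -> f64 {
    // Bindings are immutable by default: this one can never be reassigned.
    let base_spin = 2000.0;
    // base_spin = 2100.0; // uncommenting this line is a compile error

    // `mut` opts in to change.
    let mut spin = base_spin;
    spin += 200.0;
    spin
}
```

The point is that change has to be requested explicitly with `mut`; everything else is read-only unless you say otherwise.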

Let’s start by defining our ball, using Rust syntax:
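The original listing was published as an image and is not reproduced here. A minimal sketch that matches the description in the next paragraph might look like the following; the type and field names (`Xyz`, `location`, `velocity`, `spin`) and the `at_rest` constructor are assumptions made for illustration.

```rust
/// Three 64-bit floating-point components (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Xyz {
    x: f64,
    y: f64,
    z: f64,
}

/// A simple ball: where it is, how fast it is moving, and how much it spins.
#[derive(Debug, Clone, PartialEq)]
struct Ball {
    location: Xyz,
    velocity: Xyz,
    spin: f64,
}

impl Ball {
    /// Creates a motionless ball at the origin (hypothetical convenience constructor).
    fn at_rest() -> Ball {
        let zero = Xyz { x: 0.0, y: 0.0, z: 0.0 };
        Ball {
            location: zero,
            velocity: zero,
            spin: 0.0,
        }
    }
}
```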

This code defines our simple ball as having a location and a velocity, each with an (x,y,z) value as well as a spin. All of these are measured as 64-bit floating-point numbers. In a real application, we’d likely have a host of other variables. For the purpose of this metaphor, we’ll keep it simple.

Every player in the game will need mutable access to the ball. This means every player must have the ability to change the ball, specifically the velocity and spin components. How we manage this mutable access can be critically important.

In both models, we’re going to have a variable ball that is created before every play and destroyed at the end of every play.


In the C/C++ mental model, we can give all of our players mutable access to the ball at the same time. We’ll construct our logic in a way that will make sure that only one player can change the ball at any given point in time.

Let’s say the ball is fouled out of play and caught by a fan. Who is responsible for telling our game the play is over, the ball needs to be destroyed, and a new one needs to be made? We’ll need to add some logic to the game that indicates the play is over for any possible play. When “end of play” is triggered, we’ll destroy the ball and create a new one for the next play.

It is easy to imagine a scenario in which the game logic will grow sufficiently complex that we’ll forget to destroy the ball at the end of the play, ending up with multiple balls. Alternatively, we may incorrectly assume a play is dead and delete the ball prematurely, creating a dangling pointer. The chances for error increase when different people are responsible for coding the pitcher, batter, fielder, etc.

In Rust, the compiler will check our code and make sure that at most one thing can modify the ball at any point in time. We do this by moving the ball into the pitcher function or the batter function. The compiler will enforce that the ball is only ever in one place. Once the ball is no longer in use, the ball will automatically be destroyed. Further, even if we have multiple people coding different parts of our game logic, the Rust compiler will make sure all those pieces compose together. This makes composition easy, as opposed to fraught with potential error.
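Here is a rough sketch of what "moving the ball" can look like. The function names, the trimmed-down `Ball`, and every number are hypothetical; this is an illustration of ownership, not the article's actual game code.

```rust
/// A trimmed-down ball for this sketch (fields are assumptions).
#[derive(Debug)]
struct Ball {
    velocity: (f64, f64, f64),
    spin: f64,
}

/// The pitcher takes ownership of the ball, changes it, and hands it back.
fn pitcher(mut ball: Ball) -> Ball {
    ball.velocity = (0.0, -40.0, 0.0);
    ball.spin = 2200.0;
    ball
}

/// The batter does the same; only the current owner can modify the ball.
fn batter(mut ball: Ball) -> Ball {
    ball.velocity = (10.0, 45.0, 20.0);
    ball.spin = 1500.0;
    ball
}

/// One play: the ball is created, moved from player to player, and returned.
fn play() -> Ball {
    let new_ball = Ball { velocity: (0.0, 0.0, 0.0), spin: 0.0 };
    let pitched = pitcher(new_ball); // `new_ball` has been moved away here
    // println!("{:?}", new_ball);   // compile error: use of moved value
    batter(pitched)
}
```

Whoever calls `play()` owns the returned ball. When that owner goes out of scope, the ball is dropped automatically, with no "end of play" bookkeeping to forget.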

This is all handled by analyzing the code for our game. The genius of Rust, and what makes it unique, is that all of this is done at compile time, before any code is executed. It gives developers, especially hobby developers such as myself, tremendous confidence that our code is memory safe.

Rust is Blazingly Fast

Rust runs extremely fast and performs on par with C. In my non-expert opinion, as time passes, Rust’s static guarantees will allow it to surpass C in performance across a variety of workloads. Most importantly, it is a living, breathing, actively-developed language. The same code you wrote six months ago will get faster as better algorithms get baked into the standard library, or a library your software depends on. There is also a fantastic ecosystem of enterprise-class, free, open-source libraries that integrate seamlessly. We’ll be using these extensively as we dive into the code.

Let’s Build a Real-Life Rust Baseball Application

If you’d like to follow along, go to the Rust homepage and follow their “Getting Started” instructions. It’s also a great starting point for learning resources. Otherwise, just sit back, relax and enjoy some Rusty baseball code. If you see the term “crate,” it’s simply Rust’s term for a “module” or “package.”

One of Rust’s strengths is its ability to compose pieces of software easily. I use Visual Studio Code for code editing. It’s totally free and very easy to use. I’m going to skip all the setup steps and assume you have a bare-bones “Hello, world” application set up with a main.rs and a Cargo.toml file.

The code we’ll be going through may not represent “idiomatic” Rust code. Idiomatic is programming jargon for the ideal way, or pattern, to code something. I couldn’t get the code to format nicely, so I’ve pasted it all as pictures.

There is no greater public database in the world than the MLB Gameday XML files, other than maybe the StatCast data hosted on Baseball Savant. These XML files contain pitch-by-pitch data for every pitch thrown in affiliated baseball since 2008. While these files are no longer supported, they are a trove of delicious data. We’ll start by finding all the games for a particular level for an arbitrary date and turn them into a set of links we can use.

Crate to Request Data from the Internet

If we’re going to download data from the network, we’ll need the reqwest crate. All we need to do to use this crate is add reqwest = "0.9.18" to the [dependencies] section of the Cargo.toml file that was created by cargo, and add use reqwest; to the top of our main.rs file.

We’ll need to construct the URL we’ll be using. To do this, we’ll write a simple function that takes four string slices (text inputs) and outputs a String. For more info on how Rust handles text, read Chapter 8.2 of The Rust Book. If you are new to Rust, don’t worry too much about the difference between &str and String. It will make more sense once you’re well on your way.

Building The URL

We’ll write a simple URL construction function:
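The original listing was an image. A sketch consistent with the description and the sample URL later in this section could look like this; the parameter names and their order are assumptions.

```rust
/// Builds the Gameday directory URL for one level and date.
/// Takes four string slices and always returns an owned String.
fn game_day_url(level: &str, year: &str, month: &str, day: &str) -> String {
    // No semicolon on the final expression: its value is what the function returns.
    String::from("http://gd2.mlb.com/components/game/")
        + level
        + "/year_"
        + year
        + "/month_"
        + month
        + "/day_"
        + day
}
```

Calling `game_day_url("mlb", "2018", "06", "10")` produces the URL for June 10, 2018 at the MLB level.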

In Rust, our function signatures act as a contract. This signature enforces that anything that calls (uses) the function must give it four string slices as input. The function is then guaranteed to always return a String. The syntax for a function is always fn followed by the name of the function (game_day_url in our case). This is followed by parentheses (), which list the inputs to the function, if there are any. If the function returns something, the parentheses are followed by ->, which means “returns” and is followed by the type of data that is being returned (a String, an integer type, a struct you defined). Everything in the {} braces is the function body.

The last expression (“line”) in the function body is implicitly returned. This means any expression at the end of the function will be the value the function sends back. This function does something very simple: It just concatenates (adds text to text) the Gameday base URL with the level code (mlb, aaa, aax…), the year, month and day.

Our game_day_url function will return something like this:

http://gd2.mlb.com/components/game/mlb/year_2018/month_06/day_10.

Extracting the Game Links into a List of Links

When we deal with detailed XML files, specifically the pitch-by-pitch data, we’ll spend the time to do proper de-serialization. For now, we’re going to rely on iterators. Iterators are one of the most powerful programming patterns. An iterator takes any list of values and builds a series of functions that get applied to each value. This enables chaining a lot of operations in a manner that is very easy to read and understand. Here’s our game_day_links function:
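The original listing was an image. The sketch below follows the walkthrough in the paragraphs after it, with one important substitution: so that it runs without a network connection or an external crate, `get` and `Response` are hypothetical stand-ins for `reqwest::get` and its response type, and the HTML (including the game ids) is made up. How the directory name is joined onto the URL is also an assumption.

```rust
/// Hypothetical stand-in for the response type returned by `reqwest::get`.
struct Response {
    body: String,
}

impl Response {
    /// Reading the body as text can itself fail, so this returns a Result.
    fn text(self) -> Result<String, String> {
        Ok(self.body)
    }
}

/// Hypothetical stand-in for `reqwest::get(url)`. It returns canned,
/// made-up HTML for Gameday-style URLs and an error for anything else.
fn get(url: &str) -> Result<Response, String> {
    if url.starts_with("http://gd2.mlb.com/") {
        let body = String::from(
            "<ul>\n\
             <li><a href=\"media/\"> media/</a></li>\n\
             <li><a href=\"gid_2018_06_10_aaamlb_bbbmlb_1/\"> gid_2018_06_10_aaamlb_bbbmlb_1/</a></li>\n\
             <li><a href=\"gid_2018_06_10_cccmlb_dddmlb_1/\"> gid_2018_06_10_cccmlb_dddmlb_1/</a></li>\n\
             </ul>",
        );
        Ok(Response { body })
    } else {
        Err(format!("could not reach {}", url))
    }
}

/// Returns one link per game found at `url`, or an empty Vector on failure.
fn game_day_links(url: &str) -> Vec<String> {
    let resp = get(url);
    if resp.is_ok() {
        // Unwrap the Response, then fall back to an empty String if the
        // body cannot be read as text.
        let links = resp.unwrap().text().unwrap_or(String::new());
        links
            .split("<li>")
            .filter(|line| line.contains("gid_"))
            .map(|line| {
                // Third piece after splitting on '<' or '>' is the link text.
                // Indexing with [2] panics if an item has fewer pieces.
                let parts: Vec<&str> = line.split(|c: char| c == '<' || c == '>').collect();
                url.to_string() + "/" + parts[2].trim()
            })
            .collect()
    } else {
        vec![]
    }
}
```

With the canned page above, the `media/` entry is filtered out and two full game links come back.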

This function takes a string slice and returns a Vector of Strings. A Vector allows you to store a list of items of the same type. In our case, we either want to return a Vector with all the links or an empty Vector.

Our first line: let resp = reqwest::get(url); sends out a request to the network that will return a Result. The Result type is an enum (short for enumeration), meaning a value of that type is always exactly one of its enumerated variants. The Result of our request is either Ok, holding a Response, or Err, holding an Error. Before we “unwrap” the Response from the Result, we’ll first need to make sure we got a Response. If we received an Error, we’ll need to handle that.

The if resp.is_ok() checks that we indeed got a Response. If we didn’t, the code will pop down to the else {} clause and return an empty Vector, using the vec! macro. We’ll now unwrap the Response from the Result with let links = resp.unwrap() and then get the text of the Response, which will also return a Result. We will unwrap this Result with an unwrap_or, which will either give us the text or an empty String. The links variable should now contain all the HTML from the URL.
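As an aside, the check-then-unwrap pattern can also be written with `match`, which many Rust programmers find harder to get wrong because there is no separate unwrap step. This is an alternative sketch, not the article's code; `fetch_text` is a hypothetical stand-in for a network call and its canned response is made up.

```rust
/// Hypothetical stand-in for a network call that may fail.
fn fetch_text(url: &str) -> Result<String, String> {
    if url.starts_with("http://") {
        Ok(String::from("<li><a href=\"gid_x/\"> gid_x/</a></li>"))
    } else {
        Err(format!("unsupported URL: {}", url))
    }
}

/// Handles both variants of the Result in one place.
fn page_text(url: &str) -> String {
    match fetch_text(url) {
        Ok(text) => text,          // we got the page
        Err(_) => String::new(),   // fall back to an empty String
    }
}

/// The same behavior using a standard-library shortcut on Result.
fn page_text_short(url: &str) -> String {
    fetch_text(url).unwrap_or_default()
}
```

Both functions return the page text on success and an empty String on failure; the compiler will complain if a `match` forgets one of the two variants.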

The links.split("<li>") takes the text and splits it into an iterable list of items. Every time it sees “<li>,” it will create a new item. We only need items that have gid_, so we’ll apply a filter function to each line .filter(|line| line.contains("gid_")). The |line| creates a closure on each item, which in my brain translates to “take the entire item and call it line.” We’ll then pass only the items that contain gid_ to the map function.

Maps take each item in the list and map it to a new value. Our map function will include its own iterator that we’ll use to decompose the text into a Vector. In the expression url.to_string().clone(), the key point is that we make our own owned copy of the URL, since the url we were handed will “live” only as long as the function. (Strictly speaking, to_string() already produces an owned String, so the extra .clone() is redundant, though harmless.) If we’re building a list that will outlive the function, we can’t keep a reference to the original string; we’ll need a copy of it. In plain words, once the function is done, all the variables in it will be dropped. (They won’t be around anymore.) If the item we’re returning depended on those values, it would be pointing at something that doesn’t exist anymore. We deal with this by explicitly copying each item into our return value.

Since the construction of each link is the same, we’ll simply split it into a sub-list whenever we see a “<” or a “>” and then take the third item of that list. In most programming languages, the first item in a list is item 0, so the [2] indicates we’re taking the third item. We do this by collecting the list into a Vector of string slices (Vec<&str>), taking [2] and then trimming any white space.
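To see why the third item is the one we want, here is the split applied to a single made-up list item. The helper names and the sample item are hypothetical.

```rust
/// Splits one list item on every '<' or '>' character.
fn pieces(item: &str) -> Vec<&str> {
    item.split(|c: char| c == '<' || c == '>').collect()
}

/// Takes the third piece (index 2) and trims surrounding white space.
/// Panics if the item has fewer than three pieces.
fn link_text(item: &str) -> &str {
    let parts = pieces(item);
    parts[2].trim()
}
```

For the item `<a href="gid_x/"> gid_x/</a></li>`, piece 0 is the empty text before the first `<`, piece 1 is `a href="gid_x/"`, and piece 2 is ` gid_x/` with a leading space, which is why the trim is needed.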

After the filter and map, we’ll then collect this clean list into a Vector of Strings.

Putting it All Together

Let’s use the functions we created. In our main() function, we’ll add three lines of code:
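The original three lines were an image. A sketch of what they might look like follows. To keep it compilable on its own, the three lines sit in a hypothetical `run` function rather than `main`, and both helper functions have placeholder bodies (the links returned are made up).

```rust
/// Placeholder body: builds the Gameday URL for a level and date.
fn game_day_url(level: &str, year: &str, month: &str, day: &str) -> String {
    format!(
        "http://gd2.mlb.com/components/game/{}/year_{}/month_{}/day_{}",
        level, year, month, day
    )
}

/// Placeholder body: returns made-up links instead of hitting the network.
fn game_day_links(url: &str) -> Vec<String> {
    vec![format!("{}/gid_example_1/", url), format!("{}/gid_example_2/", url)]
}

/// The three lines described in the article, returning the list so it can be inspected.
fn run() -> Vec<String> {
    let url = game_day_url("mlb", "2018", "06", "10");
    let games = game_day_links(&url);
    dbg!(games)
}
```

`dbg!` prints the expression and its value to standard error, along with the file and line number, and then hands the value back, which is why `run` can return it.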

The first line creates a URL with the inputs above. The second line creates a variable games, which will be a Vector (list) of the game day links for that URL.

The dbg! macro will show us what games looks like when we run the program.

We’ve just built a simple utility that turns an arbitrary URL into a list of links we can use. While that may not seem like much, we did it using a systems programming language. In the next piece, we’ll extend this further and show off some of Rust’s more powerful features, including parallel processing.

Closing Thoughts

Rust claims to be a language that empowers everyone to build reliable, efficient software. I can only speak for myself, but Rust has truly empowered me to build things I never thought I could, both baseball and non-baseball related. My hope is you’ll take the time to give Rust a try, even if you’re a non-programmer, or coming from the JavaScript, Python, Ruby or R communities.

If you’re interested in getting started, begin with The Rust Book — make sure to read it twice. I highly recommend Exercism, once you’re ready to start trying out some code. Exercism was key for me in getting up and running. It’s 100% free and maintained by the various language communities. You’ll get more out of it in Mentor mode.

We’ve only scratched the surface of what Rust is and what it can do. Today, we introduced some core concepts and a little code. Part 2 will dive deeper into code, including parallel computation, as well as demonstrate some powerful, expressive features of the Rust language.

Eli Ben-Porat is a Senior Manager of Reporting & Analytics for Rogers Communications. The views and opinions expressed herein are his own. He builds data visualizations in Tableau, and builds baseball data in Rust. Follow him on Twitter @EliBenPorat, though you may be subjected to (polite) Canadian politics.