Building a Robot Umpire with Deep Learning Video Analysis

Robot umpires might one day replace human umpires thanks to video analysis. (via Keith Allison)

Today, there is a strong partnership between technology and sports. Almost all professional sports teams use technology in some capacity to help them win games. In the same vein, almost all professional sports leagues use technology in some capacity to help referee and manage the games. In baseball specifically, video instant replay has been in use since the 2008 season, with the current system (with manager challenges) in place since the 2014 season. Some people have proposed taking things to the next level and using computers to replace some or all on-field umpires. Here is an article that surveys the current state of electronic umpires and includes opinions of some players and coaches, and here is an article where MLB Commissioner Rob Manfred discusses his views on the topic.

This post will explore building a robot umpire that calls balls and strikes based on actual game video, using a technique known as Deep Learning.

Background on Deep Learning

Before talking specifically about Deep Learning, let’s first discuss Artificial Intelligence and Machine Learning. Artificial Intelligence (AI) is a broad field that attempts to build machines that can mimic cognitive functions typically associated with the human brain, such as learning and problem solving. Machine Learning (ML) is a subset of Artificial Intelligence that uses statistical techniques to allow the machine to learn from data without being explicitly programmed. Although AI and ML have existed for over 50 years, their usage and importance have exploded in recent years, as the range of available data and access to it have increased.

Deep Learning is a subfield of Machine Learning. One common model/representation used in ML is the “neural network,” which is inspired by how the human brain is laid out and operates. A traditional neural network consists of three “layers”: one input layer, one hidden layer, and one output layer. Deep Learning takes this model further by allowing for many hidden layers, which in theory allows it to perform much more complicated and intensive analysis compared to what traditional ML can do. Deep Learning has taken off in recent years due to the greater computing power now available (especially with the rise of cloud computing).

One nice property of Deep Learning is that it can operate on “raw” data. Most traditional Machine Learning techniques require an operation called “feature extraction” that computes specific properties from the raw data. For example, to build a robot umpire that analyzes game video with traditional ML, it would probably be necessary to write some code that detects the position and speed of the baseball as it crosses home plate. Then, with this position and speed data in hand, an ML model could be built and used to make predictions. Deep Learning does not require this feature extraction step, as the system automatically detects the features it deems to be important on its own. In this case, game videos can be fed into the system as input data, and Deep Learning will generate a ball or strike call.

An important consequence of skipping the feature extraction step is that it makes Deep Learning models and analyses “end-to-end systems.” This means that the system will take into account all visible features in the data, whether they are intended to influence the outcome or not. For example, it is well known that the height of the batter and the pitch framing ability of the catcher will influence whether the pitch will be called a ball or a strike. In a traditional ML system, only writing code to detect the position and speed of the ball would ignore the effects of the batter’s height and the catcher’s pitch framing. In that case, additional code would be needed in order to take those factors into account. Deep Learning will automatically utilize these bits of information.

Technical Details

The game videos used here are sampled from the 2017 regular season. They were recorded at a resolution of 640×360 and a frame rate of 30 FPS (frames per second). Since an on-field umpire does not have a strike zone overlay during the game, the videos were checked to verify that they do not have visible overlays either.

There is pitch-by-pitch documentation of each game, and from this documentation it’s known whether the batter took each pitch, and if he did, what was the call made by the umpire. This information tells us whether each particular pitch should be included in the study, and if so, what label (ball or strike) to give for that pitch to the Deep Learning system.
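
The selection rule just described can be sketched in a few lines. This is only an illustration: the record layout and the field names `batter_swung` and `umpire_call` are hypothetical, since the format of the pitch-by-pitch documentation is not described here.

```python
def label_for_pitch(record):
    """Return 1 for a called strike, 0 for a called ball, None to exclude.

    `record` is a hypothetical dict; the field names are assumptions.
    """
    if record["batter_swung"]:
        return None  # the batter did not take the pitch, so there is no zone call
    call = record["umpire_call"]
    if call == "strike":
        return 1
    if call == "ball":
        return 0
    return None  # any other outcome is left out of the study


# Made-up records, for illustration only.
pitches = [
    {"id": "p1", "batter_swung": False, "umpire_call": "strike"},
    {"id": "p2", "batter_swung": True, "umpire_call": None},
    {"id": "p3", "batter_swung": False, "umpire_call": "ball"},
]
labeled = {
    p["id"]: label_for_pitch(p)
    for p in pitches
    if label_for_pitch(p) is not None
}
```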

The code was written in the Python language, and the Deep Learning library that was used was Keras, which in turn uses the TensorFlow library. In modern programming practice, instead of starting with a blank slate, you often start with some open-source code someone else wrote for a similar project. There is a useful GitHub repo here, which is accompanied by this informative Matt Harvey article. Having this starting point saved countless hours that would be required to write all the code from scratch.

Deep Learning libraries typically run at optimal speeds when a GPU (Graphics Processing Unit) is used. GPUs are very good at performing large numbers of computations in parallel, which is exactly what is done in Deep Learning. For this project, two PCs were used, each equipped with an Nvidia GTX 1080 GPU, which is currently the second most powerful consumer-grade GPU available for Deep Learning.

Video Pre-processing with Pose Estimation and Onset Detection

Some people think that the most difficult job in building a Deep Learning system is writing the code to do the computations. In actuality, that is almost never the case, as the code to do the computations was already written by the people who wrote the libraries (in this case Keras and TensorFlow). The code that most people write is actually to set up the Deep Learning library’s pipeline to do the calculations, which is a much easier job.

In most Deep Learning/Machine Learning systems, the most difficult job is to pre-process the data to put it into a form that the DL/ML library can understand and digest. In this case, feeding entire pitch videos into the system would certainly overburden the computers, and slow the system down so much it would take years for the study to complete. The start and finish of each pitch need to be identified, and the videos cut down to the split second just before the ball hits the catcher’s mitt. While there are many ways to go about this task, there is no magic formula for an optimal solution. Many hours in trial-and-error mode were required to find a reliable solution.

The solution I eventually settled upon uses pose estimation followed by onset detection. Pose estimation is the process of estimating the orientation of an object in an image/video. In this case I am using it to estimate the pose of the pitcher, to determine when his arm and/or leg is extended far away from his body. This signifies that the pitcher is in his windup or follow-through motion, and he is either about to release the ball or has just released the ball. Just like with the Deep Learning libraries, there is a freely available program that implements pose estimation called OpenPose, which is available here. Note that this step is by far the slowest step in the process. Running pose estimation on all of the videos recorded took over two months.
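
To make the “arm and/or leg extended far away from his body” test concrete, here is a rough sketch of one way such a check could be written once the keypoints are in hand. It is not the actual rule used in this project: the joint names, the horizontal-offset measure, and the threshold value are all assumptions chosen for illustration.

```python
import math


def is_limb_extended(keypoints, ratio_threshold=1.0):
    """True if a wrist or ankle sits far to the side of the torso.

    `keypoints` maps hypothetical joint names to (x, y) pixel coordinates.
    The horizontal offset of each limb from the torso is divided by the
    torso length so the test does not depend on how large the pitcher
    appears in the frame. The threshold is an illustrative guess.
    """
    neck = keypoints["neck"]
    hip = keypoints["mid_hip"]
    torso_len = math.dist(neck, hip)
    if torso_len == 0:
        return False
    torso_x = (neck[0] + hip[0]) / 2
    limbs = ("left_wrist", "right_wrist", "left_ankle", "right_ankle")
    return any(
        abs(keypoints[name][0] - torso_x) / torso_len > ratio_threshold
        for name in limbs
        if name in keypoints
    )
```

A frame would then be flagged as windup or follow-through when this returns True for the pitcher’s skeleton.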

Onset detection refers to finding onsets, which are the beginning of sounds. Once I am able to detect the pitcher going into his windup or in his follow-through, it is presumed that from that point on, the next loud onset would be the sound of the ball hitting the catcher’s mitt or the dirt around home plate. Part of the code attempts to filter out the commentator’s speech, to prevent the speech from being mistaken for the ball’s contact sound (this filtering process was done with a digital signal processing high-pass filter with a cutoff frequency of 7500 Hz). I found a good implementation of onset detection here.
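
The idea can be sketched in plain Python as follows. This is a simplified stand-in, not the implementation referenced above: the first-order filter, the frame-energy ratio test, and every parameter other than the 7500 Hz cutoff are assumptions made for illustration.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz=7500.0):
    """First-order RC high-pass filter (a simple stand-in filter design)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = [0.0] * len(samples)
    for i in range(1, len(samples)):
        out[i] = alpha * (out[i - 1] + samples[i] - samples[i - 1])
    return out


def frame_energy(samples, frame_size):
    """Sum of squared amplitudes for each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def first_onset_after(energies, start_frame, ratio=4.0, floor=1e-6):
    """Index of the first frame at or after start_frame whose energy
    jumps by more than `ratio` over the previous frame, else None."""
    for i in range(max(start_frame, 1), len(energies)):
        if energies[i] > floor and energies[i] > ratio * energies[i - 1]:
            return i
    return None
```

Here `start_frame` would be the audio frame matching the video frame where the windup or follow-through was detected, so that only sounds after that point are considered.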

Finally, I used the handy FFmpeg utility to cut the original video down to the 0.4 seconds just before the end of the pitch (which results in 12 image frames per pitch), and to crop the video to a 300×300 pixel box around the batter, catcher, and umpire (this resolution is required for the Deep Learning pipeline I chose).
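
As a rough sketch of this step, the snippet below checks the frame arithmetic and builds an FFmpeg argument list for the cut and crop. The file names, the end-of-pitch time, and the crop offsets are hypothetical, and the exact options used in this project may have differed.

```python
def frames_in_clip(duration_s, fps):
    """Number of frames covered by a clip of the given length."""
    return round(duration_s * fps)


def build_ffmpeg_cut_command(src, dst, end_time_s, crop_x, crop_y,
                             duration_s=0.4, crop_size=300):
    """Argument list that keeps the last `duration_s` seconds before
    `end_time_s` and crops a square box with its top-left corner at
    (crop_x, crop_y). The list is only built here, not executed."""
    start = max(end_time_s - duration_s, 0.0)
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",       # seek to the start of the clip
        "-i", src,
        "-t", f"{duration_s:.3f}",   # keep this many seconds
        "-vf", f"crop={crop_size}:{crop_size}:{crop_x}:{crop_y}",
        "-an",                       # the audio track is no longer needed
        dst,
    ]


# Hypothetical values: mitt contact at 5.0 s, crop box offset (170, 30).
command = build_ffmpeg_cut_command("pitch.mp4", "pitch_cut.mp4", 5.0, 170, 30)
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine with FFmpeg installed.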

Here is an illustration of the pipeline I just described to you. This is one of the initial pitch videos I downloaded:

This is the video marked with OpenPose’s detected poses:

This is the frame where the pitcher’s follow-through was detected. You can see the pitcher’s high leg kick, and that the ball is in the air between the pitcher and the catcher (it is just above the catcher’s glove):

Here is a visualization of the audio track in this video, with the detection of the pitcher’s follow-through marked in yellow and the ball hitting the catcher’s glove marked in red. In this example, the sharp spike associated with the ball hitting the catcher’s glove makes it an easy job to detect this event, but in other examples it is not so obvious to the naked eye:

Here is the final truncated video (don’t blink, otherwise you’ll miss everything):

With this system, I was able to detect the majority of pitch start/finishes, while keeping the false detection rate at around one percent. For pitches where the system was not confident about the result (which could have been because the windup/follow-through was hard to find, or the sound of the ball was hard to detect, or both), I opted not to include the video in the study to minimize the chance of including a truncated video that does not contain a pitch.

Even excluding these videos, I was able to collect truncated videos of over 80,000 strikes and over 100,000 balls. To make my life easier, I opted to further sample the videos and match the number of ball and strike videos used. In the end, there were slightly over 160,000 pitches used in the study, with a roughly equal number of balls and strikes. In addition to having the nice property of equal numbers of strikes and balls, this also cut down on the PC processing time, although this admittedly comes with some loss of rigor.
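
The balancing step amounts to randomly downsampling both classes to the size of the smaller one. A minimal sketch follows, assuming the videos are held as two lists of identifiers; the fixed seed is only there to make the example repeatable.

```python
import random


def balance_classes(strikes, balls, seed=0):
    """Randomly sample both classes down to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(strikes), len(balls))
    return rng.sample(strikes, n), rng.sample(balls, n)
```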

Neural network architecture and detected features

I chose to use architectures most similar to methods four and five in the above-mentioned article. In both methods, two separate neural networks are used. The first neural network’s job is to extract the information from each individual video frame. This is done using a convolutional neural network (CNN). Building a CNN from scratch is a very difficult and time-consuming process. So, in keeping with leveraging the publicly available ideas and code of others, I used “transfer learning” with the publicly available Inception V3 CNN, which was created by Google. Whereas using Inception will typically yield predictions at its output layer, I modified the output layer slightly to output the detected features instead.

Here is a sampling of the information present at successive layers of the Inception CNN:

Original video frame:

Layer 1 (32 images at 149×149 resolution):

Layer 10 (64 images at 73×73 resolution):

Layer 50 (256 images at 35×35 resolution):

As you go further and further into the many layers of Inception, you will have more images per layer, and the images will be smaller in size. At the first layer, the images mostly look like different shades of the original video frame, and you can still clearly see the location of the baseball. However, as you go down the layers, the look of each image quickly becomes more abstract to the human eye. It is not uncommon to have some images be almost completely blank (as is the case with the second and fourth images in layer 10). To the Deep Learning system, however, these small, abstract images still carry meaning. At the final layer, each of the 2048 “images” gets reduced to a single number, yielding the 2048-element “feature vector.”
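
In Inception V3, that final reduction is a global average pooling step: each small “image” is replaced by the mean of its pixel values. Here is a toy sketch in plain Python, using nested lists in place of real tensors and two tiny maps in place of 2048:

```python
def global_average_pool(feature_maps):
    """Collapse each 2-D feature map to its mean, one number per map."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


# Two made-up 2x2 maps stand in for the 2048 maps of the real network.
toy_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
feature_vector = global_average_pool(toy_maps)
```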

These features that were detected by Inception were then used as input to the second neural network. The second neural network was either a long short-term memory (LSTM) or a multi-layer perceptron (MLP). The LSTM is a specific type of recurrent neural network (RNN) that is well-designed to handle inputs with a time dimension (such as exists here with the successive frames of each pitch video), while the MLP does not specifically define a time dimension, and simply concatenates all the feature vectors together to form the input. Note that there is no such thing as a “correct” or “best” implementation of an LSTM or MLP, so choosing a network topology (including factors such as the number of neurons used) is an experimentation process. Using a model that is either too simple or too complex will lead to different sets of problems.
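
The difference between the two input layouts can be shown with plain lists. This sketch assumes 12 frames per pitch and 2048 features per frame, as described earlier; the function names are made up for illustration.

```python
FRAMES_PER_PITCH = 12
FEATURES_PER_FRAME = 2048


def as_lstm_input(frame_features):
    """Keep the time axis: an ordered list of per-frame vectors."""
    return [list(vec) for vec in frame_features]


def as_mlp_input(frame_features):
    """Drop the time axis: concatenate every frame's vector into one."""
    return [value for vec in frame_features for value in vec]


# Dummy features: frame t is filled with the value t.
features = [[float(t)] * FEATURES_PER_FRAME for t in range(FRAMES_PER_PITCH)]
sequence = as_lstm_input(features)  # 12 vectors of 2048 numbers each
flat = as_mlp_input(features)       # one vector of 12 * 2048 = 24576 numbers
```

The LSTM sees which frame each vector came from, while the MLP only sees one long vector and has to learn any ordering on its own.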

The output of the second neural network is a single number between 0.0 and 1.0 that indicates the probability that the system believes the pitch is a strike. Note that if you subtract this number from 1.0, you will get the probability that the system believes the pitch is a ball.

Training/Test Methodology and Results

When creating any Deep Learning/Machine Learning system, there is always a training phase and a test phase (sometimes there are additional phases). In the training phase, you use a subset of your data to try to find the optimal parameters of the system. Then in the test phase, you use the remainder of the data to evaluate how well the system performs on new data it did not previously encounter during the training phase.

I used a process called K-fold cross validation with K=5. This means I partitioned the data into five separate sets, and for each pass, used four of those sets (thus 80 percent of the data) to perform training, and the last set (20 percent of the data) to perform testing. This process is repeated four more times (for a total of five passes). In each pass, a different set is left out of the training stage in order to be used as the test set. The training is an iterative process; in each “epoch”, an attempt is made to further minimize the value of a defined “loss function” (a measure of how “wrong” your predictions are), which usually leads to higher prediction accuracy. Thanks to the pre-processing that has been performed, each run only takes one to two hours. After these five runs are completed, a composite average of the results gives the overall performance.
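
The partitioning itself is simple to sketch. The version below works on sample indices only and is not the code used in this project; the shuffling seed and the way indices are dealt into folds are assumptions.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k passes."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for held_out in range(k):
        test = folds[held_out]
        train = [
            i
            for j, fold in enumerate(folds)
            if j != held_out
            for i in fold
        ]
        yield train, test
```

Each sample lands in the test set exactly once across the five passes, which is what makes the composite average a fair summary.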

The LSTM network yielded slightly better results than the MLP network, which is perhaps not surprising, as it is the same result published in the mentioned article (although in my case the results were closer to each other). The LSTM I used for these results used three layers of LSTM cells, with 64 hidden units per cell. In addition, there are a bunch of “hyperparameters” that you can tune to your liking; I used the Adam optimizer with a learning rate of 2e-4 (or 0.0002). The best epoch’s accuracy rate in each of the five runs was 64.5 percent, 65.2 percent, 63.8 percent, 64.6 percent, and 64.3 percent, thus yielding an average accuracy rate of 64.5 percent.

Note that there are two possible errors that the system can make:

  • Mistaking a ball for a strike
  • Mistaking a strike for a ball

There is no certainty these two separate error rates will be similar. Here is a “confusion matrix” for one of the models generated for the system (the one that scored 65.2 percent accuracy):

Confusion Matrix (rows: Deep Learning’s call; columns: umpire’s call)
                          Umpire: Strike    Umpire: Ball
Deep Learning: Strike             10,642           5,695
Deep Learning: Ball                5,705          10,652

Here is how to interpret the confusion matrix:

  • 10,642 cases where the system agreed with the umpire’s strike call
  • 10,652 cases where the system agreed with the umpire’s ball call
  • 5,705 cases where the umpire called a strike, but the system called a ball
  • 5,695 cases where the umpire called a ball, but the system called a strike

The correct strike rate is 65.1 percent, with 10,642 correct out of 16,347 (10,642+5,705) total strikes, while the correct ball rate is 65.2 percent, with 10,652 correct out of 16,347 (10,652+5,695) total balls. Thus this model seems to be equally good (or bad, depending how you look at it) at identifying strikes and balls.

If a strike is thought of as a “positive” event and a ball as a “negative” event, precision and recall can be computed, as can the F1 score:

  • Precision = True positive / (True positive + False positive) = 10642/(10642+5695) = .651
  • Recall = True positive / (True positive + False negative) = 10642/(10642+5705) = .651
  • F1 score = 2 * Precision * Recall / (Precision + Recall) = .651
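
These figures can be reproduced with a few lines of Python. The function below is a generic helper written for illustration; the counts are the ones from the confusion matrix above, with a strike treated as the positive class.

```python
def binary_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# tp: both call a strike; fp: system strike, umpire ball;
# fn: system ball, umpire strike.
system = binary_metrics(tp=10642, fp=5695, fn=5705)
rounded = {name: round(value, 3) for name, value in system.items()}
```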

These results were taken with a probability threshold of 0.5, meaning that if the system output is higher than 0.5, it is categorized as a strike, and if the output is lower than 0.5, it is categorized as a ball. It is possible to tune this threshold, which will change the values just calculated.
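
Here is a small sketch of how the threshold turns probabilities into calls and how moving it shifts the confusion counts. The probabilities in the test data would come from the second neural network; treating an output of exactly 0.5 as a ball is an assumption, since that case is not specified above.

```python
def call_pitch(strike_probability, threshold=0.5):
    """Strike if the output is above the threshold, otherwise ball."""
    return "strike" if strike_probability > threshold else "ball"


def confusion_counts(probabilities, umpire_calls, threshold=0.5):
    """Count agreements and disagreements with the umpire's calls."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, truth in zip(probabilities, umpire_calls):
        if call_pitch(p, threshold) == "strike":
            counts["tp" if truth == "strike" else "fp"] += 1
        else:
            counts["fn" if truth == "strike" else "tn"] += 1
    return counts
```

Raising the threshold makes the system more reluctant to call strikes, which trades false positives for false negatives.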

Note that even though the training stage may have taken one to two hours to run, the test stage where the ball/strike calls are made is very fast. The calls for the entire confusion matrix (over 32,000 calls) were made in just a couple of seconds, so a single call would only take a fraction of a second. This means that the system would be able to run as a real-time system.

Putting the results into context

Let’s try to put this 64.5 percent average accuracy rate into context. Each pitch has two possible outcomes, ball or strike. Since the data set has equal numbers of balls and strikes, guessing randomly will achieve 50 percent accuracy. Taking this into account, a 64.5 percent accuracy rate doesn’t seem like such a great improvement.

At the same time, human umpires are not 100 percent accurate when making their calls. The Deep Learning system is trying to match each pitch to what the umpire’s call was, not whether the pitch was a ball or strike based on the theoretical ideal strike zone. The umpire’s call is being used as the baseline for truth. Therefore this system should not be expected to match the umpire’s call 100 percent of the time. The best I can hope is that it will perform as well as a “human expert” does.

Most studies of umpire accuracy compare individual umpires to league average. I was curious how I would perform, so I gave myself a test to see how well I could call the zone. After viewing 1000 truncated pitch videos (500 each of balls and strikes), I managed to attain an accuracy level of 87.2 percent. This study shows that the variance between human umpires can be over 20 percent, so it is not surprising that my calls disagreed with the umpires 12.8 percent of the time. I cannot expect 100 percent accuracy for a computer, since different human umpires have different definitions of the strike zone. Perhaps the best I can hope for at this time is a 90 percent agreement between the computer and the aggregate human umpire.

Here is my confusion matrix:

Confusion Matrix (rows: human’s call; columns: umpire’s call)
                  Umpire: Strike    Umpire: Ball
Human: Strike                433              61
Human: Ball                   67             439

This translates to an 87.2 percent accuracy rate, 86.6 percent correct strike rate, 87.8 percent correct ball rate, .877 precision, .866 recall, and .871 F1 score.

More thoughts

Here are some other things to consider:

  • Deep Learning is a hard problem. It was only about 10 or 15 years ago when the first systems were able to differentiate between images (not videos) of dogs vs. cats, which is a much simpler problem than differentiating between videos of balls vs. strikes. While academic and industry researchers have made significant improvements since that time, even today’s most state-of-the-art systems are sometimes still not as good as a human expert.
  • Deep Learning requires a lot of training data. It is not uncommon for modern systems to use millions (and in some cases hundreds of millions) of training samples. While I was able to collect a good amount of data, it was not in the millions range. I bet the system can perform better with more training data. As a test, I tried building the system using only about 20,000 videos for training (10,000 strikes and 10,000 balls), and it attained about 59 percent accuracy. So using the full data set does improve the result.
  • I did not take the previous history of the at-bat and game into account. All pitches were treated equally with the same standards. In real games, umpires will have certain tendencies. For example, if an umpire was calling low strikes in the first inning, he will probably continue to do so for the rest of the game; therefore it would probably help to build that kind of context into a Deep Learning system.
  • The videos are somewhat flawed. They were designed to look good on a baseball fan’s TV/computer monitor, and were never designed to be used for building a robot umpire. Each stadium has the video camera (possibly different cameras in different stadiums) set up in a slightly different location in the outfield, thus yielding a slightly different viewpoint in the resultant video. Sometimes the pitcher’s body will obstruct the view of the ball or home plate. If you went to MLBAM’s engineers, and asked them to design a video camera/feed that is optimized for building a robot umpire, they would likely give you something very different from this. It would certainly be a higher resolution/frame rate video, with a more optimal and standardized viewpoint, not to mention taken from a differently engineered camera (it may very well be a 3D camera).

To expand upon that last bullet point, given that I am a hobbyist doing this project in my spare time with computer equipment paid for out of my own pocket, I am sure that MLBAM could do a much better job with a team of full-time engineers, access to all of MLB’s resources, and a much deeper wallet.

This study offers another method for creating a robot zone. While it does not appear likely to improve upon zone calls now, my hope is that by demonstrating the method is viable, a means of improving upon on-field calls becomes possible in later iterations. One way to improve the system is to compute the optical flow images for each successive pair of images in the videos. Optical flow images will show the direction of movement of the objects in the video, and they can be helpful in indicating the movement of the baseball and catcher’s mitt.


This project was born of interest in an area of research I found lacking. I couldn’t find any published articles on this topic, so I decided to give it a try myself. The results suggest both that Deep Learning could call a strike zone, and that the results would benefit greatly from better quality videos and equipment. There have been some very good articles published which successfully used PITCHf/x or Trackman data to build a robot umpire (such as this article which chronicles using a robot umpire in an independent league game).

The results of those studies suggest that method is farther along. It would be interesting to see if a Deep Learning system with more videos, better videos, and better computing infrastructure could equal or perhaps outperform the performance of those systems. In the meantime, this method offers another possible means by which strike zone calls could be improved. I look forward to fine-tuning this method in the months to come.

Roger works as a software engineer by day, writes for The Hardball Times and FanGraphs by night, and has also worked for a Major League club.
4 years ago

even this ai would learn to be biased against aaron judge. feelsbadman

4 years ago

Fascinating article Roger.

I did have several questions on how your software platform would work, and on how such a system would be implemented:

1) Would the Pose estimation software you used work with side-armers in addition to pitchers throwing over the top? If not, what effort would be required to make that work?

2) In addition to balls/strikes home plate umpires must make foul tip, swinging strike calls. Does the onset detection software you utilized handle that?

3) How would you envision a robotic umpire being utilized in a game? Would everyone be expected to turn to see a CF scoreboard on whether a call was a ball or strike? From a speed/flow perspective, how would the software work calling a ball/strike on a 3-2 count with a runner on first running (a catcher always wants a quick, clear voice signal on the call so he knows whether a throw down to second is required)?

4 years ago

Why can’t a chip be put into baseballs and sensors placed on the edge of the plate? Ideally, also have sensors sewn into the knees and letters of uniforms. Theoretically, the sensors would allow to triangulate the position of the baseball going through the zone and improve accuracy.

Kyle Boddy
4 years ago
Reply to  Roger Cheng

This has already been done (DiamondKinetics) and the results are good but not better than PFX / Optical tracking at its best. In an ideal world, optical > radar. There’s no reason Trackman should be more accurate than PFX.

4 years ago
Reply to  capconstrained

The idea of sewing a chip into the uniform presents opportunities for variation. If I’m a batter, I might bunch my pant legs up a little to raise the bottom of the strike zone.

The most interesting element I see in this exercise is the pose detection. If implemented well this could be done in real-time, solving one of the biggest challenges of the RoboUmp – the vertical limits of the strike zone. Estimating a player’s skeletal frame is much more accurate than relying on someone setting the limits for each pitch or using sensors on the body.

4 years ago

Thank you for sharing this fascinating and impressive work. However, I do not understand the basic motivation. If the training is based on calls made by human umpires, how can the result, even in principle, improve upon that standard?

If the training were instead to be based, say, on some computerized zone that ignores things like the count and catcher’s movements (as we now have but do not use officially), what advantage would it provide over just using that computerized zone in the first place?

Jetsy Extrano
4 years ago

Fascinating exercise!

You mention you’re using the humans as the ground truth — one thing about that is that the humans appear to be heavy users of a pre-pitch prior of whether a strike is likely. At 3-0 the pitcher’s interests are tilted toward a strike, so the called zone enlarges from that prior.

The better a robot ump gets at judging the ball, the less this prior will matter. Calling perfectly will be a rather different game, with more walks and strikeouts, though pitchers may adapt. Smaller zone at 3-0, bigger zone at 0-2.

For reference I wonder how a naive model would perform that didn’t see the pitch at all, only knowing the count.

4 years ago

I have a hard time believing there isn’t a machine already that can call balls and strikes in an accurate, reliable, consistent and verifiable manner far superior to humans. I mean, this technology has been developing for twenty years or more.

Kyle Boddy
4 years ago
Reply to  RWinUT

The problem is very difficult. Just because 20 years have passed doesn’t make anti-gravity machines more possible. We are still in the very early stages of computer vision.

Kyle Boddy
4 years ago


It is strictly against the license of OpenPose to use it for sports applications that are non-free + commercial. This should probably be included in your article as a pretty important note for MLB teams out there (as well as analysts who would not contribute back).

Otherwise, very interesting approach and one I’m quite familiar with and look forward to reviewing a bit closer for some of the other things we do in our lab!

4 years ago

I really dislike that catcher framing is at all relevant. The rulebook makes no reference to framing; therefore framing cannot matter.

4 years ago
Reply to  Llewdor


Does the rulebook make reference to defensive shifts? Hit-and-run? Knuckleballs? Shortstops mimicking ground-balls in order to fool baserunners?

The word ‘strategy’ refers precisely to measures taken in response to the finite coverage of rulebooks.

By assigning responsibility to a human umpire, as opposed to some artificial detector, for calling balls and strikes, the rulebook implicitly invites teams to employ good framers. (in the same way that by not dictating where defenders shall play, it invites defensive shifts)

Bobby Ayala
4 years ago
Reply to  mgwalker

Framing is tricking stupid/unobservant umpires into making incorrect calls. It shouldn’t factor into ball/strikes at all.

4 years ago
Reply to  Bobby Ayala

You vastly underestimate the skill, dedication, training and professionalism of MLB umpires.

4 years ago

I’m all for whatever stops umpires calling strike 3 on Moncada when the pitch is clearly a ball.