
HHP: Hard Data on Ball/Strike Calls - How Good/Bad are the Umpires


skanar


Well, really, if we're even going to have the conversation about computerized ball/strike calls in lieu of umpire calls, which I've seen discussed at length on this forum, we need a clear definition of what the strike zone is, which I've seen no real discussion of. I did some digging, and the rule book doesn't seem to define how much of the ball needs to cross the strike zone for the pitch to count as a strike. But there is a lot of discussion of enforcement in MLB, where some umpires enforce the strike zone in different ways. Wouldn't the first step in all of this be to standardize enforcement: is it a strike when all of the ball, the majority of the ball, or any part of the ball crosses the plane?

Any part of the ball crossing the plane should be a strike as far as I know.
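To make the three readings concrete, here is a minimal Python sketch of how "any part", "majority" (taken here as the center of the ball), and "all of the ball" would change a call on the same pitch. The `Zone` class, the `is_strike` function, and the ball radius of about 1.45 inches are my assumptions for illustration, not anything from the rule book or the PitchFX data, and widening a rectangle by the ball radius slightly overcounts at the corners.

```python
from dataclasses import dataclass

BALL_RADIUS_FT = 1.45 / 12   # assumption: ball diameter of roughly 2.9 inches
HALF_PLATE_FT = 8.5 / 12     # home plate is 17 inches wide


@dataclass
class Zone:
    top: float      # top of the zone, feet above the ground
    bottom: float   # bottom of the zone, feet above the ground


def is_strike(px, pz, zone, rule="any"):
    """px: ball center's horizontal offset from the plate center (ft).
    pz: ball center's height (ft). rule: 'any', 'center', or 'all'."""
    # 'any' widens the zone by one ball radius, 'all' shrinks it by one,
    # and 'center' (a stand-in for "majority of the ball") leaves it alone.
    margin = {"any": BALL_RADIUS_FT, "center": 0.0, "all": -BALL_RADIUS_FT}[rule]
    in_width = abs(px) <= HALF_PLATE_FT + margin
    in_height = zone.bottom - margin <= pz <= zone.top + margin
    return in_width and in_height
```

With a pitch whose center is 9 inches off the middle of the plate (`px=0.75`), this calls a strike under "any" and a ball under the other two, which is exactly the kind of disagreement the question is about.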


I believe I found the flat file with all of the data referred to earlier.

http://gd2.mlb.com/components/game/mlb/

I can code an XML parser to gather the data from this and put it into a database, but I need some help interpreting the data. Is there a guide or something, somewhere, anywhere, that explains or defines the XML nodes? I think I've ascertained that in order to find the PitchFX data you have to go to the specific year, then month, then day, then select the game you want to look at, then select the "premium" folder, and then go to the "pitchers" folder.

This gives you a list of folders, one for each pitcher who pitched in that game. I did some digging and found that Jason Hammel's ID in these files is 434628. I'm not sure if there is a way to figure out the name associated with a pitcher ID from the files in that pitcher's premium folder, but if there is, I haven't found it yet.

There are a lot of XML files in there. I count 25. I have no idea what they are or what the data inside them represents, but this is my best guess for where the PitchFX data is located. Pitch_Tendencies_Game is the most likely candidate, and it has the most data of all the XML files in a pitcher's folder... but I don't have the slightest clue what the data in that file means right now. It doesn't seem to have enough data in there to actually represent the PitchFX data. So I am truly at a loss. Also, I can't figure out what the difference is between most of the files... most of them, at a quick glance, seem to contain identical data.

Anybody have any idea where to start in interpreting this slew of MLB data they have available? I just can't wrap my head around it... the node names are too vague, I can't find a help file on what represents what anywhere...

EDIT: OK, not quite as simple as I thought... you can't look at it on a per-pitcher basis in the flat file, you have to look at it by going from the main game folder to the innings folder. I think you can look at it from the perspective of each inning individually or all_innings, but I don't know if there is a difference between the XML files all concatenated into one file, or if the all_innings file is missing some of the data. Have to look into it.

EDIT 2: Now I'm getting somewhere. Found this site with some explanation on the data: http://webusers.npl.illinois.edu/~a-nathan/pob/tracking.htm
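For anyone following along, here is a rough Python sketch of the year/month/day/game/innings navigation described above, plus a parser for the pitch records. Treat the specifics as assumptions to check against the tracking guide: the `inning/inning_all.xml` path, the `atbat` and `pitch` element names, and the `des`, `px`, `pz`, `sz_top`, `sz_bot` attributes are how I understand the feed, and `innings_url`, `parse_pitches`, and the `SAMPLE` document (with made-up numbers) are purely illustrative.

```python
import xml.etree.ElementTree as ET
from datetime import date

BASE = "http://gd2.mlb.com/components/game/mlb"


def innings_url(day, game_folder):
    # Assumed layout: year_YYYY/month_MM/day_DD/<game folder>/inning/inning_all.xml
    return (f"{BASE}/year_{day.year}/month_{day.month:02d}/day_{day.day:02d}/"
            f"{game_folder}/inning/inning_all.xml")


def parse_pitches(xml_text):
    """Return one dict per pitch that carries tracking data."""
    root = ET.fromstring(xml_text)
    pitches = []
    for atbat in root.iter("atbat"):
        for pitch in atbat.iter("pitch"):
            if pitch.get("px") is None:
                continue  # skip pitches with no location recorded
            pitches.append({
                "pitcher": atbat.get("pitcher"),
                "call": pitch.get("des"),
                "px": float(pitch.get("px")),
                "pz": float(pitch.get("pz")),
                "sz_top": float(pitch.get("sz_top")),
                "sz_bot": float(pitch.get("sz_bot")),
            })
    return pitches


# Made-up sample in the assumed shape, so the parser can be tried offline.
SAMPLE = """<game>
  <inning num="1"><top>
    <atbat pitcher="434628">
      <pitch des="Called Strike" px="0.412" pz="2.105" sz_top="3.4" sz_bot="1.6"/>
      <pitch des="Ball" px="-1.210" pz="1.050" sz_top="3.4" sz_bot="1.6"/>
      <pitch des="Foul"/>
    </atbat>
  </top></inning>
</game>"""

if __name__ == "__main__":
    print(innings_url(date(2012, 4, 6), "gid_hypothetical_game"))
    for p in parse_pitches(SAMPLE):
        print(p)
```

Keying each pitch to the `pitcher` attribute on the at-bat would also sidestep the per-pitcher folders entirely, if the feed really is laid out this way.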


There are some tools available to load the data in a MySQL DB. I have not tried them, however.

http://baseballanalysts.com/archives/2010/03/how_can_i_get_m.php

I see two potential ways to go about this.

1. Use the tools to load the data into your own DB, then have at it however you like: SQL, web app, whatever. This is the most flexible solution, as you can literally do anything you want with the entire set of Pitch f/x data. But it's way more upfront work, and you have to maintain a DB plus a web application.

2. Write some parsers that run daily: they download the day's data, process it, and update a database containing only the results you are interested in. This is the easiest way to solve a single problem. But you would have to write new parsers and a corresponding DB schema for each thing you want to do. This won't scale well.

I would do #2 (if I had time to do this) then migrate to #1 if I actually wrote enough scripts to think scaling is an issue.

Python is King!
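A minimal sketch of what option #2 could look like, using SQLite from the Python standard library rather than MySQL just to keep it self-contained. The table name, the column names, and the `open_db` and `store_game` helpers are all hypothetical, and the `"Ball"` / `"Called Strike"` labels are my assumption about how the feed describes taken pitches.

```python
import sqlite3


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    # Only the columns needed for the ball/strike question are kept.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS called_pitches (
            game_id   TEXT,
            pitch_seq INTEGER,
            pitcher   TEXT,
            px REAL, pz REAL, sz_top REAL, sz_bot REAL,
            call TEXT,
            PRIMARY KEY (game_id, pitch_seq)
        )""")
    return conn


def store_game(conn, game_id, pitches):
    """Store only pitches the umpire actually had to call."""
    rows = [
        (game_id, seq, p["pitcher"], p["px"], p["pz"],
         p["sz_top"], p["sz_bot"], p["call"])
        for seq, p in enumerate(pitches)
        if p["call"] in ("Ball", "Called Strike")
    ]
    with conn:  # commits on success, rolls back on error
        # INSERT OR REPLACE keeps a re-run of the daily job from duplicating rows.
        conn.executemany(
            "INSERT OR REPLACE INTO called_pitches VALUES (?,?,?,?,?,?,?,?)", rows)
    return len(rows)
```

The composite primary key is what makes the daily script safe to run twice on the same game, which tends to happen with cron jobs.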


I'm looking at the dead center of the pitch box vs the line. This is the same as looking at the majority of the box. If the pitch is exactly evenly split by the line (this has happened maybe 2 or 3 times in 20 games), I've been giving the umpire the benefit of the doubt.

I don't think worrying about what the rulebook says is all that relevant; honestly, the analysis is just not precise enough to worry about this sort of fine distinction. It's important to remember that, as a result, differences of +2 or +3 pitches are almost certainly too small to really mean anything. CA-ORIOLE was suggesting that splitting things up into zones and then counting pitches in those zones would be better than relying on the PitchFX strikezone, and I agree, but initially I wanted to get through the O's first 15 games or so quickly to get some (preliminary, approximate) results.
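For reference, the center-of-the-box method can be written down in a few lines of Python. This is only a sketch of the rule as described above, not the actual procedure used on the charts: `grade_call`, `tally`, and the label names are made up, the zone is a flat rectangle using the 17-inch plate width, and the sign convention for the running count is left out because the thread doesn't spell it out.

```python
from collections import Counter

HALF_PLATE_FT = 8.5 / 12  # half of the 17-inch plate, in feet


def grade_call(px, pz, sz_top, sz_bot, called_strike):
    """Grade one taken pitch by where the center of the ball was."""
    inside = abs(px) < HALF_PLATE_FT and sz_bot < pz < sz_top
    outside = abs(px) > HALF_PLATE_FT or pz < sz_bot or pz > sz_top
    if not inside and not outside:
        # Center sits exactly on the line: umpire gets the benefit of the doubt.
        return "correct"
    if inside and not called_strike:
        return "missed_strike"   # in the zone, called a ball
    if outside and called_strike:
        return "extra_strike"    # out of the zone, called a strike
    return "correct"


def tally(calls):
    """calls: iterable of (px, pz, sz_top, sz_bot, called_strike) tuples."""
    return Counter(grade_call(*c) for c in calls)
```

A pitcher's line for a game is then just the `extra_strike` and `missed_strike` counts side by side, and a gap of two or three between them is within the noise, as noted above.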

I think you definitely have to break the strike zone down into multiple zones (internal and external) to get a true idea what's going on with the umpiring. I think you'd need at least 6 internal and 14 external zones. High/low has been tight for awhile. It seems to me they have loosened up the low calls in recent years and are calling them more often, but a lot of low (inside the zone) strikes simply aren't called strikes. It may be worse on the high strikes, but the high strikes are easily the hardest to define.

The other issue is pro-rating the calls instead of using an up/down count, which might be very misleading imo. As an example, in Chen's chart the other day, I think he had 3 balls just off (say 1"-2") the outside corner/middle and all three were called strikes. Chen got 100% of calls in that particular area on that day. Let's say, as an example, that the opposing pitcher got 6 strike calls in the same area and had two more pitches there called balls. That would be a rate of 75% (6 of 8) as compared to Chen's 100%, yet Chen gets a minus 3 by your system. Now, it may be that Chen threw more balls in the same area, but they could have been fouled off, swung at and missed, or hit. To me, that's the level of detail that might be needed to get into this. Even with that, I think you'll need much more data over the course of the year, plus a breakdown by umpire, pitcher, pitch type, framing, etc., to get more detail/meaning.
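Here is a small Python sketch of the pro-rated version: a called-strike rate per pitcher per zone instead of a running plus/minus. The function name and the zone label are made up, and I'm reading the example as six strike calls plus two balls (eight taken pitches) for the opposing pitcher, which is my assumption about what was meant.

```python
from collections import defaultdict


def called_strike_rates(calls):
    """calls: iterable of (pitcher, zone_label, called_strike) for taken pitches.
    Returns {(pitcher, zone_label): fraction of those pitches called strikes}."""
    counts = defaultdict(lambda: [0, 0])  # [strikes, total] per pitcher and zone
    for pitcher, zone, called_strike in calls:
        counts[(pitcher, zone)][0] += int(called_strike)
        counts[(pitcher, zone)][1] += 1
    return {key: strikes / total for key, (strikes, total) in counts.items()}


# The example from the post: Chen 3 for 3, the opposing pitcher 6 for 8.
example = (
    [("Chen", "just_off_outside", True)] * 3
    + [("Opponent", "just_off_outside", True)] * 6
    + [("Opponent", "just_off_outside", False)] * 2
)

if __name__ == "__main__":
    for key, rate in called_strike_rates(example).items():
        print(key, f"{rate:.0%}")
```

Swung-at pitches never enter `calls`, so a pitcher who threw more pitches to that spot but had them fouled off or hit isn't penalized or credited for them.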


Any part of the ball crossing the plane should be a strike as far as I know.

It's actually a pentagonal prism, bounded by the points of home plate, by the bottom of the kneecaps low, and by the uniform letters high. (technically, the top of the zone is supposed to be the midpoint between the belt and the top of the shoulders, but in practice, this is right around the letters.)

Here's an image that accurately describes it.

http://upload.wikimedia.org/wikipedia/commons/8/89/Strike_zone_en.JPG


Interesting. Thanks. The high strike zone is defined but, I would still say, a bit ambiguous. Based on your chart, I think the only issue might be a ball that somehow curves around into the strike zone without crossing the front plane. I'm not sure if that's even physically possible, though that type of pitch is certainly described in baseball lore.
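The back-door case can at least be expressed against the prism, whether or not a real pitch can do it. Below is a hedged Python sketch: the plate is the usual 17-inch pentagon (square for the first 8.5 inches, then tapering to the point), the top of the zone is the belt/shoulder midpoint as described above, and the trajectory is a list of made-up sample points for the center of the ball. `zone_top`, `over_plate`, and `enters_zone` are names I invented for this.

```python
PLATE_WIDTH_IN = 17.0   # front edge of home plate
HALF_WIDTH_IN = 8.5     # also the depth of the square part of the plate


def zone_top(shoulder_top_in, belt_in):
    # Top of the zone as the midpoint between the belt and the top of the shoulders.
    return (shoulder_top_in + belt_in) / 2


def over_plate(x, y):
    """x: inches from the plate's centerline; y: inches behind the front edge."""
    if 0 <= y <= HALF_WIDTH_IN:
        return abs(x) <= HALF_WIDTH_IN        # square front portion
    if HALF_WIDTH_IN < y <= PLATE_WIDTH_IN:
        return abs(x) <= PLATE_WIDTH_IN - y   # tapering back to the point
    return False


def enters_zone(path, top, bottom):
    """path: iterable of (x, y, z) samples of the ball's center, in inches."""
    return any(over_plate(x, y) and bottom <= z <= top for x, y, z in path)


if __name__ == "__main__":
    top = zone_top(shoulder_top_in=60.0, belt_in=40.0)
    # Made-up samples: an inch wide at the front edge, over the plate 6 inches back.
    back_door = [(9.5, 0.0, 30.0), (8.9, 3.0, 29.5), (8.3, 6.0, 29.0)]
    print(over_plate(9.5, 0.0), enters_zone(back_door, top, bottom=20.0))
```

A system that only checks the front plane would call that sample path a ball, while a check against the whole prism would call it a strike, so the choice matters for any automated zone.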


That's disappointing. I was hoping the definition included the letters, and I'd definitely be putting the script Baltimore on the part of the shirt you're tucking into your pants.


I don't think there's any doubt that the technology exists today to very accurately define a dynamic strike zone with a computer-aided system, and track pitches in and around that zone. There are systems today that automatically find, track, identify, target people, vehicles, etc from many miles away. There are systems that track incoming artillery shells moving faster than the speed of sound and almost instantaneously compute and target their origin. Of course some of those systems are expensive military projects. But even in somewhat cheaper, simplified form I have almost complete confidence that baseball could take essentially an off-the-shelf system and implement it in short order, and get far better results than an unaided umpire.

I'm fully aware that, if you delved into military technology, you would easily find examples of systems that could be adapted for this purpose. I would even go a step further, and say that we could accomplish the tracking and decision portions of such a system relatively easily. However, a major challenge of this system is making the data easily accessible to people without a lot of technical training (umpires). Military applications don't have this problem, because either the people evaluating the data are well-trained, or because the systems evaluating the data are designed to blow up.

One solution I can see is a challenge system on balls and strikes, where each team gets up to 3 unsuccessful challenges (+1 for extra innings). Kind of similar to tennis's implementation of Hawk-Eye. I just don't see an easily-implementable solution where umpires can use this data on every call without significantly slowing the game down. (I have considered the possibility of not-so-easily implementable solutions - like augmented reality glasses, but they are areas of ongoing research, and you'd be hard pressed to get an institution like MLB to implement solutions that are still in the R&D phase.)
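The bookkeeping for that challenge rule is simple enough to sketch in Python. Everything here is hypothetical, including the class name, and I'm assuming the "+1 for extra innings" is granted once when the 10th inning starts rather than once per extra inning.

```python
class ChallengeTracker:
    """Tracks one team's ball/strike challenges under the proposed rule."""

    def __init__(self, allowed_misses=3):
        self.remaining = allowed_misses   # unsuccessful challenges still allowed
        self.extra_granted = False

    def start_inning(self, inning):
        # Assumption: a single bonus challenge once the game reaches extra innings.
        if inning > 9 and not self.extra_granted:
            self.remaining += 1
            self.extra_granted = True

    def challenge(self, overturned):
        """Return False if the team has no challenges left, True otherwise."""
        if self.remaining <= 0:
            return False
        if not overturned:
            self.remaining -= 1   # only unsuccessful challenges cost anything
        return True
```

Because successful challenges are free, a team that keeps being right can keep challenging, which is the same incentive Hawk-Eye gives tennis players.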


Maybe I'm missing something, but I think it's fairly trivial to have the system tied to a handheld, wireless "buzzer" that beeps or vibrates or flashes (or all three) when the pitch is a strike. Ump keeps it in his hand. Just has to be implemented so it's reliable (with ready spares) and near realtime.


That's disappointing. I was hoping the definition included the letters, and I'd definitely be putting the script Baltimore on the part of the shirt you're tucking into your pants.

You could still game the definition by having low-rise uniform pants that ended halfway down your crack. The midpoint would drop to around your belly button.


Maybe I'm missing something, but I think it's fairly trivial to have the system tied to a handheld, wireless "buzzer" that beeps or vibrates or flashes (or all three) when the pitch is a strike. Ump keeps it in his hand. Just has to be implemented so it's reliable (with ready spares) and near realtime.

I was under the assumption that the umpire would still be required to verify the system's results, so he would need access to the ball's trajectory, rather than a simple yes/no response. Machines make mistakes too. And you know that, even if the system were 99.99% effective, that one time in 10,000 pitches where the buzzer went off on a pitch at the ankles would result in cries for a return to human umpires.


I was assuming that the home plate ump would continue doing his job much as he always has, but in addition to watching each pitch from behind the catcher he'd have instant feedback from the buzzer. If the ball hit 4' in front of the plate and bounced through the zone he'd ignore the buzzing and call it a ball.

And MLB would track the results. So if an ump had 25 (hell, 5!) calls a game that didn't agree with the system he'd have some 'splainin' to do.
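The tracking side of that could be as simple as the Python sketch below. The field names, the `disagreements` and `needs_review` helpers, and the default threshold of 5 per game are all illustrative assumptions; bounced pitches are left out of the count since the ump is supposed to overrule the buzzer on those.

```python
def disagreements(pitches):
    """pitches: list of dicts with boolean 'system_strike', 'ump_strike', 'bounced'."""
    return sum(
        1 for p in pitches
        if not p["bounced"] and p["system_strike"] != p["ump_strike"]
    )


def needs_review(pitches, threshold=5):
    # Flag the game when the ump and the system disagreed at least `threshold` times.
    return disagreements(pitches) >= threshold
```

Run over a season, the same count per umpire would show who is consistently out of step with the system and who just had one rough night.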


This makes the most sense to me.

Which means the big problem is how to make the system work really fast and not have downtime. Plus security so some hacker in the stands can't buzz strike calls wirelessly to the umpire ;)

Unless some military technology could be used as "Strike Zone Goggles" showing a heads up display of the 3-D strike zone for this particular batter and indicating the real time flight path of the ball.

Cylon Umpires! Or maybe the Borg is a better nerd-alogy.


You could still game the definition by having low-rise uniform pants that ended halfway down your crack. The midpoint would drop to around your belly button.

Yes! But it's going to be hard convincing Buck; I think that's in the same category as wearing your hat crooked. And don't even start with Trembley, that's not respecting the game!


Archived

This topic is now archived and is closed to further replies.

