
Two Myths I'm Ready to Debunk


Frobby

Recommended Posts

It's not just that I disagree with you. It's that the entire premise of your argument is absurd and wrong. But, since I've said that roughly 523 times without you admitting that I'm right, I'll just move on.

Yeah just move on... I'd rather not deal with you on here or in person anyway



Did you not see the 65 percent success rate? Your question has been answered.

Actually, no, not yet... I had to go do stuff and just stopped by... and now must go do other stuff for a while... I will indeed look at the various links later...

I’ve reviewed the various links that people posted about the question at hand, and here is a summary of what I saw when I followed each of those links.

But first, let’s make sure everybody understands what the fuss is about. My claim is that:

  • Several stat-salesmen around here routinely use stats incorrectly. I ain’t saying they’re trying to, I’m just saying they are.
  • The main way they do this is by using “normative data” (data about what is normal for large groups of people) improperly: treating it as a predictor of what a given individual will do.
  • This error is similar to saying, “The average life expectancy for people born in the year you were born is 76 years. Therefore, we can predict that you will drop dead on or about your 76th birthday.”
  • Then, when somebody points out that such a prediction is blatantly goofy, the response is “Of course, this prediction is not 100% accurate, we *never* said it is. But everybody knows it’s reliable... because we say so.”
  • Then, when somebody points out that this is a lousy justification for doing something that's simply not justified, people start getting all snippy, as if you’re diss’ing Jesus or something.
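The life-expectancy analogy is easy to make concrete. Here is a minimal sketch: the mean of 76 comes from the analogy above, but the standard deviation of 15 years is purely an assumed number for illustration. It shows that even when the group average is exactly right, the chance that any given individual lands near it is small:

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

MEAN_AGE = 76.0   # the group average from the analogy above
SD_AGE = 15.0     # assumed spread, for illustration only

# Probability that a randomly chosen person dies within a year of age 76:
p_near_76 = normal_cdf(77, MEAN_AGE, SD_AGE) - normal_cdf(75, MEAN_AGE, SD_AGE)
print(f"P(death within +/- 1 year of the mean): {p_near_76:.3f}")
```

With these assumed numbers, only about 5% of people die within a year of the group average, which is the whole point: a group mean is not an individual prediction.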

Instead of having stupid, heated arguments that go round-and-round in circles, it would be better for all of us to simply know what degree of confidence we can have that MiL numbers predict ML performance for a given player. Once we know that, then we can all make more-reasonable judgments about how much weight to put on such numbers vs. other factors, such as the fuzzy things that high-quality coaches, scouts, and other observers might see but which don’t translate into well-defined arguably-objective stats.

Therefore, the question-at-hand is: To what degree can you trust MiL numbers to reliably predict how a given individual will do at the ML level?

Below are the links that various people posted in response to this, along with what I saw when I read them. (I’m not saying that what I saw is what you saw. Sometimes I miss things.)

Link 1: http://www.baseballprospectus.com/statistics/minoreqa.php (in post #54 by Drungo). As I said in post #61, this has nothing to do with what we're talking about. Nothing whatsoever.

Link 2: http://www.baseballprospectus.com/article.php?articleid=2515 (in post #88 by Drungo). This article also has nothing to do with the question. It compares different stat forecasting systems to each other, and in all cases focuses on large-ish groups of ML players (sometimes all ML players, other times significant subsets of them). It does not touch on the topic of whether any of these tools can do a decent job of predicting the performance of individual players.

However, I did find one amusing (or sad?) thing, and it happened in two places. With respect to both hitting and pitching, it noted that all the prediction tools underestimated the variance that actually occurred (which is a fine observation), but then attributed the non-forecasted variance to “luck”. Think about this for a sec. Variance is exactly what one would/could/should expect, just like one would not expect you to drop dead on your 76th birthday. But because the forecasting tools don’t do a good job of modeling reality, they attribute the diff between reality and their tools’ predictions to “luck”. A more responsible comment would be: “The combination of available data and our forecasting tools is still too primitive to adequately deal with the kind of variance that is completely normal in baseball, a fact demonstrated by all of our tools underestimating variance.” But they don’t say that. Instead, they dismiss normal phenomena that their tools can’t cope with as being “just luck”.
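The variance point can be illustrated with a quick simulation (every number here is an assumption chosen for illustration, not anything taken from the article): even if a forecaster knew every hitter's true talent exactly, observed batting averages over a 500-AB season would still spread out more than the talent itself, because of ordinary sampling noise.

```python
import random
import statistics

random.seed(0)

N_HITTERS, AB = 1000, 500
TALENT_MEAN, TALENT_SD = 0.260, 0.020   # assumed "true talent" distribution

talents, observed = [], []
for _ in range(N_HITTERS):
    t = random.gauss(TALENT_MEAN, TALENT_SD)   # the hitter's true ability
    talents.append(t)
    hits = sum(1 for _ in range(AB) if random.random() < t)  # a season of ABs
    observed.append(hits / AB)

print(f"sd of true talent: {statistics.stdev(talents):.4f}")
print(f"sd of observed BA: {statistics.stdev(observed):.4f}")
```

The observed spread comes out noticeably wider than the talent spread. So a tool that underestimates variance isn't being sabotaged by "luck"; it's failing to model noise that is guaranteed to be there.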

Link 3: http://www.insidethebook.com/ee/index.php/site/comments/how_reliable_are_pecota_forecast_percentiles/ (also in post #88 by Drungo). This is a blog entry with more-than-several follow-up posts. It also does not in any way address the question. The original topic is how 5 different stat-based predictive tools compare at predicting Chris Carpenter’s “K minus BB per 9 innings”. Along the way, there is commentary about how one method (Marcel) provides reliability scores for its predictions (.83 for Carpenter, other values for other people). The bulk of the commentary is devoted to very arcane stuff about how the different predictive systems work, including many guesses about exactly what they do and don’t do, and much debate among follow-up posters re: the proper way to think about various aspects of it. While I know zip about the details of Marcel or the others, I like that the Marcel methodology is apparently transparent (based on what people said about it there) and that it includes reliability scores. In another thread, I mentioned that one common rule-of-thumb is that for anything to be considered a strong predictor, it must be over .80 and often must be in the high .90s (depending on the domain). Somebody here had a fit about that. This is a case of one method of predicting a very narrow aspect of performance being over that minimal .80 threshold. However, the content of this link is not at all relevant to the question we’re discussing.

Link 4: http://krex.k-state.edu/dspace/bitstream/2097/149/1/GaryBrentJohnson2006.pdf (in post #90 by BRobinsonfan). This is a Master’s thesis done by a guy in the Stats department at K-State. It's interesting, but it's a pain in the butt to wade through. The body of it is 53pp long (appendices push it to 78pp). Reading this thing will test both the patience and the knowledge of most folks, but that’s just the nature of these things. (My theory is that you pretty much have to have OCD to be an academic in non-artsy disciplines ;-)

He did not address the question at hand, but he did a bunch of work on a much-more-narrow question that is related to our question. To me, there are three things about this that are interesting:

1. What the guy actually did: He took 2002 stats for MiL hitters, and he took 2005 data about where those guys were in 2005 and how they did in 2005. Based on these 2 data sets (and only these 2 data sets, nothing else), he tried to figure out if there was some way to go back and look at the 2002 MiL numbers in a way that would predict how the guys turned out in 2005. In short, he was trying to “fit” 2002 MiL hitting stats to what actually happened in 2005. He put the 2002 guys into 5 categories based on what they did by 2005: “Not in the bigs”; “Cup-of-coffee in the bigs”; “ML journeyman”; “ML position player”; “ML star”. He then did all manner of stats stuff to figure out which 2002 factors turned out to fit with the 2005 outcome for the 2002 group as a whole. He came up with 4 performance factors, each of which is a calc based on selected hitting stats. Here are the four that proved to matter, along with the labels he used for them: “ability to slug” (a calc based on HR, AB, IsoPower), “lead-off hitting” (a calc based on singles, triples, runs scored, stolen bases, and caught-stealing), “patience at the plate” (a calc based on K’s, BB’s, and OBP), and “pure hitting ability” (BA). In addition, he used factors of “level” (where in the MiL they were in 2002) and OHH (an indication of whether a guy was “over his head” based on age and MiL level). Before people have a fit because he used BA, it’s not that he *started out* trusting BA to matter. Rather, he did factor analysis, regression whosits, etc., and *discovered* that it was one of the 4 performance factors that did indeed matter. (If he ever quantified his system’s reliability, I missed it.)

2. As you can see, the question he tried to tackle is much more narrow than our question. As you can also see (if you actually look at this thing), he had to do a *ton* of both stats work and careful analysis to address even this narrow question. He makes no bones about the fact that what he did was very limited in scope, and that there are various shortcomings in what he did (good academics tend to do this, otherwise they get hammered). I think this is instructive because it helps illustrate how hard the problem is. It’s a very hard problem that requires lots of work like this before we have any realistic hope of an adequate model. I keep saying that the problem of using stats to model Actual Baseball is sufficiently hard that it is not a matter of using stats like a cookbook (even though many folks keep doing exactly that). If nothing else, I think this paper helps demonstrate how true this is.

3. In the end, within the limits of the data he looked at, he took what he discovered about 2002 MiL hitters as a group, applied this to individual 2002 MiL hitters, and came up with a ranking of the “Top 50” 2002 MiL hitters. This ranking is based on what his analysis told him was the best way to look at 2002 MiL stats in order to fit them to what guys did in 2005. It’s fun to look at those rankings (they appear on pp. 49-50 of the paper). In addition to the Top 50, he also reports how his scheme ranked a few other notables. Among those in the list: Ryan Howard: #357; Miguel Cabrera: #239; Jason Bay: #141; Rocco Baldelli: #40; Sean Burroughs: #29; Mark Teixeira: #17; Victor Martinez: #8. As you can see, his scheme said some guys would be lousy who turned out to be quite good. And here’s the Top 5: Scott Hairston at #5; Wily Mo Pena at #4; Travis Hafner at #3; Hee Seop Choi at #2… and the Number 1 guy in his rankings: Jack Cust!
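For anyone curious about the general shape of what the thesis did (standardize stats, combine them into factor-like scores, rank players), here is a purely hypothetical sketch. The stat lines, the three inputs, and the equal weights are all my inventions; the thesis derived its factors and weights empirically, and they are different from these:

```python
import statistics

# Entirely made-up stat lines; the thesis's real inputs and weights differ.
players = {
    "A": {"hr_rate": 0.06, "bb_rate": 0.12, "ba": 0.310},
    "B": {"hr_rate": 0.03, "bb_rate": 0.08, "ba": 0.280},
    "C": {"hr_rate": 0.08, "bb_rate": 0.05, "ba": 0.250},
}

def zscores(values):
    """Standardize a list of values to mean 0, sd 1."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

names = list(players)
stat_keys = ["hr_rate", "bb_rate", "ba"]
z = {k: dict(zip(names, zscores([players[n][k] for n in names])))
     for k in stat_keys}

# Equal-weight composite score (hypothetical; the thesis fit its weights to data)
composite = {n: sum(z[k][n] for k in stat_keys) for n in names}
ranking = sorted(names, key=composite.get, reverse=True)
print("Ranking:", ranking)
```

The hard part, of course, is what the thesis actually spent 53 pages on: *discovering* which combinations fit the later outcomes, rather than assuming weights up front.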

Link 5: http://sports.espn.go.com/espn/page2/story?page=keri/070214 (in post #94 by Hoosiers). This is where we get the .65 value that everybody started quoting around here as if it answers our question. It doesn’t. The .65 value is reported to be what is true for predictions based on 3-year weighted averages for ML guys. We don’t know where that number comes from, but let’s not worry about that, let’s just say it’s true. That says nothing about predicting ML numbers based on MiL numbers. We have no information here that helps us answer our question. However, I think it’s safe to say that if the best ways of using ML numbers to predict ML numbers provide .65 reliability, then the best ways to use MiL numbers to predict the same thing will be no better, and are likely to be worse.

But let’s imagine that .65 is the number we’re looking for (even though it is clearly not the number we’re looking for). What does this .65 number tell us? Some people would say it’s “pretty damn good”. Those people would be wrong. You must keep in mind that a coin flip about whether a guy will meet some yes/no criterion (any criterion at all) will be right 50% of the time. So, the best tool we have gives us something that is (a) slightly better than a coin flip, and (b) well below the rule-of-thumb .80 minimum threshold for a strong predictor. In other words, .65 is not a strong predictor. The fact that it may be the best stat-based predictor we have does not change the fact that .65 is a lousy predictor.
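One way to see how modest .65 is, if we (generously) read it as an accuracy on yes/no predictions: measure how much of the gap between coin-flipping and perfection it actually closes. This is just arithmetic, shown here as a sketch:

```python
def improvement_over_chance(accuracy, chance=0.5):
    """Fraction of the gap between guessing and perfection that is closed
    (the same arithmetic as Cohen's kappa with balanced yes/no classes)."""
    return (accuracy - chance) / (1.0 - chance)

for acc in (0.65, 0.80, 0.95):
    gap_closed = improvement_over_chance(acc)
    print(f"accuracy {acc:.2f} -> closes {gap_closed:.0%} of the gap over a coin flip")
```

By this yardstick, .65 closes only 30% of the gap, while the .80 rule-of-thumb threshold closes 60%.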

So, how useful is .65 as a predictor? If you’re going to Vegas and making a bunch of bets, then .65 will permit you to make a fortune over a long time (assuming you don’t lose your bankroll first). But there is nothing about it that is strong enough to have it override other data points when making decisions about an individual player. You don’t ignore it, but neither do you treat it as being a “strong” or “reliable” predictor because it simply is not. Rather, it's nothing more-or-less than one thing that belongs in the mix as people try to make good judgments. Other data points include the informed, subjective judgment of “high quality observers” such as those coaches and scouts who are good at judging talent and character.
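The Vegas remark can be quantified under one big assumption: even-money bets with no vig (real books take a cut, so the true edge would be smaller). At a 65% win rate the edge is real but nothing like a sure thing:

```python
def edge_per_dollar(p_win):
    """Expected profit per $1 flat bet at even money: +1 with prob p, -1 otherwise."""
    return p_win * 1.0 + (1.0 - p_win) * (-1.0)

def kelly_fraction(p_win):
    """Kelly-optimal share of bankroll to stake per even-money bet."""
    return 2.0 * p_win - 1.0  # (b*p - q) / b with odds b = 1

p = 0.65
print(f"edge per $1 flat bet: {edge_per_dollar(p):+.2f}")
print(f"Kelly stake:          {kelly_fraction(p):.0%} of bankroll")
```

A 30-cent expected profit per dollar compounds into a fortune over many bets, yet any single bet still loses 35% of the time, which is exactly the gap between "profitable in aggregate" and "reliable for one player".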

The fact that not all coaches and scouts are good at judging these things does not change this, it only acknowledges that the picture is fuzzy. Stats, on the other hand, provide the false illusion that things are not fuzzy when they really are. The truth of the matter is that baseball is fuzzy, and that stats do not provide an adequate way to model it. Until we have a more-adequate body of stats and adequate ways of using those stats, using MiL numbers as if they are a strong predictor of ML performance is simply wrong. The most you can say about it is what Drungo said when he said that he has a “good feeling” about trusting them. Relying on subjective “feelings” is exactly what stat guys claim that stats enable us to get past, so that we can instead base decisions on things that are “objective”. But in the end, subjective “feeling” is the *only* justification we have seen for trusting MiL numbers as anything but a lousy predictor of *individual* ML performance. Is it a predictor? Yes. Is it anything but a weak predictor? No. (Of course, this is based on that phantom .65 value that refers to something else.)

PS: I think maybe the reason that some folks think .65 sounds good is because of the grading system we’re used to in school. In that system, 65 is a passing grade. But you can’t think like that here. One rule of thumb I remember from ages ago might be wrong, but here’s what it is. I assume that these reliability scores are some kind of r-value (or something akin to that). Try taking r-squared and using that to give you a rough sense of “the grade they deserve”. Doing it this way, if you take .80 reliability and square it, you get .64. That’s “barely passing”, as it should be. Take .90 and square it and you get .81, which is a low “B”: pretty good but not great. Take .65 and square it, and you get .42, which is “flunking” as a good predictor. That’s about right. (My stats books are in the garage, buried behind other junk where they belong, so I can’t dig up the stat procedure it’s based on. I’m just trusting memory here, which might be dangerous, so don’t hold me to this. I’m not trying to start another argument with this, just trying to help people have a decent quick-and-dirty way to look at it, that’s all.)
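For what it's worth, the quick-and-dirty grading above is just squaring, and squaring a correlation does have a standard meaning (the share of variance explained), though whether these published reliability figures really are r-values is, as stated, a guess from memory:

```python
# Square each reliability score to get the usual r^2 "variance explained" reading.
# Treating the published figures as r-values is an assumption, per the post above.
r_values = (0.65, 0.80, 0.90)
r_squared = {r: r * r for r in r_values}
for r, r2 in r_squared.items():
    print(f"r = {r:.2f} -> r^2 = {r2:.2f} of the variance explained")
```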


The fact that not all coaches and scouts are good at judging these things does not change this, it only acknowledges that the picture is fuzzy. Stats, on the other hand, provide the false illusion that things are not fuzzy when they really are. The truth of the matter is that baseball is fuzzy, and that stats do not provide an adequate way to model it. Until we have a more-adequate body of stats and adequate ways of using those stats, using MiL numbers as if they are a strong predictor of ML performance is simply wrong. The most you can say about it is what Drungo said when he said that he has a “good feeling” about trusting them. Relying on subjective “feelings” is exactly what stat guys claim that stats enable us to get past, so that we can instead base decisions on things that are “objective”. But in the end, subjective “feeling” is the *only* justification we have seen for trusting MiL numbers as anything but a lousy predictor of *individual* ML performance. Is it a predictor? Yes. Is it anything but a weak predictor? No. (Of course, this is based on that phantom .65 value that refers to something else.)

Drungo was not using "subjective feelings" in any analysis. He was saying the stats told a story he was very confident in - e.g., that Knott would be a more productive offensive player than Payton - even though his model allowed for deviations where Payton could conceivably perform better.

65% is a respectable number. It's enough to win big in Vegas or on Wall Street. It's not a number I'd bet my life on, though. It's an indicator. However, it's an indicator which has been shown to really outperform the thinking of many GMs of 10-15 years ago, particularly in identifying overpaid ballplayers who are really producing at replacement level.

The problem I (and Witchy) have been expressing is that the stat guys here present numbers with such authority and with such mild-sounding disclaimers that many people believe that the reliability of what is posted is well in excess of that 65%.


The problem I (and Witchy) have been expressing is that the stat guys here present numbers with such authority and with such mild-sounding disclaimers that many people believe that the reliability of what is posted is well in excess of that 65%.

And the issue is that, even if some of us didn't really know the 65 percent number, we thought that everyone knew that no stat is infallible. That should have been self-evident.

From now on, when someone makes a numbers-based prediction, can we have some kind of shorthand to appease the many parsers of words on this board? Can we develop an emoticon that denotes "This opinion has a 65 percent chance of being correct!"?

Maybe that way we won't have 5-page threads of pointless arguments.


From now on, when someone makes a numbers-based prediction, can we have some kind of shorthand to appease the many parsers of words on this board? Can we develop an emoticon that denotes "This opinion has a 65 percent chance of being correct!"?

Maybe that way we won't have 5-page threads of pointless arguments.

Silly me for trying to learn something.

It's not .65. That's about something else. I'm just curious: how close to .50 does it have to get before you'd want the abbreviation to be "coin flip"? (I ain't saying it's a coin flip. I just don't know why we have to push for a consensus that's based on nothing of substance.) So far, we have zero reason to trust this any more than what good scouts, good coaches, or other good observers might say.


Silly me for trying to learn something.

It's not .65. That's about something else. I'm just curious: how close to .50 does it have to get before you'd want the abbreviation to be "coin flip"? (I ain't saying it's a coin flip. I just don't know why we have to push for a consensus that's based on nothing of substance.) So far, we have zero reason to trust this any more than what good scouts, good coaches, or other good observers might say.

You might have to define "good scout". And I imagine it's going to be a pretty difficult task to put a predictive value on that.

If a good scout is correct better than 75% of the time, I imagine Drungo and Rollie would go with the scout's number. I imagine that such a scout would be a more household name than Bill James, but I doubt this scout exists.

No one is claiming to use stats in a vacuum. They are a respectable indicator.


Here is a theory, try it on for size.

There is a category of hitter who is great at hitting mistake pitches. Hang a curve or throw a fastball down the heart of the plate and they will kill it. But they are incapable of hitting a good breaking ball or a well-placed fastball.

Well, in the minors they can put up pretty good numbers, because minor league pitchers make lots of mistakes. But put them in the majors, where the pitchers make far fewer mistakes, and suddenly they can't hit their weight.

There is another category of hitter who isn't quite as good at crushing the mistake pitches, but is less vulnerable to breaking stuff or well-placed fastballs. That hitter might generate the exact same statistical line at the AAA level as the hitter in my first example, but at the major league level he will do better.

And it's the job of the scouts to distinguish the two types. Does this make sense?


Here is a theory, try it on for size.

There is a category of hitter who is great at hitting mistake pitches. Hang a curve or throw a fastball down the heart of the plate and they will kill it. But they are incapable of hitting a good breaking ball or a well-placed fastball.

Well, in the minors they can put up pretty good numbers, because minor league pitchers make lots of mistakes. But put them in the majors, where the pitchers make far fewer mistakes, and suddenly they can't hit their weight.

There is another category of hitter who isn't quite as good at crushing the mistake pitches, but is less vulnerable to breaking stuff or well-placed fastballs. That hitter might generate the exact same statistical line at the AAA level as the hitter in my first example, but at the major league level he will do better.

And it's the job of the scouts to distinguish the two types. Does this make sense?

I don't know the answer, but it's a great question.

I wondered about something that boils down to this, but for me it was a much narrower question and you turned it into a theory. I was pondering House and his sudden disappearance, and here's what I wondered: regardless of how good our MiL pitching coaches are as pitching coaches, they know something about pitching... and they're watching the hitters hit... and I wondered if they looked at House and thought "Well, I know what ML pitchers are gonna do to this guy"... so maybe DT knows this, gives him enough ABs for it to become evident that those predictions proved accurate, and now House isn't getting ABs unless/until he shows Crow that he's fixed whatever it is... (all of this is based on nothing).


Here is a theory, try it on for size.

There is a category of hitter who is great at hitting mistake pitches. Hang a curve or throw a fastball down the heart of the plate and they will kill it. But they are incapable of hitting a good breaking ball or a well-placed fastball.

Well, in the minors they can put up pretty good numbers, because minor league pitchers make lots of mistakes. But put them in the majors, where the pitchers make far fewer mistakes, and suddenly they can't hit their weight.

There is another category of hitter who isn't quite as good at crushing the mistake pitches, but is less vulnerable to breaking stuff or well-placed fastballs. That hitter might generate the exact same statistical line at the AAA level as the hitter in my first example, but at the major league level he will do better.

And it's the job of the scouts to distinguish the two types. Does this make sense?

The problem is this: What if Hitter X mows down AAA pitching by hitting mistake pitches, but only gets - oh, I don't know - 20 ABs in September to prove that he can hit better pitching?

Are you going to take the scout's word for it based on 20 ABs, or should you give the guy a few hundred ABs to see if 1.) the scouts were wrong, and/or 2.) he might IMPROVE?

It seems absurd to me to have a guy hitting (or pitching) well at AAA, but not give him a chance based on scouts' opinions, no matter how good those scouts are. "Yup, this guy is pretty good, but I've noticed that he does this thing with his left foot...trust me, he's going to suck." Really? You're gonna take that at face value?

Nobody - "stat guys" or scouts - can predict the future. The best you can do is promote people based on 1.) results and 2.) probability of duplicating those results at a higher level. Stats have a large role to play in that, and anyone who doesn't see that...well, I don't know what to say.
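The 20-AB scenario above can be put in numbers. A 95% Wilson score interval around a hypothetical 6-for-20 (.300) line shows how little those September at-bats actually pin down:

```python
import math

def wilson_interval(hits, ab, z=1.96):
    """95% Wilson score interval for a batting average observed over `ab` at-bats."""
    p = hits / ab
    denom = 1.0 + z * z / ab
    center = (p + z * z / (2.0 * ab)) / denom
    half = z * math.sqrt(p * (1.0 - p) / ab + z * z / (4.0 * ab * ab)) / denom
    return center - half, center + half

lo, hi = wilson_interval(6, 20)   # a hypothetical .300 average over 20 ABs
print(f"6-for-20: true-talent BA plausibly anywhere in ({lo:.3f}, {hi:.3f})")
```

The interval runs from roughly .145 to .520: those 20 ABs are consistent with anything from a Quad-A bat to a star, so neither the scout nor the stat line has really been tested yet.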


I’ve reviewed the various links that people posted about the question at hand, and here is a summary of what I saw when I followed each of those links.

But first, let’s make sure everybody understands what the fuss is about. My claim is that:

  • Several stat-salesmen around here routinely use stats incorrectly. I ain’t saying they’re trying to, I’m just saying they are.
  • The main way they do by using “normative data” (data about what is normal for large groups of people) improperly by using it as a predictor of what a given individual will do.
  • This error is similar to saying, “The average life expectancy for people born in the year you were born is 76 years. Therefore, we can predict that you will drop dead on or about your 76th birthday.”
  • Then, when somebody points out that such a prediction is blatantly goofy, the response is “Of course, this prediction is not 100% accurate, we *never* said it is. But everybody knows it’s reliable... because we say so.”
  • Then, when somebody points out that this is a lousy justification for doing something that's simply not justified, people start getting all snippy, as if you’re diss’ing Jesus or something.

Instead of having stupid, heated arguments that go round-and-round in circles, it would be better for all of us to simply know what degree of confidence we can have that MiL numbers predict ML performance for a given player. Once we know that, then we can all make more-reasonable judgments about how much weight to put on such numbers vs. other factors, such as the fuzzy things that high-quality coaches, scouts, and other observers might see but which don’t translate into well-defined arguably-objective stats.

Therefore, the question-at-hand is: To what degree can you trust MiL numbers to reliably predict how a given individual will do at the ML level?

Below are the links that various people posted in response to this, along with what I saw when I read them. (I’m not saying that what I saw is what you saw. Sometimes I miss things.)

Link 1: http://www.baseballprospectus.com/statistics/minoreqa.php (in post #54 by Drungo). As I said in post #61, this has nothing to do with what we're talking about. Nothing whatsoever.

Link 2: http://www.baseballprospectus.com/article.php?articleid=2515 (in post #88 by Drungo). This article also has nothing to do with the question. It compares different stat forecasting system to each other, and in all cases focuses on large-ish groups of ML players (sometime all ML players, other times significant subsets of them). It does not touch on the topic of whether any of these tools can do a decent job of predicting the performance of individual players. However, I did find one amusing (or sad?) thing, and it happened in two places. With respect to both hitting and pitching, it noted that all the prediction tools underestimated the variance that actually occurred (which is a fine observation), but then attributed the non-forecasted variance to “luck”. Think about this for a sec. Variance is exactly what one would/could/should expect, just like one would not expect you to drop dead on your 76th birthday. But because the forecasting tools don’t do a good job of modeling reality, they attribute the diff between reality and their tools’ predictions to “luck”. A more responsible comment would be that “The combination of available data and our forecasting tools are still too primitive to adequately deal with the kind of variance that is completely normal in baseball. This fact is demonstrated by the fact that all our tools underestimate variance.” But they don’t say that. Instead, they dismiss normal phenomena that their tools can’t cope with as being “just luck”.

Link 3: http://www.insidethebook.com/ee/index.php/site/comments/how_reliable_are_pecota_forecast_percentiles/ (also in post #88 by Drungo). This is a blog entry with more-than-several follow-up posts. It also does not in any way address the question. The original topic is looking at how 5 different stat-based predictive tools compare at predicting the performance of Chris Carpenter’s “K minus BB per 9-innings”. Along the way, there is commentary about how one method (Marcel) provides reliability scores for its predictions (.83 for Carpenter, other values for other people). The bulk of the commentary is devoted to very arcane stuff about how the different predictive systems work, including many guesses about exactly what they do and don’t do, and much debate among follow-up posters re: the proper way to think about various aspects of it. While I know zip about the details of Marcel or the others, I like it that the Marcel methodology is apparently transparent (based on what people said about it there) and that it includes reliability scores. In another thread, I mentioned that one common rule-of-thumb is that for anything to be considered a strong predictor, it must be over .80 and often must be in the high 90’s (depending on the domain). Somebody here had a fit about that. This is a case of one method of predicting a very-narrow aspect of performance being over that minimal .80 threshold. However, the content of this link is not at all relevant to the questions we’re discussing. It has nothing to do with it.

Link 4: http://krex.k-state.edu/dspace/bitstream/2097/149/1/GaryBrentJohnson2006.pdf (in post #90 by BRobinsonfan). This is a Master’s thesis done by a guy in the Stats department at K-State. It's interesting, but it's a pain in the butt to wade through. The body of it is 53pp long (appendices push it to 78pp). Reading this thing will test the both the patience and knowledge of most folks, but that’s just the nature of these things. (My theory is that you pretty much have to have OCD to be an academic in non-artsy disciplines ;-)

He did not address the question at hand, but he did a bunch of work on a much-more-narrow question that is related to our question. To me, there are three things about this that are interesting:

1. What the guy actually did: He took 2002 stats for MiL hitters, and he took 2005 data about where those guys were in 2005 and how they did in 2005. Based on these 2 data sets (and only these 2 data sets, nothing else), he tried to figure out if there was some way to go back and look at the 2002 MiL numbers in a way that would predict how the guys turned out in 2005. In short, he was trying to “fit” 2002 MiL hitting stats to what actually happened in 2005. He put the 2002 guys into 5 categories about what guys did by 2005: “Not in the bigs”; “Cup-of-coffee in the bigs”; “ML journeyman”; “ML position player”; “ML star”. He then did all manner of stats stuff to figure out which 2002 factors turned out to fit with the 2005 outcome for the 2002 group as a whole. He came up with 4 performance factors, each of which is a calc based on selected hitting stats. Here are the four that proved to matter, along with the labels he used for them: “ability to slug” (a calc based on HR, AB, IsoPower), “lead-off hitting” (a calc based on singles, triples, runs scored, stolen bases and caught-stealing), “patience at the plate” (a calc based on K’s BB’s and OBP), and “pure hitting ability” (BA). In addition, he used factors of “level” (where in the MiL were they in 2002) and OHH (an indication of whether a guy was “over his head” based on age and MiL level). Before people have a fit because he used BA, it’s not that he *started out* trusting BA to matter. Rather, he did factor analysis, regression whosits, etc., and *discovered* that it was one of the 4 performance factors that did indeed matter. (If he ever quantified his system's reliability, I missed it.)

2. As you can see, the question he tried to tackle is much more-narrow than is our question. As you can also see (if you actually look at this thing), he had to do a *ton* of both stats work and careful analysis to address even this narrow question. He makes no bones about the fact that what he did was very limited in scope, and that there are various shortcomings about what he did (good academics tend to this, otherwise they get hammered). I think this is instructive because it helps illustrate how hard the problem is. It’s a very hard problem that requires lots of work like this before we have any realistic hope of an adequate model. I keep saying that the problem of using stats to model Actual Baseball is sufficiently hard that it is not a matter of using stats like a cookbook (even though many folks keep doing exactly that). If nothing else, I think this paper helps demonstrate how true this is.

3. In the end, within the limits of the data he looked at, he took what he discovered about 2002 MiL hitters as a group, applied this to individual 2002 MiL hitters, and came up with a ranking of the "Top 50" 2002 MiL hitters. This ranking is based on what his analysis told him was the best way to look at 2002 MiL stats in order to fit them to what guys did in 2005. It's fun to look at those rankings (they appear on pp. 49-50 of the paper). In addition to the Top 50, he also reports how his scheme ranked a few other notables. Among those in the list: Ryan Howard: #357; Miguel Cabrera: #239; Jason Bay: #141; Rocco Baldelli: #40; Sean Burroughs: #29; Mark Teixeira: #17; Victor Martinez: #8. As you can see, his scheme said some guys would be lousy who turned out to be quite good. And here's the Top 5: Scott Hairston at #5; Wily Mo Pena at #4; Travis Hafner at #3; Hee Seop Choi at #2... and the Number 1 guy in his rankings: Jack Cust!

Link 5: http://sports.espn.go.com/espn/page2/story?page=keri/070214 (in post #94 by Hoosiers). This is where we get the .65 value that everybody started quoting around here as if it answers our question. It doesn't. The .65 value is reported to be what is true for predictions based on 3-year weighted averages for ML guys. We don't know where that number comes from, but let's not worry about that; let's just say it's true. That says nothing about predicting ML numbers based on MiL numbers. We have no information here that helps us answer our question. However, I think it's safe to say that if the best ways of using ML numbers to predict ML numbers provide .65 reliability, then the best ways of using MiL numbers to predict the same thing will be no better, and are likely to be worse.

But let's imagine that .65 is the number we're looking for (even though it is clearly not the number we're looking for). What does this .65 number tell us? Some people would say it's "pretty damn good". Those people would be wrong. You must keep in mind that a coin flip about whether a guy will meet some yes/no criterion (any criterion at all) will be right 50% of the time. So, the best tool we have gives us something that is (a) only somewhat better than a coin flip, and (b) well below the rule-of-thumb .80 minimum threshold for a strong predictor. In other words, .65 is not a strong predictor. The fact that it may be the best stat-based predictor we have does not change the fact that .65 is a lousy predictor.
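If you want a sanity check on the coin-flip comparison, here's a quick simulation (my own toy, not from any of the links): count how often a fair coin "calls" a yes/no outcome correctly versus a predictor that's right 65% of the time.

```python
import random

random.seed(42)
N = 100_000

# A fair coin "predicts" a yes/no outcome correctly half the time;
# a .65-reliable predictor is right 65% of the time. Count the hits.
coin_hits = sum(random.random() < 0.50 for _ in range(N)) / N
pred_hits = sum(random.random() < 0.65 for _ in range(N)) / N

print(round(coin_hits, 2), round(pred_hits, 2))
print("edge over a coin:", round(pred_hits - coin_hits, 2))
```

The edge works out to roughly 15 percentage points over pure chance, which is real but nowhere near "we can call this guy's career."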

So, how useful is .65 as a predictor? If you're going to Vegas and making a bunch of bets, then .65 will permit you to make a fortune over a long time (assuming you don't lose your bankroll first). But there is nothing about it that is strong enough to have it override other data points when making decisions about an individual player. You don't ignore it, but neither do you treat it as being a "strong" or "reliable" predictor, because it simply is not. Rather, it's nothing more or less than one thing that belongs in the mix as people try to make good judgments. Other data points include the informed, subjective judgment of "high quality observers" such as those coaches and scouts who are good at judging talent and character.
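On the Vegas point: for even-money bets, the standard Kelly criterion gives a feel for how much of a bankroll a .65 edge justifies risking per bet. This is my own illustration of the textbook formula, not anything from the links:

```python
def kelly_even_money(p):
    """Kelly fraction for an even-money bet you win with probability p.
    Standard formula for even odds: f* = p - (1 - p) = 2p - 1.
    Bet nothing when you have no edge."""
    return max(0.0, 2 * p - 1)

print(round(kelly_even_money(0.50), 2))  # 0.0  -- coin flip, no edge
print(round(kelly_even_money(0.65), 2))  # 0.3  -- risk 30% of bankroll per bet
```

That's exactly the distinction I'm drawing: a .65 predictor is very profitable over *many* repeated bets, yet useless as a near-certain call on any *single* player.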

The fact that not all coaches and scouts are good at judging these things does not change this; it only acknowledges that the picture is fuzzy. Stats, on the other hand, provide the illusion that things are not fuzzy when they really are. The truth of the matter is that baseball is fuzzy, and stats do not provide an adequate way to model it. Until we have a more adequate body of stats and adequate ways of using those stats, using MiL numbers as if they are a strong predictor of ML performance is simply wrong. The most you can say about it is what Drungo said when he said that he has a "good feeling" about trusting them. Relying on subjective "feelings" is exactly what stat guys claim that stats enable us to get past, so that we can instead base decisions on things that are "objective". But in the end, subjective "feeling" is the *only* justification we have seen for trusting MiL numbers as anything but a lousy predictor of *individual* ML performance. Is it a predictor? Yes. Is it anything but a weak predictor? No. (Of course, this is based on that phantom .65 value that refers to something else.)

PS: I think maybe the reason some folks think .65 sounds good is the grading system we're used to in school. In that system, 65 is a passing grade. But you can't think like that here. One rule of thumb I remember from ages ago might be wrong, but here it is. I assume these reliability scores are some kind of r-value (or something akin to that). Try taking r-squared and using that to give you a rough sense of "the grade they deserve". Doing it this way, take .80 reliability, square it, and you get .64. That's "barely passing", as it should be. Take .90, square it, and you get .81, which is a low "B": pretty good but not great. Take .65, square it, and you get .42, which is "flunking" as a strong predictor. That's about right. (My stats books are in the garage, buried behind other junk where they belong, so I can't dig up the stat procedure this is based on. I'm just trusting memory here, which might be dangerous, so don't hold me to this. I'm not trying to start another argument with this, just trying to give people a decent quick-and-dirty way to look at it, that's all.)
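For what it's worth, the squaring trick is the standard "coefficient of determination": *if* these reliability scores really are correlation coefficients (which, as I said, I'm not sure of), then r-squared is the share of variance in the outcome that the predictor accounts for. A quick sketch of the arithmetic:

```python
def variance_explained(r):
    """If r is a correlation coefficient, r**2 is the proportion of
    variance in the outcome that the predictor accounts for."""
    return r * r

for r in (0.90, 0.80, 0.65):
    print(f"r = {r:.2f}  ->  r^2 = {variance_explained(r):.4f}")
# r = 0.90 -> 0.8100, r = 0.80 -> 0.6400, r = 0.65 -> 0.4225
```

So a .65 correlation would leave roughly 58% of the variation unexplained, which is the quick-and-dirty sense in which I'm calling it a weak predictor.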

Forgive me if I'm wrong, but r^2 analysis only matters in issues of causation as opposed to mere correlation. If PECOTA's projections CAUSED the players' stats, then you could use r^2 to say that (.65)^2, or 42%, of the variation in players' stats can be explained by changes in PECOTA projections. Since that premise doesn't make sense (which it doesn't), r^2 isn't applicable in this case.


I don't know the answer, but it's a great question.

I wondered about something that boils down to this, but for me it was a much narrower question, and you turned it into a theory. I was pondering House and his sudden disappearance, and here's what I wondered: regardless of how good our MiL pitching coaches are as pitching coaches, they know something about pitching... and they're watching the hitters hit... and I wondered if they looked at House and thought, "Well, I know what ML pitchers are gonna do to this guy"... so maybe DT knows this, gave him enough ABs for it to become evident that those predictions proved accurate, and now he's not getting ABs unless/until he shows Crow that he's fixed whatever it is... (all of this is based on nothing).

I don't believe that there is such a massive difference between AAA and the majors that there are types of players who can destroy AAA pitching but utterly fail in the majors. There's no evidence for that. The difference between AAA and MLB is about the same as the difference between AA and AAA, and that's based on comparisons of the performance of literally thousands of players.

The difference between AAA and the majors is one of small degrees, not a huge step. There are pitchers in AAA with 96 mph fastballs, knee-buckling curves, and killer changeups. Just not as many as in the majors. You simply can't hit .300 with power in the IL by feasting on mistakes and then completely fall apart in the majors, because major league pitchers make those same mistakes. Just not quite as often.


Drungo was not using "subjective feelings" in any analysis. He was saying the stats told a story, and that he was very confident in predicting certain outcomes from it - like Knott being a more productive offensive player than Payton, even though his model allowed for deviations where Payton could conceivably perform better.

65% is a respectable number. It's enough to win big in Vegas or on Wall Street. It's not a number I'd bet my life on, though. It's an indicator. However, it's an indicator that has been shown to outperform the thinking of many GMs 10-15 years ago, particularly in identifying overpaid ballplayers who are really producing at replacement level.

The problem I (and Witchy) have been expressing is that the stat guys here present numbers with such authority and with such mild-sounding disclaimers that many people believe the reliability of what is posted is well in excess of that 65%.

Thank you. That's as sensible a post as this thread's seen.


65% is a respectable number.

I'm not sure what "respectable" means. It's certainly not the same thing as "reliable" which has always been the claim until now. My old-lady neighbor is "respectable" but I wouldn't trust her to predict performance.

The problem I (and Witchy) have been expressing is that the stat guys here present numbers with such authority and with such mild-sounding disclaimers that many people believe the reliability of what is posted is well in excess of that 65%.

This core point is what I've been saying for a long time. However, now we have a new bogus claim: everybody's all of a sudden latching on to 65% to mean something it doesn't mean. That number was for using the last 3 years of ML numbers to predict upcoming ML performance. It has nothing to do with the reliability of MiL numbers as a predictor of ML numbers. I expect that the MiL number would be lower than 65%, but nobody seems to know. Seems odd that nobody would know, but that appears to be the case. Until somebody turns up something, let's not pretend it's 65%. We don't know what it is. But, whatever it is, it ain't gonna be better than 65% and it's likely to be noticeably worse.


Archived

This topic is now archived and is closed to further replies.
