Here is a Problem I have with WAR


waroriole

I mean, the question of whether Fangraphs or BB-ref made the better choice of defense-independent pitching statistics is legitimate, but it's a lesser of two evils. In my opinion, BB-ref's has way more serious problems. But what there is no question about is this: you must use a pitching metric that factors out the contributions of the defense. If you want to argue that BB-ref's is better, you're more than welcome to, but you have an uphill battle ahead of you. And that argument is a sideshow to your original point, which was that WAR is flawed because it is predictive rather than descriptive. That is incorrect, and the decision to use FIP for the defense-independent pitching metric was a conscious design choice to keep WAR as descriptive as possible.

At this point, we're repeating ourselves without offering any new evidence. I'm gonna offer a list of objections, but that's probably my last post in this thread.

(1) Why MUST we factor out contributions from defense when we measure a pitcher's performance? If your answer involves some version of "pitchers have no control over balls hit into play," please address the objections raised about groundball pitchers (FIP ignores GB%/LD%) and the small in number but glaring counterexamples (Guthrie, Matt Cain, Walter Johnson, Jim Palmer, etc. etc.) who maintained low BABIPs over extended periods of time. Please remember that FIP completely ignores defense; pitchers who get outs other than via strikeout get no positive credit.

(2) Why not also factor out defensive contributions when measuring offensive performance? For example, why does a double hit into the gap get the same credit as one where the outfielder dove for a catch and missed? Similarly, why don't we give batters some credit when fielders make great plays?

(3) Chris Tillman currently has a 4.69 ERA and a 3.55 FIP; Jeremy Guthrie has a 3.63 ERA and a 4.04 FIP. According to Fangraphs, Tillman has performed slightly better (0.9 WAR) than Guthrie (0.8) this season. Do you agree with that assessment? And would it be reasonable to say that given the listed data, the Orioles played much better defense with Guthrie on the mound?

(4) Is it possible to judge a pitcher's performance from a box score line? If so, consider two pitching box score lines: 8.0 IP, 5 H, 0 R, 2 BB, 3 K; 6.2 IP, 7 H, 2 R, 1 BB, 5 K. Which pitcher would you say performed better? (Pitcher 1's FIP: 3.20. Pitcher 2's FIP: 2.30)

IMO, pitchers DO have some control over balls hit in play: that includes LD%/GB% analysis as well as the possibility that something is going on with pitchers like Guthrie that we don't understand yet. IMO, Jeremy Guthrie has performed far better than Chris Tillman this season, so much so that I am very suspicious of any (comprehensive performance) stat that says otherwise. I think it is entirely possible that the O's play better defense for Guthrie (after all, they have played worse offense when Guthrie is pitching), but I haven't observed it myself and I haven't seen anyone else write about it here on the OH. Finally, I think we can tell a lot about a pitcher's performance from a box score line and I think it's obvious that the first pitcher did better. Even if we would predict that he wouldn't do as well in his next start and that the second pitcher ought to improve, that doesn't change the fact that in those two games, the first was the better pitcher.

Please understand that I am not a Jerseyoriole-type anti-stat guy. Statistics are part of baseball and a big reason why I enjoy the game. On another thread, I've pushed for the use of Shutdowns and Meltdowns as a tool to evaluate reliever performance; it's another Fangraphs metric and I think it may be very valuable. But not all statistics are good, and we need to be critical where we see reason to be so. When it comes to FIP, I have a lot of reservations, and since WAR is often quoted (especially by Fangraphs writers themselves) as the single best measurement of player performance, it's a serious issue.

Link to comment
Share on other sites

I've explained twice now why you need to factor out defense. If you don't, you double count the value provided by defense. Try it, if you want.

And, for probably the tenth time in this thread (by other posters), nobody says pitchers have no control over BABIP. But their control is limited for the most part to a pretty tight band in the .290 - .310 range, after you account for their batted ball distribution. The fact is, when you have such a specialized and relatively small sample of major league-caliber pitchers vs. major league-caliber hitters, you just don't see much ability for pitchers to control BABIP. And remember that when you perform any BABIP analysis of a pitcher, you have to account for defense, park, and luck to get to the contribution the pitcher makes via their skill. So that "something is going on" is true - it just doesn't matter very much in the grand scheme of things.

EDIT: Just read this. http://www.fangraphs.com/blogs/index.php/why-our-pitcher-war-uses-fip-part-two/

Link to comment
Share on other sites

Any theories as to how Guthrie does it? He's shown an ability to outperform his FIP for a few years and he's not a sinkerball pitcher. It's a large enough sample to say it's a skill instead of luck. I realize he's the exception.

Link to comment
Share on other sites

While I'm not disputing the general validity of BABIP analysis, I personally find it a bit too heavy-handed to be fully reliable. As many have pointed out, FIP and other metrics that seek to adjust for BABIP variability will often underrate pitchers who tend to induce lots of ground balls, which is undeniably a skill and not a product of random chance. A stat where it is known that certain tendencies will produce inaccurate results is not a perfect stat.

If the argument is fWAR vs. rWAR, I will take rWAR, for pitchers at least, because I prefer taking the entire body of results and then addressing the problems with it rather than throwing out a large chunk of results that may very well tell us something. rWAR is guilty of the latter as well to some extent, but not nearly as much as fWAR is. rWAR's analysis comes too far down the chain of events, while fWAR's is not far enough. A version of FIP that took into account GB and LD rates would be much closer to ideal than anything we've got now, in my opinion.

That said, I'm of the opinion that if there is an inconsistency regarding how one player's performance is perceived by different metrics, the response should be to try and find the cause of the inconsistency and use that to come to a more accurate perception of the player, rather than argue about which metric is better.

Link to comment
Share on other sites

But everyone involved is AWARE that a stat like WAR requires taking approximations and making estimates. You will always find inconsistencies in such a stat unless you're omniscient. So I'm arguing that using FIP is better than any current alternative. What are you arguing? Perfection or bust?

Link to comment
Share on other sites

Every respected analyst that I'm aware of says the same thing (the part I put in bold) - I'm not sure why you continue to make false accusations.

As I've been pointing out for about three years now, what separates Guthrie from his teammates, at least, is a seemingly innate ability to limit production on ground balls. In his career he has allowed a slash line of 208/208/225/443 on ground balls with a BAbip of 208. Compare that to his teammates: 253/253/275/528 with a BAbip of 253.

This is how he compares to his teammates on fly balls: 217/213/590/803 with a BAbip of 132, versus 229/223/613/836 with a BAbip of 140.

And on line drives: 715/710/994/1704 with a BAbip of 702, versus 726/725/978/1703 with a BAbip of 719.

His overall numbers on GBs, FBs, and LDs: 306/303/515/818 with a BAbip of 272.

Prorate his teammates' numbers so that they have the same batted ball split, and they have the following line: 331/328/543/871 with a BAbip of 293.
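For anyone who wants to try that proration step, here is a rough sketch of it as a weighted average. The per-type BAbip figures for the teammates are the ones quoted in the post; the batted ball counts are hypothetical placeholders, since the thread does not give Guthrie's actual split, and the function name is made up for illustration. It also treats every batted ball of a type as a ball in play, which is a simplification.

```python
def reweighted_rate(rate_by_type, mix):
    """Weighted average of per-type rates, weighted by the counts in `mix`."""
    total = sum(mix.values())
    if total <= 0:
        raise ValueError("mix must contain at least one batted ball")
    return sum(rate_by_type[kind] * count for kind, count in mix.items()) / total


# Teammates' BAbip by batted ball type, as quoted in the post.
teammates_babip = {"GB": 0.253, "FB": 0.140, "LD": 0.719}

# HYPOTHETICAL counts standing in for Guthrie's batted ball split;
# these numbers are NOT from the thread.
guthrie_mix = {"GB": 450, "FB": 400, "LD": 190}

prorated = reweighted_rate(teammates_babip, guthrie_mix)
print(f"Teammates' BAbip on Guthrie's mix: {prorated:.3f}")
```

Swapping in the real batted ball counts should give the comparison the post describes: what the teammates would have allowed if their ground ball, fly ball, and line drive shares matched Guthrie's.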

Isn't the problem here with fWAR, not with respected analysts? It's pretty clear that fWAR seems to largely neutralize any possible effect that a pitcher might have on BIP.

Link to comment
Share on other sites

Here's a quote from the Fangraphs explanation of FIP that SrMeowMeow linked to:

In the end, we had to choose between two different methods – assuming that the pitcher had no responsibility for the outcome of a ball in play, or attempting to approximate the amount of time that the result was due to the pitcher or the fielder.

And this one, as well:

FIP-based WAR, which is what we ended up using, essentially admits that we don’t have enough information about dividing responsibility for the results of balls in play, and so it ignores them.

I'm not trying to accuse "analysts"; I'm trying to argue that FIP is not a good way to measure production. FIP ignores balls hit into play (aside from their overall contribution to IP). Pitchers get credit for strikeouts and lose credit for walks and home runs. The composite is divided by their IP, then a correction factor is added to make it comparable to ERA. As a result, a pitcher who never allows a run can have a high FIP (at the extreme of three walks per inning with no Ks and no runs, the FIP would be 12.20, assuming a 3.20 correction factor).
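The calculation described above can be sketched in a few lines. This assumes the commonly published FIP weights (13 per home run, 3 per walk, minus 2 per strikeout) and a 3.20 correction factor, which is inferred from the figures quoted in this thread rather than taken from any official table; the real constant changes by season, and hit batters are left out for simplicity. The function names are illustrative only.

```python
def innings(ip_notation: str) -> float:
    """Convert box score innings ("6.2" means 6 and 2/3) to a real number."""
    whole, _, outs = ip_notation.partition(".")
    return int(whole) + (int(outs) if outs else 0) / 3


def fip(hr: int, bb: int, k: int, ip: float, constant: float = 3.20) -> float:
    """FIP sketch: weighted HR, BB and K per inning, plus a correction factor.

    The 3.20 default is an ASSUMPTION inferred from the thread's numbers.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * bb - 2 * k) / ip + constant


# 8.0 IP, 2 BB, 3 K, assuming no home runs were allowed.
print(round(fip(hr=0, bb=2, k=3, ip=innings("8.0")), 2))

# One walk per inning with no strikeouts and no home runs.
print(round(fip(hr=0, bb=1, k=0, ip=1.0), 2))
```

Note that hits, runs, and outs on balls in play never enter the formula except through innings pitched, which is exactly the property being argued about here.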

FIP attempts to factor out defense by ignoring balls hit into play. If pitchers have some amount of control over balls in play, isn't that a weakness of FIP? And therefore a reason why it might be a poor way to measure their performance? I don't understand why this is a hard point to grasp.

The Guthrie groundball production analysis is very interesting. Any idea why that might be the case? And do you know if any of the other pitchers who consistently outperform their FIP (Cain, Palmer, etc.) follow the same pattern?

Link to comment
Share on other sites

As I said earlier in the thread, there seems to be a far higher standard applied to newer, advanced metrics than other numbers. eb45 says "...will often underrate pitchers who tend to induce lots of ground balls". Even if that's true, so? It's a metric that makes a set of assumptions, states that up front, and anyone and everyone is free to point out the assumptions that they don't think make sense.

Every single metric does something similar, and most of the ones we just take at face value are often far worse. Almost any version of WAR is park adjusted and run context adjusted, almost no other commonly used metric is. By itself that makes WAR more valuable and less subject to misinterpretation. ERA+ and OPS+ are context adjusted, but make no attempt to separate out defense from pitching. So even if you disagree with the assumptions in one of the versions of WAR, you're probably going to fall back on a metric that makes an even worse assumption - that the pitcher is responsible for the quality of defense behind him.

I have absolutely no problem debating the merits of the assumptions and construction of any new metric. But I'd appreciate a discussion that doesn't devolve back to "WAR is completely useless and wrong" instead of "I think WAR undervalues Guthrie."

Link to comment
Share on other sites

I don't get it either, Drungo.

I keep seeing statements such as "FIP says pitchers have no control over batted balls" and that is 100% false. Yes, the original DIPS theory came right out and said that, but that was quickly modified. I look at FIP as sort of a second generation of DIPS (actually a new and improved version). I've seen Fangraphs quoted, well this is from their glossary: "pitchers have little control over balls in play, so a better way to assess a pitcher’s talent level is by looking at things a pitcher can control: strikeouts, walks, hit by pitches, and homeruns." Notice it doesn't use the words "no control."

As I stated earlier, FIP might not have been my first choice had I attempted to invent fWAR, even though I do understand why the folks at Fangraphs decided to use it. I'd like to see someone from Fangraphs do a comparison of fWAR values using metrics other than FIP. For instance, what would fWAR look like for a random group of pitchers with straight ERA used, or with xFIP, or with tERA, or even with BP's SIERA? I think it would be interesting to see the results side-by-side (along with the B-Ref version).

The bolded is an issue I have with using FIP as a component in WAR. WAR's goal is to measure value, not talent level. FIP is probably a better indicator of talent level than RA, but to measure value, as WAR seeks to, the results are more important than the talent level (or at least they should be). The appropriate way to measure value, in my opinion, is to take the results and figure out how much of them a player was responsible for. Talent level and value added are two different things, and while both are useful in analysis, they can't really be mixed together into one stat as fWAR does.

Using the Britton/Tillman example again, Britton's WPA is 1.51 and Tillman's is -0.29. Ignoring the flaws with WPA for a moment, what this tells us is that Britton helped his team win significantly more than Tillman did. Yet Britton only leads in fWAR by a margin of 0.1 (1.0 to 0.9). Tillman may have, as FIP suggests, pitched at a higher talent level, but in terms of value added, Britton was unquestionably far superior. Yet WAR does not show that, because rather than measuring what the player actually contributed to his team in terms of results, Fangraphs tries to measure quality in a stat that should not be about quality.

Link to comment
Share on other sites

You can't use WPA. It completely bypasses the question of replacement level. Two identical pitchers who happen to be put into two different situations and pitch just as well will have hugely different WPAs. Fangraphs doesn't use FIP as an attempt to measure talent level - they use it because, in their opinion, it's the best compromise available if you want to subtract defense from a pitcher's contributions.

Link to comment
Share on other sites

Of course you can't use WPA as a component in WAR. But for this particular example, it illustrates the point I was making.

And I get why Fangraphs uses FIP in WAR. I'm saying that they shouldn't. FIP is designed to measure how well a pitcher performed, not how much value he added. WAR is designed to measure how much value he added, not how well he performed.

If you want to subtract defense from a pitcher's contributions, then do just that, as B-R does. Don't remove a big chunk of contribution because of defense if you're seeking to measure contributions.

Link to comment
Share on other sites

My take is that fWAR attempts to describe how much value the player would have added in a neutral context, adjusted for most of the things that the player doesn't much control. It's still an assessment of value, just not the actual, contextual runs and wins added. I think it's a nice tool to bounce off of rWAR, WPA and other similar metrics that are more descriptive than inferential.

Clearly, the controversy exists BECAUSE fWAR is cited so often, and it is cited so often BECAUSE it is a vast improvement over previous metrics. It's not perfect, and it deserves to be scrutinized. The conversation is good. People should probably use fWAR as one of many methods of evaluating past performance and value.

Still, fWAR is clearly in the top tier of all-in-one value metrics.

Link to comment
Share on other sites

Archived

This topic is now archived and is closed to further replies.