  1. #1
    skanar (Plus Member since 10/12, Major League Starter)
    Join Date
    Aug 2008
    Posts
    1,588

    HHP: The Comprehensive Report on Statistical Prospecting (part 1 of hopefully and eventually 4)

    Introduction
    For a long time, I've been working on a project to use the statistics generated in the various minor leagues to predict how well a given prospect will do. I've posted in the past some early attempts at this. I have finally got a product that I consider worthy of publishing, at least here on the OH. It isn't finished or complete, yet, but this particular section is.

    I've studied every single hitting prospect who got significant playing time between 1991 and 2005 in four leagues: the Gulf Coast League, the New York-Penn League, the South Atlantic League, and the Carolina League. These are the leagues that the Orioles' minor league affiliates play in at their respective levels. From this set of data, I've made observations and drawn some preliminary conclusions. I've also developed a quick-and-dirty prediction method that separates prospects from non-prospects.

    This has been a pretty big project, and it isn't over. Next, I'm going to try to refine the prediction methods to be more accurate and more precise (not the same thing). After that, I'm going to try the same methods to predict pitchers. No estimate for when those will be done.


    Methods
    I used Baseball Reference to create a dataset that included every position player who played in the GCL, NYPL, SAL, or CARL between 1991 and 2005. I removed all players who had fewer than 150 plate appearances. I found those players who made the majors and their respective totals of MLB plate appearances and rWAR. I then categorized all the players with the following codes.
    -1: This prospect never played in the major leagues.
    0: This player has fewer than 600 MLB PAs (cup of coffee).
    1: This player has at least 600 MLB PAs, but either fewer than 1600 MLB PAs or <1 rWAR.
    2: This player has at least 1600 MLB PAs and 1-14 rWAR.
    3: This player has at least 1600 MLB PAs and at least 14 rWAR.
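
    The codes above can be written as a short function. This is only a sketch: the function names and the convention of passing None for a player who never reached the majors are assumptions made for illustration, and the boundary handling (exactly 1600 PA, exactly 1.0 or 14.0 rWAR) follows the "at least" wording in the list.

```python
from typing import Optional


def categorize(mlb_pa: Optional[int], rwar: float = 0.0) -> int:
    """Return the outcome category (-1, 0, 1, 2, or 3) for one player.

    mlb_pa is None for a player who never appeared in the majors
    (an assumed convention, used here only for illustration).
    """
    if mlb_pa is None:
        return -1  # never played in the major leagues
    if mlb_pa < 600:
        return 0   # cup of coffee
    if mlb_pa < 1600 or rwar < 1.0:
        return 1   # too little playing time or too little value
    if rwar < 14.0:
        return 2   # at least 1600 PA and 1-14 rWAR
    return 3       # at least 1600 PA and at least 14 rWAR


def is_success(category: int) -> bool:
    """A 'successful' prospect is anyone in category 2 or 3."""
    return category >= 2
```

    For example, a player with 1081 MLB PA and 2.2 rWAR lands in category 1, because the PA requirement is checked before rWAR.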

    I consider any player categorized as a "2" or a "3" to be a successful prospect. Even with this fairly low bar, only 5.65% of the 7,630 player-seasons in the dataset qualify as successful.

    Separate seasons by one player were treated as completely separate datapoints. If a player split his season between two leagues in one year, and got at least 150 PA in each, that was treated as two independent datapoints. If a player played at the same league in multiple years, that was also treated independently. This is not an ideal solution, but I think acceptable for now.

    During dataset construction, I removed any prospect who made the majors as a pitcher. It seems impossible to me to be able to predict the chance that a position-player-to-pitcher conversion will work from a prospect's hitting stats. But, I did not want to simply list them as failed hitters. So I simply excluded them from the dataset. This is probably not the best solution, but there were fewer than 10 of these players in total, anyway.

    When making predictions or developing systems using this data, it is important to note that the 150 PA cutoff is a completely hard line. Since players with fewer than 150 PA were excluded, I have quite literally NO idea how well they do. THIS DATASET CANNOT MAKE ANY PREDICTION ABOUT A PROSPECT WITH FEWER THAN 150 PA! No excuses, no exceptions.

    My basic method of analysis is to divide the prospects into groups based on known criteria, such as age, OPS, or K%. Attempting to perform direct regressions between, say, OPS and rWAR leads to poor results, since the massive number of minor leaguers who go nowhere overwhelms the actual prospects. Instead, I look to examine how well various categories of players do, then work from there. In this post, I'm not going very deep with prediction, and it's important to be careful.

    For instance, suppose a 21-year-old player in the New York-Penn League has an .834 OPS, a K% of 16.2%, and an ISO of .121 (these are, in fact, Trey Mancini's stats). We can look at the NYPL categorizations and find that his chances of success (cat. 2 or 3) given these values are 7.2% (age), 9.4% (OPS), 4.1% (K%), and 6.1% (ISO). BUT IN THE ABSENCE OF TESTING, THESE NUMBERS CANNOT BE COMBINED! Because there is a correlation between the predictive factors (e.g., high OPS usually also means high ISO), the predictions are not independent, and any method of combining them must be tested for accuracy, via, for instance, a Brier score. I've started trying some ways, and the best-performing so far is a weighted geometric mean taking only age and OPS into account - but this is a topic for a future post.

    Until I (or someone else) have this sort of evaluation thoroughly tested, IT IS WRONG TO COMBINE PREDICTIONS! Either report them separately, or rely only on age and OPS, which usually have the highest accuracy.

    In any event, the main point here is to present the base statistics, the success rates broken down by components, and my attempt at a first pass to determine who should qualify as a prospect, period.


    Basic Results
    There were 7,630 player-seasons analyzed in this dataset. Of these, 6,140 (80.5%) failed to make the major leagues at all. Another 781 (10.2%) had fewer than 600 PA in the majors, and are in category 0 - the "cup of coffee" category. Another 278 (3.6%) reached 600 PA but were active for less than about 3 seasons (under 1600 PA) or failed to produce significantly above replacement level (<1 rWAR); these players are in category 1, loosely defined as AAAA players. Examples include Nolan Reimold (who has 2.2 rWAR but only 1081 PA), Wily Mo Pena (has 1845 PA but -1.8 rWAR), Geronimo Gil (887 PA, 1.1 rWAR), and Jeff Keppinger (3048 PA, 0.5 rWAR). Decent starters, who have at least 1600 MLB PAs and at least 1.0 rWAR, but less than 14.0 rWAR, are in category 2, which includes players like Cesar Izturis, Russell Branyan, Hank Blalock, Rajai Davis, Mike Morse, and Nate McLouth. Stars (category 3) are players with at least 1600 MLB PAs and at least 14 rWAR; this includes Magglio Ordonez, Adrian Beltre, and Nick Markakis, and, at the low end, Franklin Gutierrez, Joe Crede, and Edwin Encarnacion.

    Some players are still active and thus able to increase or decrease their rWAR. In the vast majority of cases, these players are already category 2 or 3. The relatively small number of borderline players who are active (Jeff Keppinger is a good example) is an issue for this analysis, but I think a fairly small one.

    The category divisions are somewhat arbitrary, especially the line between 2 and 3. However, for the most part I am only concerned with determining WHETHER a prospect will "succeed" or not, which I define as belonging to either category 2 or category 3, rather than trying to determine just how good he'll be if he does. I am also somewhat concerned with predicting who will make the majors at all. Luckily, the three chances tend to be rather parallel: a higher chance of stardom is indicative of a higher chance of making the majors, and vice versa.

    This suggests to me the first important conclusion: the difference between "ceiling" and "floor" is probably overemphasized by prospect analyzers. You'll often read that a particular prospect has "limited upside" but a "high floor." But it's pretty unlikely to find a prospect that has any kind of floor at all. Prospects fail all the time, and unless you're talking about someone ranked in the national top 25-50 or so (see my post from about 1.5 years ago on the BA top 100 ranking), they're more likely to fail than to succeed. It's possible that these guys are trying to distinguish the future perennial all-star Robinson Canos of the world from the Nick Markakises, but that's a cut that goes pretty fine. Any prospect below AA can fail and is in fact reasonably likely to do so. In most cases, you can predict a prospect's chance of success, then divide it evenly between "future starter" and "future star." One caveat applies: if the prospect is in the absolute YOUNGEST age category for his league (17 and under in the GCL, 18 and under in the SAL or NYPL, 19 and younger or 20 in the CARL), their chance of stardom is a significantly greater portion of their chance of success.

    Below is the first of many graphs, showing prospect success rates by league. This is one of the best ways I've got to show just how few guys actually make it. The blue bar represents the percent that never see the majors; successful players are the purple and teal bands at the top. The other major lesson here is that the NYPL is pretty weak: it has the lowest rates of players both making the majors and having success, presumably due to the presence of non-toolsy college players. We also see the relevance of what is usually called "major-league readiness:" the Carolina League sends the most players and has the highest success rate. (Also, my preliminary Eastern League data, which isn't published here, suggests that the success rate there is even higher.)



    The most important factor in determining prospect success is age relative to league. The next graph shows just the success rate (chance of being in category 2 or 3) for each age/league combination. First, follow the bars of each color, as they decay to the right: this shows that for any league, an older player is less likely to be successful. Then, look at each individual age, and note how much being in a higher league improves a prospect's chance of success.

    One note on this graph: There are nowhere near enough 17-year-old SAL or NYPL players, or enough 17 or 18-year-old CARL players, to plot. They are lumped in with the youngest age category for their respective league. Likewise, the GCL tops out at age 22, the NYPL at age 24, and the SAL and CARL at age 25, with any older players lumped into those values. Only the Carolina League had ANY successful players come out of its highest age category.

    Also note that the highest success rate is only about 30%, for the youngest players in both the SAL and CARL. Without taking performance into account, this is as good as statistical prediction can get. Of course, the goal is to take performance into account.



    As an extension to the previous graph, I'm going to present 4 more, one for each league, that show the prospect outcome breakdown by age. The bottom two bars of each column of these graphs add to form the columns in the above graph. Note that the chance a prospect of the given age/league did NOT make the majors is given by one minus the total height of the column - so a 20-year-old in the GCL has about a 90% chance of missing the majors.






    Looking at these graphs, we can develop an answer to a very important question: how old is too old for each league? At what age does good performance no longer really matter?

    Your answer to this question will depend on what level of risk you want. At the most extreme, you'd probably pick an age with a literal 0% chance of success. These ages are:
    GCL: 22
    NYPL: 24
    SAL: 25
    CARL: 25

    But we could also go with ages that just severely restrict a prospect's chance. Feel free to use the graphs and decide for yourself! My own preference is for GCL 21, NYPL 22, SAL 23, CARL 24.

    In addition to investigating the relevance of age, I also investigated the relevance of performance, looking at 5 key statistics for each prospect-season: OPS, K%, BB%, ISO, and PA. Plotting these numbers against, say, rWAR is an approach that rarely gives good results, because the immense number of zeroes overwhelms the signal. Instead, I group these statistics into categories, then find the rate of each result for each category.

    I used four categorization methods. First, I grouped by same-distance changes in the predictive statistic. For example, the OPS groupings were <.600, .600-.699, .700-.799, .800-.899, and >.899. Second, I grouped by three different same-size methods: quintiles, sextiles, and octiles. I investigated the resulting rates by hand. I also plotted the values and attempted to generate a line of best fit, using each statistic to predict the odds of MLB success.
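
    As a rough illustration of the same-distance grouping, here is a minimal Python sketch that assigns each player-season to one of the OPS groups listed above and computes the success rate per group. The function names are hypothetical, and the rows in `toy` are made up purely to show the mechanics; they are not taken from the dataset.

```python
from bisect import bisect_right
from collections import defaultdict

# Same-distance OPS groups from the text: <.600, .600-.699, ..., >.899
OPS_EDGES = [0.600, 0.700, 0.800, 0.900]
OPS_LABELS = ["<.600", ".600-.699", ".700-.799", ".800-.899", ">.899"]


def ops_bin(ops):
    """Return the label of the OPS group that a value falls in."""
    return OPS_LABELS[bisect_right(OPS_EDGES, ops)]


def success_rate_by_bin(player_seasons):
    """player_seasons: iterable of (ops, category) pairs.

    Returns {label: (successes, total, rate)}, where a success is
    category 2 or 3.
    """
    counts = defaultdict(lambda: [0, 0])
    for ops, category in player_seasons:
        tally = counts[ops_bin(ops)]
        tally[0] += int(category >= 2)
        tally[1] += 1
    return {label: (s, n, s / n) for label, (s, n) in counts.items()}


# Made-up rows, NOT real players.
toy = [(0.550, -1), (0.640, -1), (0.655, 0), (0.910, 3), (0.905, -1), (0.720, 2)]
rates = success_rate_by_bin(toy)
```

    The same-size groupings (quintiles, sextiles, octiles) would replace the fixed edges with cut points taken from the sorted data.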

    A few conclusions could be drawn here:
    1) Higher OPS is a very good sign for any prospect in any league. Generally, being in the top OPS category (over .899) gave a prospect the same odds of success as being in the second-youngest age category for that league. Also, very, very few prospects succeed after even one bad year. Of the 1138 prospects that had a <.600 OPS, only 14 (1.2%) succeeded. A low OPS is most forgivable in the GCL: poor-performing prospects there had a 1.8% chance of success; in the other three leagues, it's 0.8%.

    2) Strikeouts matter, and high strikeouts are worst at the lowest levels. No GCL or NYPL prospect succeeded after striking out over 28% of the time. Generally, the fewer strikeouts, the better; however, being in the lowest strikeout category does not boost odds of success as high as being in the highest OPS category.

    3) Walks don't matter. There is near-zero correlation between a prospect's walk rate and his chances of future success. In one or two cases, it appears that walk rates at both extremes (below 5% and above 15%) reduce chances of success, but I suspect this is a sample size artifact, and even if it is real, the effect is small. I believe that a prospect's walk rate can safely be ignored.

    4) ISO is indicative, but probably unhelpful. Higher ISOs generally tend to correlate with improved chances of success. However, the effect isn't enormous, and small ISOs do not tend to reduce chances nearly as much as small OPS or high K%. I suspect that the ISO effect is simply a statistical echo of the OPS effect, since players with high OPS usually have a high ISO. It's possible that ISO could be useful as a predictor (I can't rule it out), but I don't think it's especially good.

    5) Incredibly, the number of PAs a prospect gets is quite a good predictor of their odds of success. The usual pattern is that PA values below the median or so (180 PA in the GCL, 220 PA in the NYPL, 400 PA in the SAL or CARL) are unpredictive (best fit is a flat line), then the chance of success rapidly increases from that PA value going up. I don't have a good explanation for this: perhaps it's indicative of athletic guys who don't get injured, perhaps teams give their best players the most PA, and perhaps it's a sign that prospects spending an entire year at one level is a superior development system to that of midseason promotions. But the effect is definitely there and, at the high end, quite strong.

    The best-fit graphs for OPS and K% are presented here. The fact that poor performance is more forgivable at the lower levels is clear, as is the fact that high strikeouts are less forgivable at the lower levels. The NYPL fits are very weird, which I think is due to the fact that the base rate of success is lowest in the NYPL. Still, it's a little surprising that it would be so different.




    And here are the equations used to make those lines. Note that they are least reliable at the edges, where there are fewer points to constrain them. I especially wouldn't trust the NYPL predictions on the extremes. For the K% equations, input the K% as a decimal (e.g., .20 for 20%).

    Chance of success from OPS:
    GCL: 0.2945*[OPS]-0.144
    NYPL: 0.00007*e^(8.6027*[OPS])
    SAL: 0.8043*[OPS]^2-0.6694*[OPS]+0.1273
    CARL: 0.5417*[OPS]-0.3121

    Chance of success from K%:
    GCL: -0.6636*[K%]+0.1802
    NYPL: -61.985*[K%]^3+41.856*[K%]^2-9.4232*[K%]+0.7306
    SAL: 1.4188*[K%]^2-1.0779*[K%]+0.2106
    CARL: -0.628*[K%]+0.1876
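
    For convenience, the eight best-fit equations above can be transcribed into Python as follows. The coefficients are copied straight from the list; the function names and the clamping of results to the 0-1 range are assumptions added here (the straight-line fits go negative at very low OPS or very high K%, which makes no sense as a probability).

```python
import math

# Best-fit chance of success as a function of OPS, by league.
OPS_FIT = {
    "GCL": lambda ops: 0.2945 * ops - 0.144,
    "NYPL": lambda ops: 0.00007 * math.exp(8.6027 * ops),
    "SAL": lambda ops: 0.8043 * ops**2 - 0.6694 * ops + 0.1273,
    "CARL": lambda ops: 0.5417 * ops - 0.3121,
}

# Best-fit chance of success as a function of K%, entered as a
# decimal (0.20 for 20%), by league.
K_FIT = {
    "GCL": lambda k: -0.6636 * k + 0.1802,
    "NYPL": lambda k: -61.985 * k**3 + 41.856 * k**2 - 9.4232 * k + 0.7306,
    "SAL": lambda k: 1.4188 * k**2 - 1.0779 * k + 0.2106,
    "CARL": lambda k: -0.628 * k + 0.1876,
}


def _clamp(p):
    """Keep a fitted value inside [0, 1] (an added assumption)."""
    return max(0.0, min(1.0, p))


def success_from_ops(league, ops):
    return _clamp(OPS_FIT[league](ops))


def success_from_k(league, k_rate):
    return _clamp(K_FIT[league](k_rate))
```

    As with the equations themselves, the outputs are least trustworthy at the extremes, and the OPS and K% results should not be combined without testing.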


    What About Using These In Combination?
    The obvious next step is to try to figure out how these predictors overlap. If being young for the league is good, and having a high OPS is good, how good is having both at the same time? Unfortunately, this is tough to do. At the extremes, the sample sizes tend to be small, and when you have two intersecting extremes, they get so small as to preclude this group-and-calculate-percentages method from being effective. For instance, there are only 63 players in the Sally League age 18 and younger, and of those, only 2 had an OPS above .900: one was Adrian Beltre (cat 3) and one was Delmon Young (cat 2). I'm not comfortable taking a 2-for-2 result and predicting 100% chance of success for future over .900, under 19 SAL players. Likewise, seven players in this group had an OPS between .800 and .900. Five succeeded; two failed. Can we then reliably predict a 71% success rate? I don't think so.

    This sample size problem is frequently compounded by low rates of success. Even in a 150-person sample, if there are only 2 successes, the estimated success rate remains highly uncertain: one more or one fewer success would change it by half.
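
    One way to put numbers on this uncertainty is a confidence interval for a proportion. The sketch below uses the Wilson score interval, which is a standard choice for small samples and rates near 0 or 1; it is offered as an illustration and is not something the analysis in this post actually used. The function name and the default z of 1.96 (roughly a 95% interval) are assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Returns (low, high) as fractions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# The small groups discussed above: 2-for-2, 5-for-7, and 2-for-150.
for successes, n in [(2, 2), (5, 7), (2, 150)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: {low:.1%} to {high:.1%}")
```

    The intervals for the 2-for-2 and 5-for-7 groups come out very wide, which is the reason for not trusting a 100% or 71% prediction from them.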

    There are possible solutions; one is to add uncertainties based on the size of the age/league/OPS/K% category a given prospect falls in; another is to avoid grouping so severely and use the best-fit predictors instead. Ideally, we'll test some of these against each other to generate full sets of predictions that can be compared via Brier score. This is the subject of my current research.

    There are also issues of independence. As discussed above for ISO, many of the performance indicators correlate with each other: high K% usually goes hand-in-hand with low OPS, for example. When generating predictions, this reinforcement effect must be accounted for and removed. Testing via Brier score should reveal when overlapping indicators are reinforcing each other to the point that they reduce the effectiveness of the prediction.
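
    For readers who haven't seen it, the Brier score is just the mean squared difference between each predicted probability and what actually happened (1 for success, 0 for failure); lower is better. A minimal sketch, with a hypothetical function name:

```python
def brier_score(predictions, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(predictions) != len(outcomes) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - o) ** 2 for p, o in zip(predictions, outcomes))
    return total / len(predictions)


# Two competing prediction sets for the same four made-up outcomes.
outcomes = [0, 0, 0, 1]
flat = [0.25, 0.25, 0.25, 0.25]   # everyone gets the base rate
sharp = [0.05, 0.05, 0.10, 0.60]  # a method that separates the players
print(brier_score(flat, outcomes), brier_score(sharp, outcomes))
```

    Comparing two prediction methods then means generating both full sets of predictions for the same players and seeing which one scores lower.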


    Additional Issues
    One of the biggest problems with this method of prospect evaluation is that it ignores defensive and positional value. Many of the prospects who succeed with worse minor league performance are shortstops or catchers (Jhonny Peralta struck out a ton one year). Taking this into account is very difficult.

    As always, there is no end-all, be-all method of predicting prospects. Scouting is important, and the predicted chance of success will always be a guideline that can be adjusted by perceived defensive ability and/or toolsiness. I would be wary of making BIG adjustments, however. Shifts of 1% are actually quite large, and shifts of 8% are enough to move someone from the fringes to be a top prospect.


    Here's a Recap of the Most Important Conclusions:
    1) Very few prospects make it - for prospects below AA, there is no "floor" of solid-regular, just about ever, unless the prospect is very young.
    2) Age/league is most important.
    3) There is such a thing as too old.
    4) Performance also matters: OPS and K% are important, BB% is not.
    5) The New York-Penn League is weaker and harder to predict than the others.
    6) PAs have an effect, but the cause is unclear.


    Quick-and-Dumb Prediction System
    Ideally, I would like to test a variety of prediction methods by generating full prediction sets and comparing Brier scores for each. Hopefully I'll find the time to do this eventually. Until then, I've got a first step ready.

    The idea is this: we can identify at the very least those players who are very unlikely to succeed and remove them from the pool. Then we can recalculate the success rate for the remaining prospects, and simply use that as the prediction. In essence, it's a simple 2-bin categorization. As a plus, I should be able to use it as I generate a prediction system, as I can ignore those who have already been identified as non-prospects.

    My goal was to have literally zero ultimately successful players in the non-prospect pool; in this, I did not succeed. However, I did manage to keep the non-prospect success rate below 1% in all cases, and there were so few ultimately successful non-prospects that I'm actually able to list all of them.

    So here's the set of criteria that make a given player a non-prospect. Note that meeting ANY of these criteria is enough to rule a player out. The successful players identified as non-prospects by the system are listed by the criterion that ruled them out.

    Code:
    GCL:
    Any player age 22 or older
    Any player age 21 with <.800 OPS
    Any player age 20 with <.750 OPS (causes miss Melvin Mora)
    Any player age 19 with <.650 OPS (causes miss Laynce Nix)
    Any player age 18 or 17 with <.540 OPS
    Any player age 21 who strikes out over 12% of the time
    Any player age 20 who strikes out over 13% of the time (causes miss Travis Hafner)
    Any player age 19 who strikes out over 15% of the time
    Any player age 18 or 17 who strikes out over 24% of the time (causes miss Garrett Jones)
    No restrictions for players under 17
    
    NYPL:
    Any player age 23 or older (causes misses Ben Zobrist and Jose Macias)
    Any player age 22 with <.650 OPS
    Any player age 21 with <.640 OPS (causes misses Matt Diaz and Toby Hall)
    Any player age 20 or 19 or 18 with <.575 OPS
    No OPS restrictions for players under 18
    Any player age 22 who strikes out over 25% of the time
    Any player age 21 or less who strikes out over 28% of the time
    
    SAL: 
    Any player age 24 or older (causes misses Ben Zobrist, Luke Scott, and Jose Macias)
    Any player age 23 with <.850 OPS (causes miss Nyjer Morgan)
    Any player age 22 with <.675 OPS (causes miss Tony Womack)
    Any player age 21 or 20 with <.650 OPS (causes misses Kevin Stocker, Endy Chavez, and Chris Woodward)
    Any player age 23 who strikes out over 18% of the time
    Any player age 22 who strikes out over 28% of the time*
    Any player age 21 who strikes out over 21% of the time (causes miss Travis Hafner)
    Any player age 20 who strikes out over 31% of the time
    No restrictions for players under 20
    
    *If you set it to 21%, you miss Ryan Howard, Brad Hawpe, Fred Lewis, and Travis Hafner (and eliminate 90 misses). Notably, all four were drafted out of college, and, except for Hafner, in their second pro year.
    
    CARL:
    Any player age 24 or older (causes misses Ben Zobrist, Luke Scott 2002, Luke Scott 2003, Luke Scott 2004, Nyjer Morgan)
    Any player age 23 with <.700 OPS
    Any player age 22 with <.620 OPS
    Any player age 21 or 20 with <.580 OPS
    Any player age 23 who strikes out over 24% of the time
    Any player age 22 who strikes out over 25% of the time
    Any player age 21 or 20 who strikes out over 27% of the time
    No restrictions for players under 20

    In most cases, this all makes a lot of sense. The older the player, the better the performance must be to retain prospect status. The only oddity applies to players who were 22 in the Sally League and struck out a lot - but I'm hopeful future work that breaks out recently drafted college players will help there.
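
    The criteria above translate fairly directly into a lookup table. Here is a minimal Python sketch; the thresholds are copied from the lists, but the table layout, the function names, the order in which the checks run (age, then OPS, then Ks), and the use of decimals for K% are all assumptions made for illustration.

```python
MIN_PA = 150  # the hard cutoff: no prediction below this

# Age at which a player is ruled out regardless of performance.
TOO_OLD = {"GCL": 22, "NYPL": 23, "SAL": 24, "CARL": 24}

# (youngest age, oldest age, threshold). Ages not covered by any
# range have no restriction of that type.
OPS_FLOOR = {
    "GCL": [(21, 21, 0.800), (20, 20, 0.750), (19, 19, 0.650), (17, 18, 0.540)],
    "NYPL": [(22, 22, 0.650), (21, 21, 0.640), (18, 20, 0.575)],
    "SAL": [(23, 23, 0.850), (22, 22, 0.675), (20, 21, 0.650)],
    "CARL": [(23, 23, 0.700), (22, 22, 0.620), (20, 21, 0.580)],
}
K_CEILING = {
    "GCL": [(21, 21, 0.12), (20, 20, 0.13), (19, 19, 0.15), (17, 18, 0.24)],
    "NYPL": [(22, 22, 0.25), (0, 21, 0.28)],
    "SAL": [(23, 23, 0.18), (22, 22, 0.28), (21, 21, 0.21), (20, 20, 0.31)],
    "CARL": [(23, 23, 0.24), (22, 22, 0.25), (20, 21, 0.27)],
}


def _threshold(table, age):
    """Return the threshold that applies at this age, or None."""
    for youngest, oldest, value in table:
        if youngest <= age <= oldest:
            return value
    return None


def non_prospect_reason(league, age, pa, ops, k_rate):
    """Return 'age', 'OPS', or 'Ks' for a non-prospect, None for a prospect."""
    if pa < MIN_PA:
        raise ValueError("no prediction is possible below 150 PA")
    if age >= TOO_OLD[league]:
        return "age"
    floor = _threshold(OPS_FLOOR[league], age)
    if floor is not None and ops < floor:
        return "OPS"
    ceiling = _threshold(K_CEILING[league], age)
    if ceiling is not None and k_rate > ceiling:
        return "Ks"
    return None
```

    For instance, `non_prospect_reason("CARL", 22, 400, 0.672, 0.256)` returns `"Ks"`, since a 22-year-old in the Carolina League is ruled out above a 25% strikeout rate.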

    Once you've applied these criteria, you get the following breakdown, given as
    League: success rate for non-prospects, success rate for prospects, percent of ultimately successful players identified as non-prospects

    GCL: 0.54% (4/736), 14.4% (67/466), 5.6% missed
    NYPL: 0.51% (4/785), 6.8% (80/1176), 4.8% missed
    SAL: 0.54% (9/1676), 12.9% (156/1207), 5.5% missed
    CARL: 0.59% (5/841), 14.3% (106/743), 4.5% missed
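
    For anyone checking the arithmetic, the three percentages on each line come straight from the four counts. In particular, "percent missed" is the number of successful non-prospects divided by ALL successful players in that league, not by the number of non-prospects. A small sketch (the function and variable names are just for illustration):

```python
# (successful non-prospects, all non-prospects,
#  successful prospects, all prospects), copied from the table above
COUNTS = {
    "GCL": (4, 736, 67, 466),
    "NYPL": (4, 785, 80, 1176),
    "SAL": (9, 1676, 156, 1207),
    "CARL": (5, 841, 106, 743),
}


def two_bin_summary(non_succ, non_total, pro_succ, pro_total):
    """Success rates for each bin plus the share of successes missed."""
    return {
        "non_prospect_rate": non_succ / non_total,
        "prospect_rate": pro_succ / pro_total,
        "share_missed": non_succ / (non_succ + pro_succ),
    }


for league, counts in COUNTS.items():
    s = two_bin_summary(*counts)
    print(
        f"{league}: {s['non_prospect_rate']:.2%}, "
        f"{s['prospect_rate']:.1%}, {s['share_missed']:.1%} missed"
    )
```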

    I think that's pretty good! We miss about 5% of the ultimately successful players, but we rule out huge numbers of those who are unsuccessful. And it's worth noting that many of the same names keep showing up, and often have unusual circumstances associated with them: of the 22 misses, Ben Zobrist is 3 of them (older college player, debuted at 23); Luke Scott is 4 of them (older college player who then needed time to recover from TJ surgery); Nyjer Morgan is 2 of them; and Travis Hafner is 2 of them. Of course, people often look for excuses, and much of the strength of this method comes from ignoring narratives that are often built by people who are rooting for a particular prospect. So we have to accept the 5% miss rate, at least for now. Hopefully I can do better when I try to work out a comprehensive, prospect-by-prospect prediction system.

    Though it isn't designed to, and there is certainly a different and better set of criteria out there, we can also look at how this prediction method does at telling us who will make the majors. I find these data less useful, since who cares if someone got 40 PA over two seasons with -0.1 rWAR? But anyway, here are the results, given as league: share of non-prospects who made the majors, share of prospects who made the majors:

    GCL: 7.1% (52/736), 33.5% (156/466)
    NYPL: 6.0% (47/785), 22.5% (265/1176)
    SAL: 7.0% (118/1676), 37.1% (448/1207)
    CARL: 9.3% (78/841), 43.9% (326/743)

    So that's the quick and dumb prediction method. Use the above criteria to decide whether a player falls into the "prospect" or "non-prospect" category, then predict the corresponding chance of success. Remember, the player MUST have at least 150 PAs within a single league.


    List of Prospects in 2006 Carolina League/list of non-prospects who made majors
    As a means of testing this prediction system, I applied it to the Carolina League in 2006. Below are two lists, one of all the players identified as prospects, the other of all those players identified as non-prospects who have made the majors.

    Code:
    Prospects (* indicates made majors, + indicates success)
    Jeff Corsaletti
    *+Jed Lowrie
    Ian Bladergroen
    Jeff Natale
    Scott White
    *+Jacoby Ellsbury
    Jay Johnson
    *Luke Montz
    *Roger Bernadina
    Marvin Lowrance
    *+Ian Desmond
    *Steve Pearce
    *Brian Bixler
    *Neil Walker
    *Brent Lillibridge
    *Nolan Reimold
    Paco Figueroa
    Dustin Yount
    Stephen Head
    *Jordan Brown
    Micah Schilling
    *Jose Costanza
    *Trevor Crowe
    *Wyatt Toregas
    *Drew Sutton
    Francisco Caraballo
    Beau Torbert
    Ole Sheldon
    Van Pope
    *Matt Young
    *Clint Sammons
    Steve Doetsch
    *Diory Hernandez
    *Brandon Jones
    Kala Ka'aihue
    Sean Smith
    *Donny Lucy
    Jose De Los Santos
    Chris Amador
    
    Non-Prospects who Made Majors (+ indicates success)
    Tony Blanco
    +Nyjer Morgan
    Argenis Reyes
    Brian Barton
    Edwin Maysonet
    
    Non-Prospects who didn't
    7 on Wilmington
    6 on Potomac
    10 on Lynchburg
    10 on Frederick
    6 on Kinston 
    8 on Salem
    6 on Myrtle Beach
    8 on Winston-Salem

    So of the 105 players who had at least 150 PA in the Carolina League that year, I identify 39 prospects and 66 non-prospects. Of the non-prospects, only 5 made the majors (8%) and only 1 was a successful player (1.5%). Of the prospects, 20 made the majors (51%) and 2 were successful (5.1%). The success rate for the prospects is much lower than I would like, but this was a bit of a weak year for the Carolina League, and there is still the chance that some of these players could perform enough in the future to move them to the successful category (Reimold, for instance, needs about 550 more PA to cross the threshold for success).



    2012 Orioles prospects, 2013 Orioles prospects so far
    Using the prediction method, I also went and looked at the Orioles' minor league hitters' statistics from last season (2012) and also how they've done so far this season, and gave them the appropriate prediction after sorting. Remember that this is a quick-and-dumb prediction method! Don't try to read too much into the numbers. It's just a sort. Also, it's only those hitters with at least 150 PAs, which limits the numbers for the 2013 GCL and Aberdeen. And of course the 2013 numbers will change between now and the end of the season.

    The early results from my Brier testing suggest that a decent result can be achieved by taking the age/league chance of success, giving it a weight of .6, and the OPS best-fit chance of success, giving it a weight of .4, and finding the mean. I'm presenting that chance of success as well, just for illustration purposes.
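
    As a sketch of what that weighting looks like in code: the 0.6 and 0.4 weights come from the paragraph above, but everything else here is an assumption. In particular, the age/league success rate has to be read off the graphs (it is not tabulated in this post), so it is simply an input, and since a weighted geometric mean was mentioned earlier while this paragraph only says "mean", both versions are shown without claiming which one produced the listed numbers.

```python
def weighted_mean(p_age, p_ops, w_age=0.6, w_ops=0.4):
    """Weighted arithmetic mean of the two chances of success."""
    return w_age * p_age + w_ops * p_ops


def weighted_geometric_mean(p_age, p_ops, w_age=0.6, w_ops=0.4):
    """Weighted geometric mean; assumes the weights sum to 1."""
    return (p_age ** w_age) * (p_ops ** w_ops)


# Hypothetical inputs: a 10% age/league chance and a 20% OPS-fit chance.
print(weighted_mean(0.10, 0.20))
print(weighted_geometric_mean(0.10, 0.20))
```

    The geometric version is never higher than the arithmetic one, and it pulls the result toward the smaller of the two inputs.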

    Code:
    2012
    GCL
    Adrian Marin: age 18, .698 OPS, 17.6% K. PROSPECT: 14%/7.9%
    Manuel Hernandez: age 19, .651 OPS, 20.8% K. NON-PROSPECT (Ks): 0.5% 
    
    Aberdeen
    Torsten Boss: age 21, .774 OPS, 19.3% K. PROSPECT: 7%/4.7%
    Anthony Vega: age 21, .623 OPS. NON-PROSPECT (OPS): 0.5%
    Lucas Herbst: age 21, .690 OPS, 15.2% K. PROSPECT: 7%/3.5%
    Joel Hutter: age 22, .648 OPS. NON-PROSPECT (OPS): 0.5%
    Creede Simpson: age 22, .647 OPS. NON-PROSPECT (OPS): 0.5%
    Sam Kimmel: age 22, .745 OPS, 10.9% K. PROSPECT: 7%/3.1%
    Will Howard: age 23. NON-PROSPECT (age): 0.5%
    Cameron Edman: age 24. NON-PROSPECT (age): 0.5%
    
    Delmarva
    Gabriel Lino: age 19, .645 OPS. PROSPECT: 13%/11.3%
    Nick Delmonico: age 19, .762 OPS. PROSPECT: 13%/13.4%
    Connor Narron: age 20, 486 PA, .653 OPS, 20.8% K. PROSPECT: 13%/6.4%
    Glynn Davis: age 20, 465 PA, .644 OPS. NON-PROSPECT (OPS): 0.5%
    Wynston Sawyer: age 20, .609 OPS. NON-PROSPECT (OPS): 0.5%
    Michael Ohlman: age 21, .868 OPS, 12.9% K. PROSPECT: 13%/9.0%
    Jason Esposito: age 21, 512 PA, .537 OPS. NON-PROSPECT (OPS): 0.5%
    Brenden Webb: age 22, 409 PA, .878 OPS, 26.4% K. PROSPECT: 13%/8.2%
    Mychal Givens: age 22, .625 OPS. NON-PROSPECT (OPS): 0.5%
    Michael Planeta: age 22, .687 OPS, 23.8% K. PROSPECT: 13%/3.7%
    Sammie Starr: age 24. NON-PROSPECT (age): 0.5%
    
    Frederick
    Garabez Rosa: age 22, .630 OPS, 22.4% K. PROSPECT: 14%/5.1%
    Michael Mosby: age 22, .672 OPS, 25.6% K. NON-PROSPECT (Ks): 0.5%
    John Ruettiger: age 22, .702 OPS, 11.6% K. PROSPECT: 14%/6.7%
    Justin Dalles: age 23, .629 OPS. NON-PROSPECT (OPS): 0.5%
    Trent Mummey: age 23, .659 OPS. NON-PROSPECT (OPS): 0.5%
    Ty Kelly: age 23, .973 OPS, 12.7% K. PROSPECT: 14%/10.5%
    Jeremy Nowak: age 24. NON-PROSPECT (age): 0.5%
    Steve Bumbry: age 24. NON-PROSPECT (age): 0.5%
    Joe Oliveira: age 24. NON-PROSPECT (age): 0.5%
    Kipp Schutz: age 24. NON-PROSPECT (age): 0.5%
    Aaron Baker: age 24. NON-PROSPECT (age): 0.5%
    Bobby Stevens: age 25. NON-PROSPECT (age): 0.5%
    Michael Flacco: age 25. NON-PROSPECT (age): 0.5%
    
    
    2013
    GCL
    Ronarsy Ledesma: age 20, .757 OPS, 16.0% K. NON-PROSPECT (Ks): 0.5%/4.4%
    Andrickson Zorilla: age 22. NON-PROSPECT (age): 0.5%
    Justin Viele: age 22. NON-PROSPECT (age): 0.5%
    
    Aberdeen
    Hector Veloz: age 19, 31.1% K. NON-PROSPECT (Ks): 0.5%
    Trey Mancini: age 21, .824 OPS, 15.9% K. PROSPECT: 7%/5.8%
    Jared Breen: age 22, .549 OPS. NON-PROSPECT (OPS): 0.5%
    Connor Bierfeldt: age 22, .788 OPS, 23.6% K. PROSPECT: 7%/3.8%
    Mike Yastrzemski: age 22, .835 OPS, 17.8% K. PROSPECT: 7%/5.1%
    Jeff Kemp: age 23. NON-PROSPECT (age): 0.5%
    Sam Kimmel: age 23. NON-PROSPECT (age): 0.5%
    
    Delmarva
    Adrian Marin: age 19, .686 OPS. PROSPECT: 13%/11.9%
    Roderick Bernadina: age 20, .615 OPS. NON-PROSPECT (OPS): 0.5%/5.8%
    Wynston Sawyer: age 21, .740 OPS, 16.4% K. PROSPECT: 13%/5.9%
    Greg Lorenzo: age 22, .607 OPS. NON-PROSPECT (OPS): 0.5%
    Torsten Boss: age 22, .670 OPS. NON-PROSPECT (OPS): 0.5%
    Lucas Herbst: age 22, .721 OPS, 17.1% K. PROSPECT: 13%/4.3%
    Nik Balog: age 23, .690 OPS. NON-PROSPECT (OPS): 0.5%
    Creede Simpson: age 23, .707 OPS. NON-PROSPECT (OPS): 0.5%
    Joel Hutter: age 23, .640 OPS. NON-PROSPECT (OPS): 0.5%
    Tucker Nathans: age 24. NON-PROSPECT (age): 0.5%
    
    Frederick
    Nick Delmonico: age 20, .819 OPS. PROSPECT: 14%/17.9%
    Glynn Davis: age 21, .604 OPS, 18.6% K. PROSPECT: 14%/9.2%
    Christian Walker: age 22, .822 OPS, 17.2% K. PROSPECT: 14%/9.3%
    Michael Ohlman: age 22, .944 OPS, 22.0% K. PROSPECT: 14%/11.9%
    Jason Esposito: age 22, .567 OPS. NON-PROSPECT (OPS): 0.5%
    Brenden Webb: age 23, .632 OPS. NON-PROSPECT (OPS): 0.5%
    John Ruettiger: age 23, .619 OPS. NON-PROSPECT (OPS): 0.5%
    Jerome Pena: age 24. NON-PROSPECT (age): 0.5%
    Sammie Starr: age 25. NON-PROSPECT (age): 0.5%
    Travis Adair: age 25. NON-PROSPECT (age): 0.5%
    Allan de San Miguel: age 25. NON-PROSPECT (age): 0.5%
    Zane Chavez: age 26. NON-PROSPECT (age): 0.5%
    
    Here's a list of 2013 prospects only, ranked by the weighted-mean prediction:
    1. Nick Delmonico 17.9%
    2. Michael Ohlman 11.9%
    3. Adrian Marin 11.9%
    4. Christian Walker 9.3%
    5. Glynn Davis 9.2%
    6. Wynston Sawyer 5.9%
    7. Trey Mancini 5.8%
    8. Mike Yastrzemski 5.1%
    9. Lucas Herbst 4.3%
    10. Connor Bierfeldt 3.8%
    (Bernadina 5.8%, but a non-prospect)
    (Ledesma 4.4%, but a non-prospect)
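
    For anyone who wants to rebuild a ranking like this from the per-league lists, here is a minimal Python sketch. It assumes each line follows the layout used above ("Name: details. LABEL: base rate%/weighted-mean prediction%"); the names `parse_line` and `rank_prospects` are made up for illustration and are not part of the original spreadsheet or Matlab work.

```python
import re

# Assumed line layout, copied from the lists above:
#   "Name: age N, ... . PROSPECT: <base rate>%/<weighted-mean prediction>%"
#   "Name: age N. NON-PROSPECT (reason): <base rate>%"
LINE_RE = re.compile(
    r"^(?P<name>[^:]+): .*?\. (?P<label>NON-PROSPECT|PROSPECT)"
    r"(?: \((?P<reason>[^)]+)\))?: (?P<base>[\d.]+)%"
    r"(?:/(?P<weighted>[\d.]+)%)?$"
)


def parse_line(line):
    """Parse one player line; return None if it does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    weighted = m.group("weighted")
    return {
        "name": m.group("name"),
        "is_prospect": m.group("label") == "PROSPECT",
        "base_pct": float(m.group("base")),
        "weighted_pct": float(weighted) if weighted else None,
    }


def rank_prospects(lines):
    """Keep PROSPECT rows only, sorted by weighted-mean prediction (high first)."""
    rows = [row for row in map(parse_line, lines) if row is not None]
    keep = [r for r in rows if r["is_prospect"] and r["weighted_pct"] is not None]
    return sorted(keep, key=lambda r: r["weighted_pct"], reverse=True)


# A few lines copied from the 2013 lists above.
SAMPLE_LINES = [
    "Ronarsy Ledesma: age 20, .757 OPS, 16.0%. NON-PROSPECT (Ks): 0.5%/4.4%",
    "Trey Mancini: age 21, .824 OPS, 15.9%. PROSPECT: 7%/5.8%",
    "Jared Breen: age 22, .549 OPS. NON-PROSPECT (OPS): 0.5%",
    "Connor Bierfeldt: age 22, .788 OPS, 23.6%. PROSPECT: 7%/3.8%",
    "Mike Yastrzemski: age 22, .835 OPS, 17.8%. PROSPECT: 7%/5.1%",
    "Nick Delmonico: age 20, .819 OPS. PROSPECT: 14%/17.9%",
]

for rank, row in enumerate(rank_prospects(SAMPLE_LINES), start=1):
    print(f"{rank}. {row['name']} {row['weighted_pct']}%")
```

    Non-prospects such as Ledesma are dropped before sorting even when they carry a weighted-mean number, which mirrors the parenthetical notes at the end of the list.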

    Please note that a lot of the 2013 draftees (especially in the GCL) haven't reached 150 PAs, and therefore can't be judged by this system yet.

    I'm encouraged by these lists. For the most part, they confirm what we already know, which is a good sign. This fairly simple and unsophisticated statistical tool can rule out most of the guys who won't make it. It can't reliably rank the guys who might make it, but I'm working on that. The early version of the ranking is really promising, because it's more or less the order in which people would rank Orioles hitters, the exceptions being Sawyer and Davis landing above Mancini and Yastrzemski.

    What to take away here: if you are ranking prospects and you include one of the non-prospects on your list (unless it's a very long one), you may want to reconsider.


    What next?
    This is a big project, and it's taken a long time, both in terms of the duration between start date and this post, and the actual hours invested. Next, I'm going to try to duplicate what I just did in this post for pitching prospects, and get a handle on the base expected success rates by age, league, and performance. Unfortunately, there is no single stat for pitchers that is both (1) as informative and (2) as easy to calculate/look up as OPS is for hitters. Of course, I intend to look at ERA, K/9, BB/9, K/BB, and WHIP. I expect my IP cutoff to be around 40; aside from just sounding good, it should be about the same amount of game time as 150 PA (usually just over 4 PA/IP). The biggest problem will be with evaluation: just using straight WAR probably won't be enough, as I'd like to be able to tell future starters from relievers. But those are bridges to cross when I reach them.
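
    A quick check of the cutoff arithmetic, as a sketch only. The rates of 4.0, 4.2, and 4.4 PA/IP are assumed stand-ins for "just over 4" and are not measured values, and the function names are made up for illustration.

```python
def pa_to_ip(pa, pa_per_ip):
    """Approximate innings pitched that cover the given number of plate appearances."""
    return pa / pa_per_ip


def ip_to_pa(ip, pa_per_ip):
    """Approximate plate appearances (batters faced) in the given innings pitched."""
    return ip * pa_per_ip


# Assumed illustrative rates around "just over 4 PA/IP".
for rate in (4.0, 4.2, 4.4):
    print(
        f"{rate} PA/IP: 150 PA is about {pa_to_ip(150, rate):.1f} IP, "
        f"40 IP is about {ip_to_pa(40, rate):.0f} PA"
    )
```

    Across that assumed range, a 40 IP cutoff corresponds to roughly 160 to 176 batters faced, so it is slightly stricter than 150 PA but in the same neighborhood. Innings here are plain decimals, not the .1/.2 thirds notation.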

    Additionally, I'd like to develop and refine a prediction method to improve those simple 13% or 14% numbers. The plan is to just calculate Brier scores for a whole bunch of different methods and pick the best. I'm going to try a variety of weighted means, predictions via categorization or fits, etc. etc. I'll let you know when I've got something I like.
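
    For reference, the Brier score is just the mean squared difference between the predicted probability and the 0/1 outcome, and lower is better. Here is a minimal Python sketch of the "score a bunch of methods and pick the best" plan. The outcomes and the two candidate methods below are made-up toy numbers, not results from the dataset, and `brier_score` and `best_method` are hypothetical helper names.

```python
def brier_score(predictions, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - o) ** 2 for p, o in zip(predictions, outcomes))
    return total / len(predictions)


def best_method(methods, outcomes):
    """Score every candidate method and return (name of lowest score, all scores)."""
    scores = {name: brier_score(preds, outcomes) for name, preds in methods.items()}
    return min(scores, key=scores.get), scores


# Toy data: 1 = the player eventually succeeded, 0 = he did not.
toy_outcomes = [1, 0, 0, 0, 0, 1, 0, 0]

toy_methods = {
    # Everyone gets the same flat base rate.
    "flat_13_percent": [0.13] * 8,
    # A hypothetical method that spreads its predictions out.
    "toy_weighted_mean": [0.40, 0.05, 0.05, 0.10, 0.05, 0.30, 0.05, 0.05],
}

winner, scores = best_method(toy_methods, toy_outcomes)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("best:", winner)
```

    A flat prediction at the overall base rate makes a useful baseline here: any more elaborate method should beat it on held-out players before it is worth using.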

    I'd like to develop a prediction system for pitchers. That is quite far in the future, right now.

    I'd also like to add data from the Eastern League, but it takes quite a while to collate, as there are far more major leaguers to look up and the AAAA guys who aren't prospects have to be removed.

    As always, feedback is very welcome. I'm happy to explain my reasoning behind any decision made during the research process or to defend my conclusions. Please remember that prospect prediction is a very inexact science even when you try to put some numbers on it, and I'm not trying to claim otherwise. I consider this a simple first attempt to quantify the definition of a "prospect" and to begin to understand just how likely a given player is to succeed or fail in baseball.


  2. #2
    RZNJ is offline Plus member Since 04/03 Hall of Fame
    Join Date
    Sep 2003
    Location
    New Jersey
    Posts
    16,095
    Wow! Looks like a lot of work. I will read it eventually.

  3. #3
    Frobby is online now Hangout Blogger Hall of Fame
    Join Date
    Dec 2003
    Location
    Bethesda MD
    Posts
    36,208
    skanar, could you please elaborate on your post?

    You could probably turn this in for your doctoral thesis.

  4. #4
    Join Date
    Nov 2004
    Location
    Rehoboth Beach, DE
    Posts
    12,298
    Quote Originally Posted by RZNJ View Post
    Wow! Looks like a lot of work. I will read it eventually.
    Hah, this. I need to table this for an off-day when I have two hours to spare.

  5. #5
    eb45 is offline Aberdeen Reporter All-Star
    Join Date
    Dec 2007
    Location
    Ohio
    Posts
    2,279
    This is phenomenal - have you considered doing something like it for the other leagues at these levels? I'd be really interested to know how, say, the California League skews prospect development.

    I think one thing to look out for is a bit of a chicken-and-egg situation with regard to BB%: are teams preferring players with more "visible" tools? A lot of player success is determined by how they're managed. I'd be interested to see if, among players who reached at least the 1 threshold, minor league walk rates determined a level of future success.

  6. #6
    skanar is offline Plus Member Since 10/12 Major League Starter
    Join Date
    Aug 2008
    Posts
    1,588
    Quote Originally Posted by eb45 View Post
    This is phenomenal - have you considered doing something like it for the other leagues at these levels? I'd be really interested to know how, say, the California League skews prospect development.
    I'd be interested, too, but it takes a while. Something to put at the end of the list; I want to do pitchers and a more refined prediction method before tackling non-Orioles leagues.

    Quote Originally Posted by eb45 View Post
    I think one thing to look out for is a bit of a chicken-and-egg situation with regard to BB%: are teams preferring players with more "visible" tools? A lot of player success is determined by how they're managed. I'd be interested to see if, among players who reached at least the 1 threshold, minor league walk rates determined a level of future success.
    Yeah, one thing I considered was, after identifying certain subsets, to look and see if there were factors that predicted future stardom rather than just success. But it's pretty hard just to make the basic determination, so I think any attempt at doing something more advanced can wait. It's hard enough to tell Nick Markakis (a 3) AND Luis Matos (2) from Felix Pie (1), Val Majewski (0), and Tripper Johnson (-1); trying to differentiate between Matos and Markakis and even Adrian Beltre is a problem I'd rather leave for later/someone else.

    As regards BB%: there's just no good relationship, at any level. Maybe it will show up when I do some intersections, but I doubt it.

    Re: management. This definitely plays a role. The fact that Plate Appearances is so predictive surprised me, and I'm still not sure what to do with it. I'm leaning toward a threefold hypothesis: (1) better players hit at the top of the order; (2) teams play their top prospects as much as possible; and (3) midseason promotions are unwise. But this is a tough, tough thing to figure out.

  7. #7
    Join Date
    Jan 2006
    Location
    St. Michaels
    Posts
    3,949
    What tools do you use for this? R, S, Excel, SQL database, etc?

  8. #8
    skanar is offline Plus Member Since 10/12 Major League Starter
    Join Date
    Aug 2008
    Posts
    1,588
    Quote Originally Posted by srock View Post
    What tools do you use for this? R, S, Excel, SQL database, etc?
    All the data is in Excel spreadsheets, which I use for simple counts and graphs. When I do something a little fancier, I port it over to Matlab.

  9. #9
    Join Date
    Jan 2006
    Location
    St. Michaels
    Posts
    3,949
    Quote Originally Posted by skanar View Post
    All the data is in Excel spreadsheets, which I use for simple counts and graphs. When I do something a little fancier, I port it over to Matlab.
    How are your programming skills? Python has some wonderful tools for this type of thing. Interesting blog post here.

    I am a bit of a Python enthusiast, not a stat head, but a friend of mine is often regaling me with the frustrations of Matlab, SAS, and Excel.

  10. #10
    skanar is offline Plus Member Since 10/12 Major League Starter
    Join Date
    Aug 2008
    Posts
    1,588
    Quote Originally Posted by srock View Post
    How are your programming skills? Python has some wonderful tools for this type of thing. Interesting blog post here.

    I am a bit of a Python enthusiast, not a stat head, but a friend of mine is often regaling me with the frustrations of Matlab, SAS, and Excel.
    I'm not a great programmer, but I know enough Matlab to do what I need to. I've used it for research in the past.

    Excel IS terrible, but for the simple counting stuff I'm doing here I've found it to be adequate.

    I tried Python once, though not for statistics. I just don't do enough programming to need to learn it. Maybe sometime in the future.

  11. #11
    Oldorioles is offline Plus Member since 04/11 Major League Starter
    Join Date
    Dec 2006
    Location
    Fallston, MD
    Posts
    1,176
    Very nice work.

  12. #12
    waroriole is offline Plus Member Since 6/08 Hall of Fame
    Join Date
    Jan 2008
    Location
    Philadelphia, PA
    Posts
    14,861
    Just some random thoughts I had while reading this:

    1. For NYPL, the % of guys who make the ML halves after 21. This makes sense, because these guys are a year removed from college. If you're still in short season ball at that point, you're very likely to be organizational filler.

    2. I was a bit surprised that 50% of hitters in the SAL at 19 make the bigs. To me, that reads that if you're a HS draftee, and you don't have to go to short season in your 2nd year, you're 50/50 for being a ML. I thought it would be a lower rate.

    3. In relation to #2, it does make sense that 20 y/o at A+ would have a 60% rate of being ML. The higher you keep going in relation to your age, the better your chances.

    4. Was Zobrist a senior sign? It looks like he was 24 when he played his first MiL game.

    Very interesting read. I'll be curious to see how the pitching report comes out. I imagine there will be less concrete data, due to the nature of arm injuries.

  13. #13
    skanar is offline Plus Member Since 10/12 Major League Starter
    Join Date
    Aug 2008
    Posts
    1,588
    Quote Originally Posted by waroriole View Post
    Just some random thoughts I had while reading this:

    1. For NYPL, the % of guys who make the ML halves after 21. This makes sense, because these guys are a year removed from college. If you're still in short season ball at that point, you're very likely to be organizational filler.
    The NYPL seems a bit weaker in general, except for 21 year olds. This makes a ton of sense: if you're in short-season after being drafted out of HS, it probably means you got time in a rookie league for a year or two but weren't ready/good enough for full-season ball. But lots of the 21 year olds were drafted out of college and then sent to short-season because that's how the timing works out. 20 year olds too, to a lesser extent.

    Quote Originally Posted by waroriole View Post
    2. I was a bit surprised that 50% of hitters in the SAL at 19 make the bigs. To me, that reads that if you're a HS draftee, and you don't have to go to short season in your 2nd year, you're 50/50 for being a ML. I thought it would be a lower rate.

    3. In relation to #2, it does make sense that 20 y/o at A+ would have a 60% rate of being ML. The higher you keep going in relation to your age, the better your chances.
    Young prospects are the best prospects. There just aren't all that many of them. That's one reason I'm not thrilled the O's have traded Gabe Lino (19 in SAL) and Delmonico (20 in CARL) in consecutive seasons. Yes, there were serious scouting concerns, but those are the sorts of guys with the best chance to contribute.

    Quote Originally Posted by waroriole View Post
    4. Was Zobrist a senior sign? It looks like he was 24 when he played his first MiL game.
    Zobrist and Luke Scott are both annoying, as older players who ended up doing very well. Scott was drafted in 2001, at age 23 (already on the older side), then needed time to recover from TJ surgery. In 2002 (age 24), he spent half a year each at A and A+, then in 2003 half a year each at A+ and AA (age 25). His stats were good at every level, but not exceptionally so.

    Zobrist was also drafted at age 23 and immediately debuted in the NYPL. He split age-24 between A and A+, and spent age 25 mostly at AA. He hit quite well at every level. Notably, his first two partial big league seasons (at age 25 and 26) were quite poor; Zobrist has credited a swing mechanic with fixing issues. The Rays GM has said "he added a power component" and "he became a lot more physical."

    In fact, through the minors, Zobrist hit for a great average, walked a lot, and had doubles power, but through 2007, averaged one HR per 84 PAs. Since 2007, he's averaged one HR every 34 PAs. It is, in fact, extremely similar to the improvement in Melvin Mora's numbers. Suspicious? Yeah, probably, but he's never been linked to anything as far as I'm aware.

    Quote Originally Posted by waroriole View Post
    Very interesting read. I'll be curious to see how the pitching report comes out. I imagine there will be less concrete data, due to the nature of arm injuries.
    I expect pitchers to (1) have lower overall success rates and (2) be less predictable. But that's the point of getting the numbers. Glad you liked it, but please don't hold your breath for the rest; this has taken me about a year, and it's entirely possible that the pitchers will take just as long.

  14. #14
    Join Date
    Jan 2006
    Location
    St. Michaels
    Posts
    3,949
    Quote Originally Posted by skanar View Post
    Zobrist and Luke Scott are both annoying, as older players who ended up doing very well. Scott was drafted in 2001, at age 23 (already on the older side), then needed time to recover from TJ surgery. In 2002 (age 24), he spent half a year each at A and A+, then in 2003 half a year each at A+ and AA (age 25). His stats were good at every level, but not exceptionally so.

    Zobrist was also drafted at age 23 and immediately debuted in the NYPL. He split age-24 between A and A+, and spent age 25 mostly at AA. He hit quite well at every level. Notably, his first two partial big league seasons (at age 25 and 26) were quite poor; Zobrist has credited a swing mechanic with fixing issues. The Rays GM has said "he added a power component" and "he became a lot more physical."

    In fact, through the minors, Zobrist hit for a great average, walked a lot, and had doubles power, but through 2007, averaged one HR per 84 PAs. Since 2007, he's averaged one HR every 34 PAs. It is, in fact, extremely similar to the improvement in Melvin Mora's numbers. Suspicious? Yeah, probably, but he's never been linked to anything as far as I'm aware.
    Zobrist, Mora, and Scott are good examples of when a human touch needs to be applied to the results. Zobrist started pro ball late, Mora didn't play baseball at all until he was like 16, and Scott had an injury.

    Applying a formula such as this is great for sorting out signal from noise and designing a system, but it is important for management to always take an individual's situation into account. Macro stats do not apply well to a specific person.

    Not that anyone is doing this here, just a thought.

  15. #15
    skanar is offline Plus Member Since 10/12 Major League Starter
    Join Date
    Aug 2008
    Posts
    1,588
    Quote Originally Posted by srock View Post
    Zobrist, Mora, and Scott are good examples of when a human touch needs to be applied to the results. Zobrist started pro ball late, Mora didn't play baseball at all until he was like 16, and Scott had an injury.

    Applying a formula such as this is great for sorting out signal from noise and designing a system, but it is important for management to always take an individual's situation into account. Macro stats do not apply well to a specific person.

    Not that anyone is doing this here, just a thought.
    Agree with all of this.

    The issue though is that everyone has a narrative, and there are always things you can point to and try to say, "this guy has a shot." I think just looking at the numbers is a huge strength of this approach, because any narratives become irrelevant.

    Missing 5% of the successes is a little higher than I was shooting for (I was hoping for 2-3%). Ideally, the more advanced prediction method will help with this.
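
    To be concrete about what "missing" means here: it is the share of eventual successes that the screen had labelled NON-PROSPECT. A minimal Python sketch follows, with made-up toy counts chosen only to land on 5%; they are not the real totals, and `miss_rate` is a hypothetical helper name.

```python
def miss_rate(labels, succeeded):
    """Share of eventual successes that were labelled NON-PROSPECT by the screen."""
    success_labels = [label for label, ok in zip(labels, succeeded) if ok]
    if not success_labels:
        raise ValueError("no successes in the sample")
    missed = sum(1 for label in success_labels if label == "NON-PROSPECT")
    return missed / len(success_labels)


# Toy sample: 20 eventual successes (1 of them screened out) and 80 failures.
toy_labels = ["PROSPECT"] * 19 + ["NON-PROSPECT"] * 1 + ["NON-PROSPECT"] * 80
toy_succeeded = [True] * 20 + [False] * 80

print(f"miss rate: {miss_rate(toy_labels, toy_succeeded):.1%}")
```

    The denominator is successes only, so this number says nothing about how many NON-PROSPECT labels were correct; that would be a separate count.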

OriolesHangout.com is an unofficial site and not associated with the Baltimore Orioles and part of Hangout Ventures LLC. Copyright ©2013 | Privacy Policy | Advertise with us