
Matthew Malkus: A PCA for Batter Similarity Scores (Part 1: Basic Methodology)


weams


This is the first in a series of pieces on a tool I've been working on. Admittedly, right now it's quite raw, and probably needs some adjustments, which I'll elaborate on towards the end of this post. It's also quite lengthy - set it aside for when you have ample time to follow along, as there are some example calculations included to demonstrate the process.

Most of you are familiar with the "Similarity Scores" feature on Baseball Reference. If not, the explanation can be found here. The idea is to provide player comps using the player's statistics. The feature has been around a while and is based on a fairly simplistic "points-based" approach. Such an approach has the advantage of being easy to follow and intuitive, and as a quick tool for sparking fun conversation, it's nice. However, it's not very useful for projection, for many reasons - not least that the points used are arbitrary and the statistics used are result statistics (hits, HRs, RBIs, etc.) rather than being process-driven. It's also intended to work on a player's entire career. Some players have one or more drastic shifts in results over the course of their careers - and, to project a player in 2015 from his work in 2013-2014, we need to isolate data by season.

With the mountains of granular data available since Similarity Scores were first published, I thought it would be interesting to take a cut at creating something new in the same vein. My primary objectives were to create a similarity metric that (a) compared individual seasons rather than entire careers; (b) was based primarily on a hitter's "process" or approach at the plate rather than strictly on results, which are influenced heavily by luck; and (c) was mathematically defensible - in other words, non-arbitrary.

I downloaded batted ball and plate discipline data from FanGraphs for all seasons 2002-2014 with 250+ plate appearances. This yielded 4,020 qualifying player seasons. I removed counting statistics (for example, number of infield hits), leaving only rate statistics in the dataset. I also removed any statistic which was derived from other statistics in the dataset (for example, GB/FB ratio, which of course is a ratio of the GB% and FB% statistics already in the dataset). Finally, I augmented the data with a few additional variables: K%, BB%, and ISO. Although these variables contain results which can be influenced by luck, they offer much-needed context used to interpret the ultimate results of the analysis, and tend to be more driven by a player's underlying skill set over a 250+ PA sample.
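For those who want to follow along, here is a minimal sketch of that data prep in Python. The file name and raw counting-stat column names are placeholders standing in for whatever your FanGraphs export looks like - the point is simply: filter to 250+ PA, keep the rate statistics, and add K%, BB%, and ISO.

[code]
import pandas as pd

# Hypothetical FanGraphs batting export, one row per player-season, 2002-2014.
# The file name and counting-stat column names are placeholders.
df = pd.read_csv("fangraphs_batting_2002_2014.csv")

# Keep only player-seasons with 250+ plate appearances.
df = df[df["PA"] >= 250].copy()

# Add the three context variables alongside the batted-ball / plate-discipline rates.
df["K%"] = df["SO"] / df["PA"]
df["BB%"] = df["BB"] / df["PA"]
df["ISO"] = df["SLG"] - df["AVG"]

# The 19 rate statistics that feed the analysis; counting stats and derived
# ratios like GB/FB are excluded.
rate_cols = ["BABIP", "LD%", "GB%", "FB%", "IFFB%", "HR/FB", "IFH%",
             "O-Swing%", "Z-Swing%", "Swing%", "O-Contact%", "Z-Contact%",
             "Contact%", "Zone%", "F-Strike%", "SwStr%", "BB%", "K%", "ISO"]
X = df[rate_cols]
[/code]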

I then performed a principal component analysis on the dataset. Without getting too far into the weeds on how PCA works, the best way to explain it is that it allows the data to speak for itself. Correlations between variables are taken into account by the process, so as to accurately represent the variability in the system. For example, K% and swinging strike % are highly correlated, and therefore shouldn't be double-counted.

The great thing about PCA is that it creates a set of linear combinations of the variables (eigenvectors) which explain the maximum amount of variation in the dataset. These linear combinations can then be interpreted by the user. Ideally, each linear combination will be intuitive or explain some separate skill a hitter possesses, or some phenomenon a hitter endures.
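In code, the whole PCA step is only a few lines. This is a sketch continuing from the `X` data frame above; it assumes the analysis is run on the correlation matrix (i.e. on standardized variables), which is consistent with the "divide by 19" interpretation used below.

[code]
import numpy as np

# PCA via the correlation matrix of the 19 rate statistics. Using correlations
# (rather than covariances) puts all of the variables on a common scale.
corr = np.corrcoef(X.values, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # re-sort, largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# eigvecs[:, 0] corresponds to the first column of weights in the table below,
# eigvals[0] to its eigenvalue. (The sign of any eigenvector is arbitrary.)
[/code]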

Results of the PCA are summarized in the following table:

[table=width: 500]
[tr][td]Eigenvalues[/td][td]5.7517[/td][td]3.3687[/td][td]2.1167[/td][td]1.7569[/td][td]1.4742[/td][td]1.0679[/td][td]0.9719[/td][/tr]
[tr][td]BABIP[/td][td]0.0014[/td][td]0.0376[/td][td]-0.5039[/td][td]-0.0161[/td][td]0.1951[/td][td]-0.3996[/td][td]-0.1602[/td][/tr]
[tr][td]LD%[/td][td]-0.0621[/td][td]-0.0076[/td][td]-0.3039[/td][td]-0.0077[/td][td]0.4901[/td][td]-0.3100[/td][td]0.4311[/td][/tr]
[tr][td]GB%[/td][td]-0.2000[/td][td]0.1923[/td][td]-0.3332[/td][td]0.1313[/td][td]-0.2881[/td][td]0.3746[/td][td]-0.2901[/td][/tr]
[tr][td]FB%[/td][td]0.2191[/td][td]-0.1803[/td][td]0.4543[/td][td]-0.1222[/td][td]0.0562[/td][td]-0.2194[/td][td]0.0846[/td][/tr]
[tr][td]IFFB%[/td][td]0.0204[/td][td]0.0339[/td][td]0.4500[/td][td]0.1588[/td][td]-0.1623[/td][td]-0.2093[/td][td]-0.0065[/td][/tr]
[tr][td]HR/FB[/td][td]0.3150[/td][td]-0.1494[/td][td]-0.0847[/td][td]-0.1163[/td][td]0.1151[/td][td]0.0464[/td][td]-0.3903[/td][/tr]
[tr][td]IFH%[/td][td]-0.0308[/td][td]0.1262[/td][td]-0.0635[/td][td]0.1010[/td][td]-0.3767[/td][td]-0.6465[/td][td]-0.3750[/td][/tr]
[tr][td]O-Swing%[/td][td]0.0628[/td][td]0.3926[/td][td]0.0408[/td][td]-0.4780[/td][td]-0.0456[/td][td]-0.0237[/td][td]0.0132[/td][/tr]
[tr][td]Z-Swing%[/td][td]0.2173[/td][td]0.2649[/td][td]0.0781[/td][td]0.1262[/td][td]0.3864[/td][td]0.1856[/td][td]-0.2030[/td][/tr]
[tr][td]Swing%[/td][td]0.1383[/td][td]0.4568[/td][td]0.1221[/td][td]-0.0124[/td][td]0.2664[/td][td]0.0666[/td][td]-0.1362[/td][/tr]
[tr][td]O-Contact%[/td][td]-0.2991[/td][td]0.0578[/td][td]0.0822[/td][td]-0.4378[/td][td]-0.0147[/td][td]-0.0566[/td][td]-0.0617[/td][/tr]
[tr][td]Z-Contact%[/td][td]-0.3719[/td][td]-0.0137[/td][td]0.1083[/td][td]-0.1036[/td][td]0.1639[/td][td]-0.0254[/td][td]-0.1141[/td][/tr]
[tr][td]Contact%[/td][td]-0.3915[/td][td]-0.0543[/td][td]0.1209[/td][td]-0.0605[/td][td]0.1612[/td][td]-0.0335[/td][td]-0.1290[/td][/tr]
[tr][td]Zone%[/td][td]-0.0996[/td][td]0.0025[/td][td]0.1108[/td][td]0.6586[/td][td]0.1612[/td][td]-0.0717[/td][td]-0.0568[/td][/tr]
[tr][td]F-Strike%[/td][td]0.0036[/td][td]0.4160[/td][td]0.0464[/td][td]0.0673[/td][td]-0.0562[/td][td]-0.1479[/td][td]0.1350[/td][/tr]
[tr][td]SwStr%[/td][td]0.3795[/td][td]0.1873[/td][td]-0.0724[/td][td]0.0668[/td][td]-0.0654[/td][td]0.0431[/td][td]0.0758[/td][/tr]
[tr][td]BB%[/td][td]0.0975[/td][td]-0.4485[/td][td]-0.1303[/td][td]-0.0628[/td][td]-0.0465[/td][td]0.0831[/td][td]-0.0357[/td][/tr]
[tr][td]K%[/td][td]0.3280[/td][td]0.0110[/td][td]-0.1681[/td][td]-0.0384[/td][td]-0.3081[/td][td]-0.0522[/td][td]0.3201[/td][/tr]
[tr][td]ISO[/td][td]0.2919[/td][td]-0.2020[/td][td]0.0451[/td][td]-0.1407[/td][td]0.2195[/td][td]-0.0940[/td][td]-0.4241[/td][/tr]
[/table]

OK - an explanation of this table is in order.

Row 1 lists the eigenvalues. A simple way of thinking about an eigenvalue is that its relative size represents the share of variation explained by the linear weights in its column. The table is sorted by eigenvalue - the most important set of linear weights is on the left, representing about 30.3% (5.7517/19, where 19 is the number of variables) of the total variation in the dataset. The next column represents an additional 17.7% of the variation, after the first 30.3% is already accounted for. And so forth. Together, the seven components shown represent about 86.9% of all hitter variation.
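Continuing the sketch above, those shares fall straight out of the eigenvalues:

[code]
# Share of total variation explained by each component, and the cumulative
# share of the seven components kept here.
explained = eigvals / eigvals.sum()   # the sum is 19 for a correlation-matrix PCA
print(explained[:7])                  # roughly [0.303, 0.177, 0.111, 0.092, 0.078, 0.056, 0.051]
print(explained[:7].sum())            # roughly 0.869 - the 86.9% quoted above
[/code]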

Going down each column are the linear weights assigned to each variable. Each column can be used to "score" a player. Let's take Nick Markakis' 2014 season as the example to build around. We would calculate his "score" on the first component by multiplying the weights in the first column of the first table by Markakis' value on each variable. Starting from the top, Markakis had a 2014 BABIP of .299, an LD% of 19.6%... you get the point. So:

(0.299 * 0.0014) + (0.196 * -0.0621) + (0.459 * -0.2000) + ... + (0.118 * 0.3280) + (0.111 * 0.2919) = -0.7176

We have a number. Great! What does that number mean? Well....nothing, really. It's not in any sort of unit of measure we can comprehend. It's just a number. To interpret it, we'll need to know what the average score is for the dataset, and the variance of scores. Then we can see how far above or below average this score was in context.
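In code, both the scoring and the standardization are a couple of lines. This is a sketch that assumes the rate columns are stored as decimals (e.g. 0.196 for a 19.6% line-drive rate), matching the arithmetic above.

[code]
# Score every player-season on the first component, then standardize against
# the full 2002-2014 sample, mirroring the Markakis calculation above.
v1 = eigvecs[:, 0]                    # first column of weights from the table
scores = X.values @ v1                # one raw score per player-season

z1 = (scores - scores.mean()) / scores.std()

# e.g. Markakis 2014: raw score of about -0.72 against a sample mean of about
# -0.49, which works out to roughly 1.6 standard deviations below average.
[/code]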

We'll also need to interpret what a high score for this metric means, and what a low score means. Take a look at the weights in the first column of the first table, which were used to compute this score. The weights with the largest magnitudes carry the most influence in the score. In this case, to get a high score, a player would probably need to have:

  • A high HR/FB rate
  • A high whiff rate (SwStr%)
  • A high strikeout rate
  • A low contact rate, particularly on pitches inside the strike zone (Z-Contact%).

To a lesser extent, the next-largest weights show that a high score would probably also represent:

  • A high ISO
  • Poor contact outside the zone (O-Contact%), in addition to inside the zone.

What do these characteristics suggest? Interpretation can be tricky, but the combination in the lists above seems to suggest that high scorers are "selling out for power" - they are swinging hard, missing a lot, but hitting more homers because of it.

None of this really sounds like Markakis, so intuitively, we'd think that he should score pretty low here. Indeed, the average score was -0.491; Nick's 2014 season was about 1.60 standard deviations below average. By contrast, you might have just thought of someone like Chris Davis when you read that last paragraph. Indeed, Chris Davis' 2014 season was 2.19 standard deviations above average, and Davis has never logged a season that wasn't at least 1.92 standard deviations above average. The two are different hitters, which is obvious watching them. But now we have systematic proof.

Going through the same process, we can come up with scores for each of the other columns in the first table as well. Again, we'll need to examine what a "high score" means for each column, so that we can interpret the results. In my best judgment, I assigned names to each score/column. The description of each score is below, along with the highest scorer in each category for the 2014 season.

  • Vector 1: "Sell Out for Power" - already described above. George Springer
  • Vector 2: "Impatient Hacker" - high scorers are swinging a ton consistently, and are walking quite a bit less than average. Wilson Ramos
  • Vector 3: "Weak FB Hitter" - high scorers have very low BABIPs because they are popping up and hitting lots of weak flies instead of hitting liners and grounders. Chris Heisey
  • Vector 4: "Pitchers Attack" - for some reason, high scorers are being thrown a ton of strikes. They don't swing a lot when they are thrown balls. They have marginally lower power than average, so maybe pitchers just aren't afraid of these guys. George Springer (again)
  • Vector 5: "Balanced Masher" - high scorers are good all-around hitters. They swing at lots of strikes, mash line drives, and don't strike out very much. Freddie Freeman
  • Vector 6: "Slow GB Hitter" - high scorers are hitting a ton of ground balls, but they aren't getting many infield hits. Bad combination. Everth Cabrera
  • Vector 7: "Put On a Glove" - my favorite category name. High scorers are striking out a lot, and though they hit a lot of line drives when they connect, they aren't hitting for power or hitting it hard enough for the ball to fall in. They should probably go put on a glove. Eugenio Suarez

Note that these vector names might not capture everything about what the vector represents. For example, no one is suggesting that Everth Cabrera is slow, necessarily - maybe he was just unlucky - but he did hit a whopping 66.9% of balls on the ground, and is over 60% for his career. Admittedly, these names could be better, and I'm rather open to other suggestions.

Now we can look at z-scores (+/- standard deviations from average score) on each of these 7 metrics and get an idea of what kind of hitter we have on our hands. Continuing with the Markakis and Davis examples...

[table=width: 750]
[tr][td]Name[/td][td]Year[/td][td]Sell Out for Power[/td][td]Impatient Hacker[/td][td]Weak FB Hitter[/td][td]Pitchers Attack[/td][td]Balanced Masher[/td][td]Slow GB Hitter[/td][td]Put On a Glove[/td][/tr]
[tr][td]Nick Markakis[/td][td]2014[/td][td]-1.596[/td][td]-0.293[/td][td]0.344[/td][td]-1.698[/td][td]-0.603[/td][td]-0.173[/td][td]-0.131[/td][/tr]
[tr][td]Chris Davis[/td][td]2014[/td][td]2.188[/td][td]0.005[/td][td]-0.560[/td][td]-0.616[/td][td]-0.343[/td][td]0.059[/td][td]1.785[/td][/tr]
[/table]

Nick comes out looking like the balanced, contact-oriented hitter he was, while Davis looks like a guy who was swinging from the heels and failing a lot. Promising start.

A caveat - as I said, this is very rough at this point. One thing that I should do, which I did not do to this point, is to adjust the data by season and possibly also by ballpark so that different seasons are more comparable (along the same lines as OPS+). I anticipate that the ordering and even the interpretation of the vectors might change once I do this. Particularly, the "Pitchers Attack" score might be highly correlated with time - Zone% has been decreasing by nearly a full percentage point per year over the sample, whether due to a smaller strike zone or for some other reason that doesn't immediately come to mind. I might consider removing steroid-era Barry Bonds from the dataset as an extreme outlier with his absurd 25-35% walk rates, as well.
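For the curious, one simple version of that season adjustment would be to standardize each statistic within its own season before running the PCA. This is only a sketch, not necessarily the method I'll settle on, and the `Season` column name is a placeholder.

[code]
# De-trend by converting each rate statistic to a z-score within its own
# season, so league-wide drifts (like the steadily falling Zone%) don't
# masquerade as differences between hitters.
X_adj = df.groupby("Season")[rate_cols].transform(
    lambda col: (col - col.mean()) / col.std()
)
[/code]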

My next piece will either revolve around de-trending the data to standardize data by season, or how this system would be used to compare player seasons. Sure, Nick Markakis and Chris Davis might not be very similar, but who else are they similar to? The order I do this in probably depends on what sort of feedback I get, and how difficult I find the de-trending process.


Hi guys, this is Matt. Admittedly this is a bit dense, but I felt it was necessary to explain the "how" before I can get into the more interesting applications. As I alluded to at the end, I have a comparison tool already set up. To whet your appetite for that a bit, you might find the following comps interesting:

Most similar to Travis Snider, 2014:

1. Billy Butler, 2009

2. Chipper Jones, 2011

3. Torii Hunter, 2010

4. Billy Butler, 2011

5. Russell Martin, 2007

Most similar to Colby Rasmus, 2014:

1. Jarrod Saltalamacchia, 2012

2. Jarrod Saltalamacchia, 2011

3. Chris Davis, 2014

4. Oswaldo Arcia, 2014

5. Jay Bruce, 2013

Most similar to Nick Markakis, 2014:

1. Martin Prado, 2013

2. Alberto Callaspo, 2012

3. Maicer Izturis, 2007

4. Yangervis Solarte, 2014

5. Alberto Callaspo, 2011



Fantastic work. Thank you for sharing with us. I hope you get all the rep you deserve for this.


Nice work! Can you explain the methodology of the comparison tool?

That's my next article (maybe) but for a simplified explanation, it compares the scores that are computed above. This is easier to imagine in two dimensions - if you plotted each player-season on a simple X-Y graph, you'd calculate the distance in a straight line from every player-season to all the others, and the shortest distance would be the most similar. Here we've included 7 dimensions (each of the columns in my first table), but the idea is the same - the exception being that some of the axes of the graph (the leftmost vectors) are more important than others. So, where in geometry it doesn't matter if you are apart by 1 unit of X or 1 unit of Y, here the first vector ("Sell Out for Power") matters more than the others, when considering what is "nearest".
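A rough sketch of that idea in code - weighting each axis by its eigenvalue is one natural way to make the leftmost vectors count for more, though not necessarily the exact scheme:

[code]
import numpy as np

def most_similar(target_idx, score_matrix, weights, n=5):
    """Return the indices of the n player-seasons whose 7-component score
    vectors are closest to the target row, using a weighted straight-line
    distance (e.g. weights = the first seven eigenvalues)."""
    diffs = score_matrix - score_matrix[target_idx]
    dist = np.sqrt(np.sum(weights * diffs ** 2, axis=1))
    dist[target_idx] = np.inf          # don't match a season to itself
    return np.argsort(dist)[:n]
[/code]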



So is the theory, put very simply, that you first perform triage into various archetypes (sell out for power, high contact, etc.), then differentiate/sort within those larger groupings?

Like, Mark Reynolds and Nick Markakis may both have high walk rates, but you wouldn't want them anywhere near each other in this similarity sorting?




Yeah basically - these scores are formed first, and then I compare differences of the scores, rather than differences in the raw data. The scores do the heavy lifting of figuring out what elements are most important in differentiating hitters.

In the examples I gave above I intentionally removed comparisons to the same player. For example, Nick Markakis' 2011 season is a good comparison to Nick Markakis' 2014 season. That's not a very interesting thing to say, but it does provide some comfort that the system is working reasonably well.


This is fascinating, I read the article on Fangraphs and in addition to the work itself, your writing is extremely easy to read. Thank you so much for the effort!

I find player comps to be useful when evaluating a relatively young player, or perhaps one not well known in the AL East such as Everth Cabrera (thank you, I think, for the info given for him in the article ;))

Since the raw data comes from Baseball Reference, does this mean your tool compares Major League players to other Major League players, rather than projecting the future of a particular rookie or prospect, to guess who he might become?



Thanks for the kind words!

The work I've done so far only scores past seasons. It doesn't adjust for age, for example. It would take another extension of the model to develop something that could predict 2015 data from what we currently have. It's certainly possible, in theory, and I plan on working on it, but that's probably several stages away. I want to make sure the core methodology is working as well as possible first, which as I mentioned, probably involves performing some adjustments and re-running the scores.

Another application of this that I'd like to do more work on is the consistency of certain players. Nelson Cruz has been very consistent over the past 4 years, for example - he compares favorably to himself. That's probably to be expected of an established hitter in his early/mid 30's. By contrast, Travis Snider's 2014 season looks nothing like his previous seasons. He was a consistent player, until he wasn't. The O's are betting it's a real change. I think it says something that my model thinks it is, since as I mentioned, it's based on approach at the plate, not necessarily the end results.

