I recently had access to a lot of baseball data, specifically data on every season of every player in the history of MLB going back to 1871. Here’s some analysis of how baseball players lose speed, strength, or both over the course of their careers. The analysis consisted primarily of variable creation and data queries. Unfortunately, code not available 😦
I’m not too familiar with baseball, so I took some time to think about what might serve as a useful proxy for strength and speed. I came up with a few ideas combining different stats, but given my relative baseball ignorance I decided to search Google for more rigorously tested metrics that others might use.
This led me to a couple of sabermetrics pages where I found speed score and isolated power. The dataset contains all the stats needed to calculate isolated power, so I created this metric directly. The speed score is composed of a number of different factors (F1 through F6). I didn’t have all the required stats for every factor (grounded into double play, for example), but I thought a partial speed score would probably serve just fine, so I created a modified version incorporating F1 through F4.
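For reference, isolated power is conventionally defined as slugging percentage minus batting average, which reduces to extra bases per at-bat. The sketch below is not the original code (which is not available). The function and parameter names are hypothetical, and the season line in the example is made up.

```python
def isolated_power(at_bats, doubles, triples, home_runs):
    """Isolated power: extra bases per at-bat (equivalent to SLG - AVG)."""
    if at_bats == 0:
        return None  # undefined for a season with no at-bats
    # A double is worth 1 extra base, a triple 2, a home run 3.
    return (doubles + 2 * triples + 3 * home_runs) / at_bats


# Made-up season line: 500 at-bats, 30 doubles, 5 triples, 20 home runs.
iso = isolated_power(500, 30, 5, 20)
print(iso)  # 0.2
```

Returning `None` for zero at-bats keeps empty seasons out of later averages instead of raising a division error.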
Below are distributions for speed and strength scores for each season of each player. Isolated power worked well, but the speed score proved troublesome.
While the isolated power score created a fairly smooth distribution across players and seasons (albeit with long-tail errors), the speed score had a number of “bumps,” which turned out to be inordinate numbers of scores at suspiciously round values like 2.5, 3, 6.5, etc. My speed score turned out to pick up lots of players with very thin data (just a few hits, one stolen base, etc.) and regularly map them to somewhat high values. I played around with mapping these erroneous default values to something else and with adjusting the constant coefficients in the speed score calculations, and I considered incorporating player activity for a given season into the score to reward consistent performance. In the end I decided it was simplest to filter out rows where the player only stepped up to the plate a couple of times in the season.
After this, I cleaned one or two columns to map null values to 0, since they were throwing calculations off, and then created features to measure total years played and which season in the career a given row corresponded to. Next I filtered for players with careers of at least five seasons, to reduce input from players whose careers might not have lasted long enough to produce a measurable change in strength and speed over time. Filtering the data by the first instance for each player yields the set of 5,035 unique players and their respective correlation coefficients for speed and strength against season of career.
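The cleaning steps described so far can be sketched roughly as follows. This is an illustration under stated assumptions, not the original code: the column names (`player`, `year`, `at_bats`, `triples`), the `MIN_AT_BATS` cutoff, and the one-row-per-player-per-year layout are all hypothetical, and the sample rows are made up.

```python
from collections import defaultdict

MIN_AT_BATS = 20  # assumed cutoff for a "thin" season; the post gives no number
MIN_SEASONS = 5   # career-length filter described in the text


def prepare(rows):
    """Clean season rows and add season_number and career_length features."""
    # Map null values to 0 so later arithmetic does not fail.
    cleaned = [{k: (0 if v is None else v) for k, v in r.items()} for r in rows]
    # Drop seasons where the player barely batted.
    cleaned = [r for r in cleaned if r["at_bats"] >= MIN_AT_BATS]

    by_player = defaultdict(list)
    for r in cleaned:
        by_player[r["player"]].append(r)

    out = []
    for seasons in by_player.values():
        if len(seasons) < MIN_SEASONS:
            continue  # career too short to show a trend
        seasons.sort(key=lambda r: r["year"])
        for i, r in enumerate(seasons, start=1):
            out.append({**r, "season_number": i, "career_length": len(seasons)})
    return out


# Made-up rows: player "a" has five usable seasons plus one thin season,
# player "b" has a single season.
rows = [
    {"player": "a", "year": 1990 + i, "at_bats": 300,
     "triples": None if i == 2 else 4}
    for i in range(5)
]
rows.append({"player": "a", "year": 1995, "at_bats": 3, "triples": 0})
rows.append({"player": "b", "year": 1990, "at_bats": 300, "triples": 1})

prepared = prepare(rows)
print([r["season_number"] for r in prepared])  # [1, 2, 3, 4, 5]
```

Note that `career_length` here counts the seasons that survive the thin-season filter, which may differ from the calendar span of a career.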
From these histograms it appears that the correlation between strength and season has relatively little skew and is centered close to zero, while the correlation between speed and season is more often negative. Indeed, the correlation between speed and season for all seasons of all players is -0.11, while the correlation between strength and season is +0.06. These results suggest that speed declines over a career for more players than strength does.
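The per-player coefficients in these histograms are correlations between season number and score. A minimal sketch of that calculation, assuming the Pearson coefficient was used (the post does not say which one), with made-up scores for a single player:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Made-up speed scores for one player's five seasons.
season_numbers = [1, 2, 3, 4, 5]
speed_scores = [6.0, 5.5, 5.6, 4.9, 4.2]
r = pearson(season_numbers, speed_scores)
print(r)  # strongly negative: this player's speed score falls over time
```

Returning `None` for constant series avoids a division by zero for players whose score never changes.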
I was curious to see whether this effect was more pronounced in the later stages of one’s career, i.e. whether player speed decreases at a sharper rate towards the end of one’s career than it does in the beginning or over the whole career. I created variables speed_cor_tail and strength_cor_tail to examine the correlations in only the second half of players’ careers, and again in the second half of players’ careers with the addition of their scores from their rookie season. However, in neither case did the results look startlingly pronounced (below shows correlations for second half of career + rookie year).
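One way to select the seasons behind speed_cor_tail and strength_cor_tail in the “second half + rookie year” variant is sketched below. The function name is hypothetical, and the choice to put the middle season of an odd-length career into the second half is an assumption. The post does not say how the split was made.

```python
def tail_with_rookie(scores):
    """Select the rookie season plus the second half of a career.

    `scores` holds one value per season in career order. Returns the
    1-based season numbers and the matching scores.
    """
    n = len(scores)
    start = n // 2  # 0-based index where the second half begins (assumed split)
    indices = [0] + [i for i in range(start, n) if i != 0]
    return [i + 1 for i in indices], [scores[i] for i in indices]


# Made-up speed scores for a five-season career.
seasons, tail_scores = tail_with_rookie([6.0, 5.5, 5.6, 4.9, 4.2])
print(seasons)      # [1, 3, 4, 5]
print(tail_scores)  # [6.0, 5.6, 4.9, 4.2]
```

The two returned lists can then be passed to whatever correlation function is used for the full-career coefficients. Because these subsets contain only a handful of seasons, the resulting correlations are noisier than the full-career ones.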
- While the distributions pass an eyeball test for difference, I wouldn’t submit this as a result without some tests for statistical significance. I worked within one system, which didn’t seem to possess library implementations of stat tests.
- I think the tails of these distributions warrant a closer examination, since spikes for correlation of +1, -1, and (somehow)
- A different visualization that plots average speed/strength change across all players over time as a single line would also be a clearer companion to the histogram.
- Correlation measures linearity. I don’t know that the relationship between strength/speed and time is linear (it might look polynomial or something else entirely) but assume that correlation will measure valid signal to a sufficient extent.
- I will assume that extra bases gained from hits are a good proxy for strength, though it might be the case that hitting the ball has less to do with strength (up to a certain point) and far more to do with skill or accuracy or experience. It could be the case that players who increase in strength over time but lose accuracy will be scored lower. Ideally we would have more information: measures of player strength (fitness tests), benchmarks of the force required to hit a home run, and stadium sizes (some stadiums are bigger than others, so a Giants player might play a lot of home games in San Francisco where the stadium is unusually small and he can hit home runs, inflating his strength score). Without it, I treat isolated power as a sufficiently good indicator.
- There is a similar assumption about the validity of the speed score even though we can come up with scenarios in which a slow player receives a high speed score or a fast player receives a low score, or the stats indicating speed really have more to do with skill/experience than with speed: the player is skilled at knowing when to steal bases but is not necessarily faster than average, etc.
- We can ignore the influence of injuries, though it may be the case that some players experience no loss in speed or strength during their careers and thus have season and strength/speed positively correlated, because their careers ended in sudden injury before they faced a gradual decline.
- The population of baseball players with long careers remaining in the league is representative of the whole population of players. We assume minimal survivor bias, though it might be the case that, say, strong players who were never valued for their speed were allowed to continue their careers and decline in speed over time primarily because speed was never their asset, whereas all the other players are dropped from the league with the first sign of a loss in strength.
- While incorporating player age may help our analysis, career length and season should be sufficient for determining how the whole population of baseball players changes over time. That said, a player who enters the league at age 18 might see a very different strength/speed change over time than one who enters at age 26, so to account for age we would need to parse the question of “which lasts longer” a little more thoroughly.
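On the significance-testing caveat above: when no library implementation is available, a permutation test can be written with the standard library alone. The sketch below is one possible choice, not something the original analysis ran. It is a paired sign-flip permutation test of whether players’ speed correlations and strength correlations differ on average. The function name and the sample values are made up for illustration.

```python
import random


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test for mean(a - b) == 0.

    `a` and `b` are paired per player (e.g. speed and strength correlations).
    Returns an approximate p-value.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


# Made-up per-player correlations, for illustration only.
speed_cor = [-0.6, -0.4, -0.5, -0.1, -0.7, -0.3, -0.2, -0.5]
strength_cor = [0.1, 0.0, -0.1, 0.2, 0.1, -0.2, 0.3, 0.0]
p_value = paired_permutation_test(speed_cor, strength_cor)
print(p_value)  # small here, because every made-up difference has the same sign
```

Pairing by player means each player acts as his own control, so career length and era affect both sides of the comparison equally.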