About a week ago I began looking for a way to predict expected strikeout rates. Jump to the bottom to see the new expected strikeout rates for all 2009 pitchers with 30 or more innings, or keep reading if the process and the statistics interest you.
I initially gathered data from 2003-2009 for all the plate discipline and pitch result categories on FanGraphs and StatCorner. I ran a regression against K% and identified the significant variables. The adjusted R-squared was very high, and the model passed the general validity checks. I could essentially look at the results of a player’s pitches and tell you what his strikeout rate would be. Very powerful stuff.
But here comes the obligatory but.
As FreeZorilla pointed out, there were definitely some problems. First, there was some overlap among the variables, namely multicollinearity. This basically means that the independent variables are correlated with each other. It doesn’t ruin the model, but we can do better. Second, the model had limited practical use. It wasn’t comparing apples to apples, but rather apples to oranges. For example, one variable was swinging strikes as a % of total pitches, whereas another was the % of contact a hitter made on pitches thrown in the zone. I wanted to standardize the variables and make the model easier to understand and apply without taking away predictive power.
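A quick way to spot this kind of overlap is to check how strongly two candidate variables move together. Here is a minimal sketch in Python. The function name and the numbers are made up for illustration; they are not the actual FanGraphs or StatCorner data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for five pitchers (NOT real data).
swinging_strike_pct = [0.07, 0.09, 0.11, 0.08, 0.12]
contact_pct = [0.84, 0.80, 0.75, 0.82, 0.73]

r = pearson(swinging_strike_pct, contact_pct)
print(f"correlation = {r:.2f}")
```

A value close to +1 or -1 suggests the two variables carry much of the same information, which is the overlap described above.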
To solve these problems I decided to use only the results that can happen once a pitcher throws a baseball:
a ball, a swinging strike, a ball in play, a foul, or a called strike. Swinging strikes, balls in play, and fouls can each be broken down further into in the zone and out of the zone. Together these make up 100% of the possibilities when a pitcher throws a pitch. In my analysis I found it was better to use plain "In Play" and "Foul" rather than splitting them by zone. For swinging strikes, I found it better to differentiate between in-zone and out-of-zone swinging strikes.
Second, I expressed all of these metrics as a percentage of total pitches thrown. So now we can say that if a pitcher throws 2 percentage points more called strikes and 2 percentage points fewer balls, his strikeout rate should be about 1.8 percentage points higher.
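To make the "share of total pitches" idea concrete, here is a small sketch in Python. The category names and the counts are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def pitch_result_shares(counts):
    """Convert raw pitch-result counts into shares of total pitches."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pitches thrown")
    return {result: n / total for result, n in counts.items()}

# Hypothetical pitch counts for one pitcher (made-up numbers).
example_counts = {
    "ball": 360,
    "called_strike": 170,
    "swinging_strike_in_zone": 40,
    "swinging_strike_out_of_zone": 50,
    "foul": 180,
    "in_play": 200,
}

shares = pitch_result_shares(example_counts)
for result, share in shares.items():
    print(f"{result}: {share:.1%}")
```

Because every category is divided by the same total, the shares always add up to 100% and every variable is on the same scale.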
I also forced the "constant" in the regression formula to zero. A nonzero constant would not be logical here: if a pitcher throws no strikes, or no pitches at all, he will have a zero strikeout rate. The constant must be zero.
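For anyone curious what a regression with the constant forced to zero looks like mechanically, here is a rough sketch in plain Python. It is not the tool or the data I used. The function name, the two toy predictors, and the toy coefficients (0.5 and 3.0) are all assumptions made up for the example.

```python
def fit_through_origin(X, y):
    """Least-squares coefficients for y ~ X with no intercept term.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Toy data (NOT real): two predictors, target built with no constant.
X = [[0.17, 0.05], [0.15, 0.09], [0.19, 0.07], [0.16, 0.11]]
y = [0.5 * a + 3.0 * b for a, b in X]

coefs = fit_through_origin(X, y)
print(coefs)

# With no constant, all-zero inputs always predict zero.
zero_prediction = sum(c * 0.0 for c in coefs)
```

The only difference from an ordinary regression is that no column of ones is added to the predictors, so the fitted line is forced through the origin.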
Here is the formula. I rounded it to make it a bit easier:
The Adjusted R-Squared is: 91.4%
This essentially tells us how strong the model is, that is, how much of the variation in K% the model explains. This is a very strong relationship, as many consider just hitting 70% to be good. The adjusted figure also takes into account how many variables are in the model.
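For reference, here is a sketch of the textbook adjusted R-squared calculation in Python. It uses the standard formula for a model with an intercept. Statistics packages often compute R-squared differently when the constant is forced to zero, so treat this as an illustration of the idea and not a reproduction of the 91.4% figure. The function name and sample numbers are made up.

```python
def adjusted_r_squared(y, y_hat, n_predictors):
    """Textbook adjusted R-squared (centered, intercept-model formula)."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # unexplained
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total variation
    r2 = 1 - ss_res / ss_tot
    # Penalize for the number of predictors used.
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Made-up actual and predicted values (NOT real data).
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

adj = adjusted_r_squared(actual, predicted, n_predictors=1)
print(f"adjusted R-squared = {adj:.3f}")
```

The penalty term grows as more variables are added, so a variable only raises the adjusted figure if it explains enough extra variation to pay for itself.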
All of the independent variables in the formula are significant at a 99% level of confidence. "Ball%" did not meet that bar, which is why I threw it out. The model also passes the F-Test. Multicollinearity is now not really an issue. And most importantly, it passes the simple logic test.
Again, any and all help is appreciated. I’m definitely not done looking at this. For example, I’ve begun to look at "In zone Contact%" and "Out of zone Contact%" in place of In Play and Foul. Perhaps that will lead to a better model, since the pitcher has stronger control over those two metrics than over in play and foul.
Here are the 2009 expected strikeout rates. This is for all pitchers with about 30+ innings. Remember, the model was based on all qualified pitchers, so being able to predict this accurately for pitchers with such a small number of innings is very telling: it suggests the model holds up even for smaller sample sizes. Due to some constraints I'll just post the Rays, Yanks, and Red Sox here. If you want to see all pitchers with about 30+ IP, go to this link. The eK's, as well as all the components for all the pitchers, are in the link.
I highly recommend you check out the link. Some of the best info is in there.