Lately we've seen quite a few posts relating plate discipline and pitch results to walks and strikeouts. Intuitively this makes sense: what happens on each pitch should be strongly linked to strikeouts and walks.
This led me down the path of starting a project using these results, both plate discipline and pitch results, to formulate an equation via multiple regression that would predict expected strikeouts and expected unintentional walks. So far on this site we've only compared and contrasted a few of these results, and in reality there are quite a few. I'm sure some haven't even been measured yet that may have a strong impact, and I'm not even totally sure if I was able to grab them all.
This is essentially just the start of the project, and I'm not totally sure whether the end results will be good or bad. I'm sure there are independent variables I missed and quite a few that could be removed. There are tons of possible combinations, and tons of tests to run to make sure the model is actually okay to use. So if you want to play around, offer suggestions, or help in any way, please do.
That being said I did find two pretty solid equations. We certainly can improve, but I don't think the results will change that much.
Here are the results. I know many of you don't need or want to get into the statistical stuff and are just interested in what this really means. Essentially, eK and euBB are based upon certain pitch-level results (13 possible), including called strikes, first pitch strikes, fouls, out-of-zone contact, etc.
[Table: model fit by years (qualified pitchers). Columns: Adj R-Squared, MAPE, MSE, RMSE]
[Table: 2009 Notable Rays Players]
[Table: 2009 Other League Notables]
* There is no JP Howell data for 2009 on StatCorner, which is why he isn't here.
** Both models are pretty accurate, and eK% is very accurate. The euBB% also seems to be biased towards negative errors. This is something that would have to be fixed (which is why help would be great).
All in all, I've accumulated 13 independent variables across 2003-2009, plus the dependent variable for each model: K% and uBB%. I ran my regressions on data for qualified pitchers (+/- a few) from 2003-2008, using 2009 as a test or holdout period.
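For anyone who wants to repeat the holdout check outside a spreadsheet, here is a minimal sketch of the split and the error measures. The `season` key, the function names, and the sign convention (error = actual minus predicted) are assumptions for illustration, not something taken from the workbooks.

```python
from math import sqrt

def split_by_season(rows, holdout_season=2009):
    """Split pitcher-season rows into a fitting sample and a holdout sample.

    Each row is assumed to be a dict with a "season" key (hypothetical layout).
    """
    train = [r for r in rows if r["season"] < holdout_season]
    holdout = [r for r in rows if r["season"] == holdout_season]
    return train, holdout

def holdout_errors(actual, predicted):
    """Summary error measures, with error defined as actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        # A mean error far from zero signals a biased model.
        "mean_error": sum(errors) / n,
        # Mean absolute percentage error, as a fraction of the actual value.
        "MAPE": sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "MSE": mse,
        "RMSE": sqrt(mse),
    }
```

The mean error is included because it is a quick way to see the kind of one-sided bias mentioned for euBB%.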
I believe I found the highest Adj R-squared for both models. Both equations use only 11 of the 13 independent variables. I'll link the workbooks at the end, so if you want to look over the models and statistics they will be there. I also included the numbers for a bunch of different tests, so feel free to check them out (I really haven't looked very hard at them yet).
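As a reminder of why Adj R-squared is the yardstick when models use different numbers of variables, here is a small sketch of the standard adjustment, plus a count of how many variable combinations exist. The function name and arguments are illustrative assumptions.

```python
from math import comb

def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: penalizes R-squared for each added predictor."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# With 13 candidate variables there are 2**13 - 1 non-empty subsets,
# and comb(13, 11) of them use exactly 11 variables.
total_subsets = 2 ** 13 - 1
subsets_of_eleven = comb(13, 11)
```

Adding a variable never lowers plain R-squared, so comparing an 11-variable model to a 13-variable one on plain R-squared would always favor the larger model.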
Here are the two equations that I believe had the highest Adj R-Sq:
K = 0.34523 + ( (Ball) * -0.092208 ) + ( (ClStr) * 0.642177 ) + ( (SwStr) * 1.35 ) + ( (Foul) * 0.981356 ) + ( (InPly) * -0.343883 ) + ( (Oswing) * -0.015719 ) + ( (Zswing) * -0.146531 ) + ( (Swing) * -0.42555 ) + ( (Ocontact) * -0.038438 ) + ( (Contact) * -0.184088 ) + ( (Fstrike) * -0.000762 )
uBB = 0.58193 + ( (Ball) * 0.05506 ) + ( (ClStr) * -0.443504 ) + ( (SwStr) * -0.303051 ) + ( (Foul) * 0.092248 ) + ( (InPly) * -0.352155 ) + ( (Oswing) * -0.055224 ) + ( (Swing) * -0.366769 ) + ( (Ocontact) * 0.005447 ) + ( (Contact) * -0.173878 ) + ( (Zone) * -0.043222 ) + ( (Fstrike) * -0.053383 )
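If you would rather plug numbers in with code than with the workbook, here is a minimal sketch of the two equations. The coefficients are copied from above; the dictionary keys and function names are just labels I chose to match the equations. One assumption to flag loudly: the inputs are treated as fractions (0.085 for 8.5%), which is what the size of the intercepts suggests, but check that against the workbooks before relying on it.

```python
# Coefficients copied from the two equations in the post.
EK_INTERCEPT = 0.34523
EK_COEFS = {
    "Ball": -0.092208, "ClStr": 0.642177, "SwStr": 1.35,
    "Foul": 0.981356, "InPly": -0.343883, "Oswing": -0.015719,
    "Zswing": -0.146531, "Swing": -0.42555, "Ocontact": -0.038438,
    "Contact": -0.184088, "Fstrike": -0.000762,
}

EUBB_INTERCEPT = 0.58193
EUBB_COEFS = {
    "Ball": 0.05506, "ClStr": -0.443504, "SwStr": -0.303051,
    "Foul": 0.092248, "InPly": -0.352155, "Oswing": -0.055224,
    "Swing": -0.366769, "Ocontact": 0.005447, "Contact": -0.173878,
    "Zone": -0.043222, "Fstrike": -0.053383,
}

def linear_estimate(intercept, coefs, rates):
    """Intercept plus coefficient * rate for every variable in the equation."""
    return intercept + sum(coef * rates[name] for name, coef in coefs.items())

def expected_k(rates):
    """eK from a dict of rates (assumed to be fractions, not whole percents)."""
    return linear_estimate(EK_INTERCEPT, EK_COEFS, rates)

def expected_ubb(rates):
    """euBB from a dict of rates (assumed to be fractions, not whole percents)."""
    return linear_estimate(EUBB_INTERCEPT, EUBB_COEFS, rates)
```

Because both equations are linear, a what-if is simple arithmetic: holding everything else fixed, raising SwStr by one percentage point (0.01) moves eK by 1.35 * 0.01 = 0.0135. In practice the inputs move together (more swinging strikes means fewer of something else), so a single-variable change is only a rough guide.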
Like I said before I'm sure there is a better equation out there. I'm sure something simpler as well. Feel free to mess around. I've mixed and matched a bit and in fact I did find another "K" equation from the 03-08 data that actually fits the 09 data better than the equation above. The difference isn't huge though.
These are the independent variables that I used (all in percent):
Balls, Cl Strike, Sw Strike, Foul, In Play, O Swing, Z Swing, Swing, O Contact, Z Contact, Contact, Zone, F Strike
Correlation matrix for K% (split up into two for easier viewing):
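[Correlation matrix for K%, part 1. Columns: K, Ball, Cl Strike, Sw Strike, Foul, In Play, O Swing]
[Correlation matrix for K%, part 2. Columns: Z Swing, Swing, O Contact, Z Contact, Contact, Zone, F Strike]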
What is notable:
The correlations pretty much make sense. Swinging strikes are highly correlated with K's. Anything to do with contact, especially balls in play, is highly negatively correlated. What is really interesting is that called strikes, F-Strike, and Zone aren't what I originally expected. First, called strikes are barely negatively correlated. That's a bit strange, but perhaps it makes sense on some level. Or perhaps it points to a problem with the model. Thoughts? I would have thought Zone and F-Strike would have had a higher correlation. In a way it makes sense, though: if you are throwing in the zone (especially for a first pitch strike), there is a higher chance of a ball in play, which eliminates the K potential.
Correlation matrix for uBB% (split up into two for easier viewing):
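[Correlation matrix for uBB%, part 1. Columns: uBB, Ball, Cl Strike, Sw Strike, Foul, In Play, O Swing]
[Correlation matrix for uBB%, part 2. Columns: Z Swing, Swing, O Contact, Z Contact, Contact, Zone, F Strike]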
What is notable:
Well, for the most part the obvious things hold true. Balls are highly correlated. Swings and contact are, for the most part, negatively correlated. A key to limiting BB would be throwing first pitch strikes. That is obviously very intuitive, but the huge negative correlation bears it out.
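For anyone rebuilding the correlation matrices from the data set, here is a minimal sketch that computes the correlation between one target column and each variable. It assumes the matrices are ordinary Pearson correlations and that the data is held as a dict of equal-length lists; both are assumptions, as are the function names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlations_with(target, columns):
    """Correlation of the target column with every other column.

    `columns` maps a column name to a list of values (hypothetical layout).
    """
    return {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
```

Running `pearson` between pairs of independent variables is also worth doing: several of them (Swing, Contact, and their O/Z splits) measure overlapping things, and strongly related inputs can make individual regression coefficients hard to interpret.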
I think I'm going to limit this post to just this. I'll answer whatever question I can in the comment section. And I do have quite a few comments on the players themselves, but I wanted to save that for comments.
All of the data, as well as the audit for the regressions, will be linked just below. Check them out. Once I hear some suggestions, thoughts, and opinions I'll know what step should be taken next if any step at all.
If you want to play around with the notable and Rays pitchers (for example, changing a value for any independent variable for a specific pitcher to see what would happen to their expected rates), click here:
If you want to look at the regression, the audit, and all the regression statistics, as well as the results of applying the equations to both the sample and the holdout period, click here:
If you want to run your own regressions on the data set (to add a variable you have to find the data and add it to the sheet; deleting one is simple), click here:
For anyone interested, the largest problem with this project was easily collecting the data from FanGraphs and StatCorner. Once I consolidated the data, running the regressions and testing on the holdout period was quite easy. Of course, with the sheer quantity of combinations, testing everything would be highly time consuming.
My fantasy would be to be able to create an accurate eK or euBB based upon these sorts of variables and then be able to plug them in as part of an expectedFIP.