Updated Expected Strikeouts based on Pitch Result
A week or so ago I began a search to find a way to predict and/or determine expected strikeout rates. Jump to the bottom to see the new expected strikeout rates for all 2009 pitchers with 30 or more innings, and continue reading if the process and the statistics interest you.
I initially gathered data from 2003-2009 for all the plate discipline and pitch result categories on FanGraphs and StatCorner. I ran a regression against K% and found the significant variables. The adjusted R-squared was very high, and the model passed the usual validity checks. I could essentially look at the results of a player’s pitches and tell you what his strikeout rate would be. Very powerful stuff.
But here comes the obligatory but.
As FreeZorilla pointed out, there were definitely some problems. First, there was some overlap among the inputs, namely multicollinearity. This basically means that the independent variables are correlated with each other. It doesn’t ruin the model, but we can do better. Second, the model had limited practical use. It wasn’t comparing apples to apples, but rather apples to oranges. For example, one variable was swinging strikes as a % of total pitches, whereas another was the % of contact a hitter made on pitches thrown in the zone. I wanted to standardize the inputs and make the model easier to understand and apply without taking away predictive power.
To solve these problems I decided to only use the results that can happen once a pitcher throws a baseball: a ball, a called strike, a swinging strike, a foul, or a ball in play. For swinging strikes, balls in play, and fouls we can also break it down between in the zone and out of the zone. Together these make up 100% of the possibilities when a pitcher throws a pitch. In my analysis I found it was better to just use “In Play” and “Foul” without splitting them into in and out of zone. For swinging strikes I found it better to differentiate between in-zone and out-of-zone swinging strikes.
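Second, I expressed all of these metrics as a percentage of total pitches thrown. So now we can say that if a pitcher throws 2 percentage points more called strikes and 2 percentage points fewer balls, his strikeout rate should be about 1.8 percentage points higher (2 * 0.9, the called-strike coefficient in the formula below).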
I also set the “constant” in the regression formula to zero. A nonzero constant is not logical here: if a pitcher throws no strikes, or no pitches at all, he will have a strikeout rate of zero. The constant must be zero.
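One way to see what forcing the constant to zero means in practice is a least-squares fit with no intercept column. The sketch below is only an illustration on synthetic, noise-free data, not the fit behind this post: the rows are made-up pitch-result rates, the targets are manufactured from the rounded coefficients in the formula below, and `fit_through_origin` is a hypothetical helper name.

```python
import random

# Rounded coefficients from the post's formula, used here only to
# manufacture synthetic targets (this is NOT the original data set).
TRUE_COEFS = [0.9, 0.5, -0.9, 1.1, 1.5]  # ClStr, Foul, InPly, InZSwStr, OZSwStr


def fit_through_origin(X, y):
    """Least squares with the intercept fixed at zero.

    Solves the normal equations (X^T X) b = X^T y by Gauss-Jordan
    elimination with partial pivoting. No column of ones is added,
    which is what forces the constant to be zero.
    """
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        # Clear this column in every other row.
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(0)
# Made-up rates (percent of total pitches) in plausible-looking ranges.
ranges = [(14, 21), (14, 21), (15, 23), (1, 5), (2, 8)]
X = [[rng.uniform(lo, hi) for lo, hi in ranges] for _ in range(60)]
y = [sum(c * x for c, x in zip(TRUE_COEFS, row)) for row in X]

coefs = fit_through_origin(X, y)
print([round(c, 3) for c in coefs])
```

Because the synthetic targets contain no noise, the fit should hand back the coefficients it was built from; with real pitcher seasons the estimates would carry error and would need the usual significance checks.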
Here is the formula. I rounded it to make it a bit easier:
K% = (ClStr% * 0.9) + (Foul% * 0.5) + (InPly% * -0.9) + (InZSwStr% * 1.1) + (OZSwStr% * 1.5)
The Adjusted R-Squared is: 91.4%
This is essentially a measure of how strong the model is: how much of the variation in K% the model explains. This is a very strong relationship, as many consider just hitting 70% to be good. The adjusted figure also takes into account how many variables are in the model.
All of the independent variables in the formula are significant at the 99% confidence level. “Ball%” did not meet that bar, which is why I threw it out. The model also passes the F-test, multicollinearity is no longer really an issue, and most importantly it passes the simple logic test.
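Since the formula is linear, the effect of moving pitches from one result to another is just the coefficients times the change. A minimal sketch using the rounded coefficients above (the `k_change` helper is a hypothetical name, and the inputs are changes in percentage points of total pitches):

```python
# Rounded coefficients from the formula above.
COEFS = {
    "ClStr": 0.9,
    "Foul": 0.5,
    "InPly": -0.9,
    "InZSwStr": 1.1,
    "OZSwStr": 1.5,
}


def k_change(deltas):
    """Change in expected K% for a dict of changes in pitch-result rates.

    Ball% has no term in the formula, so pitches taken out of the ball
    bucket only show up through the category they move into.
    """
    return sum(COEFS[name] * delta for name, delta in deltas.items())


# Two points of balls turned into called strikes.
print(round(k_change({"ClStr": 2.0}), 2))  # 1.8

# One point of balls in play turned into out-of-zone swinging strikes.
print(round(k_change({"InPly": -1.0, "OZSwStr": 1.0}), 2))  # 2.4
```

The first case reproduces the 1.8-point example given earlier; the second shows that a trade between two categories picks up both coefficients.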
Again, any and all help is appreciated. I’m definitely not done looking at this. For example, I’ve begun to look at “In zone Contact%” and “Out of zone Contact%” in place of In play and Foul. Perhaps that will lead to a better model, since the pitcher has stronger control over those two metrics than over in play and foul.
Here are the 2009 expected strikeouts. This is for all pitchers with about 30+ innings. Remember the model was built on qualified pitchers only, so being able to predict this accurately for pitchers with so few innings is very telling; it suggests the model holds up even in smaller samples. Due to some constraints I'll just post the Rays, Yanks, and Red Sox here. If you want to see all pitchers with about 30+ IP go to this link. The eK's as well as all the components for all the pitchers are in the link.
I highly recommend you check out the link. Some of the best info is in there.
Expected Strikeouts for 2009 pitchers with 30+ innings as well as component data
| Name | K% | eK% | Difference |
| Andy Sonnanstine | 13.81% | 15.45% | 1.64% |
| David Price | 23.04% | 21.46% | -1.58% |
| Grant Balfour | 23.46% | 20.75% | -2.71% |
| J.P. Howell | 28.41% | 29.85% | 1.44% |
| James Shields | 16.79% | 15.27% | -1.52% |
| Jeff Niemann | 13.49% | 14.18% | 0.69% |
| Joe Nelson | 20.78% | 21.48% | 0.70% |
| Lance Cormier | 13.26% | 14.32% | 1.06% |
| Matt Garza | 21.17% | 18.89% | -2.28% |
| Scott Kazmir | 16.89% | 17.88% | 0.99% |
| Brad Penny | 14.83% | 14.70% | -0.13% |
| Daisuke Matsuzaka | 19.21% | 18.56% | -0.65% |
| Hideki Okajima | 24.36% | 23.65% | -0.71% |
| Jon Lester | 27.41% | 24.60% | -2.81% |
| Jonathan Papelbon | 24.12% | 22.76% | -1.36% |
| Josh Beckett | 22.07% | 21.08% | -0.99% |
| Justin Masterson | 17.95% | 19.02% | 1.07% |
| Manny Delcarmen | 15.79% | 19.17% | 3.38% |
| Ramon Ramirez | 17.53% | 22.33% | 4.80% |
| Tim Wakefield | 12.81% | 12.30% | -0.51% |
| A.J. Burnett | 21.91% | 19.63% | -2.28% |
| Alfredo Aceves | 21.80% | 19.04% | -2.76% |
| Andy Pettitte | 14.67% | 16.15% | 1.48% |
| CC Sabathia | 18.13% | 18.23% | 0.10% |
| Chien-Ming Wang | 12.72% | 14.35% | 1.63% |
| Joba Chamberlain | 19.40% | 19.01% | -0.39% |
| Mariano Rivera | 30.07% | 24.75% | -5.32% |
| Phil Coke | 20.39% | 19.90% | -0.49% |
| Phil Hughes | 18.71% | 19.92% | 1.21% |
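As a rough check on how closely eK% tracks K% for these three teams, the rows above can be summarized with a mean absolute error. This is only a sketch over the 29 pitchers shown in the table, not the full 30+ IP list from the link, and the variable names are illustrative.

```python
# (name, K%, eK%) copied from the table above: Rays, Red Sox, Yankees.
ROWS = [
    ("Andy Sonnanstine", 13.81, 15.45), ("David Price", 23.04, 21.46),
    ("Grant Balfour", 23.46, 20.75), ("J.P. Howell", 28.41, 29.85),
    ("James Shields", 16.79, 15.27), ("Jeff Niemann", 13.49, 14.18),
    ("Joe Nelson", 20.78, 21.48), ("Lance Cormier", 13.26, 14.32),
    ("Matt Garza", 21.17, 18.89), ("Scott Kazmir", 16.89, 17.88),
    ("Brad Penny", 14.83, 14.70), ("Daisuke Matsuzaka", 19.21, 18.56),
    ("Hideki Okajima", 24.36, 23.65), ("Jon Lester", 27.41, 24.60),
    ("Jonathan Papelbon", 24.12, 22.76), ("Josh Beckett", 22.07, 21.08),
    ("Justin Masterson", 17.95, 19.02), ("Manny Delcarmen", 15.79, 19.17),
    ("Ramon Ramirez", 17.53, 22.33), ("Tim Wakefield", 12.81, 12.30),
    ("A.J. Burnett", 21.91, 19.63), ("Alfredo Aceves", 21.80, 19.04),
    ("Andy Pettitte", 14.67, 16.15), ("CC Sabathia", 18.13, 18.23),
    ("Chien-Ming Wang", 12.72, 14.35), ("Joba Chamberlain", 19.40, 19.01),
    ("Mariano Rivera", 30.07, 24.75), ("Phil Coke", 20.39, 19.90),
    ("Phil Hughes", 18.71, 19.92),
]

# Difference is eK% minus K%, matching the table's last column.
diffs = {name: ek - k for name, k, ek in ROWS}

mae = sum(abs(d) for d in diffs.values()) / len(diffs)
worst = max(diffs, key=lambda name: abs(diffs[name]))
within_2 = sum(abs(d) <= 2.0 for d in diffs.values())

print(f"pitchers: {len(diffs)}")
print(f"mean absolute error: {mae:.2f} points")
print(f"largest miss: {worst} ({diffs[worst]:+.2f})")
print(f"within 2 points: {within_2} of {len(diffs)}")
```

On these rows the average miss comes out to roughly 1.6 percentage points, with Rivera the largest. A positive difference means the model expects more strikeouts than the pitcher has recorded so far.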
Comments
Bumped to the FP.
That R-Squared is quite impressive.
by R.J. Anderson on Jul 21, 2009 11:51 AM EDT reply actions 0 recs
Rec'd
Wow Rivera’s K% is 30%+
Can you explain again why Ball% was thrown out? It would seem this would be important.
Follow Me on Twitter @FreeZorilla
by FreeZorilla on Jul 21, 2009 11:57 AM EDT reply actions 0 recs
A few reasons
The coefficient was extremely small, and its t-statistic was really low. If I used smaller samples (say one or two years), balls sometimes became more of a factor, but were still largely irrelevant. The largest coefficient I saw was only about 2%. Either way it comes down to the t-test. Basically, based upon the sample size and the coefficient, the t-test determines how confident we are that the coefficient is not zero. I was using 99% confidence and balls didn’t pass the test. If left in, the coefficient would have been about .005, which compared to the others is largely irrelevant.
I think the reason is that the other 5 metrics add up to every result other than a ball, so Ball% is just what is left over from 100%. Indirectly the balls are being factored in.
by matthan on Jul 21, 2009 12:04 PM EDT up reply actions 0 recs
thanks, makes sense
by FreeZorilla on Jul 21, 2009 12:18 PM EDT up reply actions 0 recs
Yeah it is pretty strong
I also checked every other regression test. The F-Test was good, and all the variables have huge t-values. There may be some minor issues, mainly in the swinging strike area, but this is a good representation.
by matthan on Jul 21, 2009 11:57 AM EDT reply actions 0 recs
Do any/all of these components stabilize faster than K%? If so then this is really gangbusters.
by Tommy Bennett on Jul 21, 2009 12:05 PM EDT reply actions 0 recs
I haven't really looked at this explicitly
What I did notice is that sample size doesn’t really matter with this formula. I built it by using qualified pitchers, but then applied it to all 2009 pitchers with 30 IP. The errors barely increased.
This leads me to believe that I can perhaps come up with an even more accurate formula if I used 2003-2008 data for say 30+ IP instead of qualified pitchers, but for now this will do.
by matthan on Jul 21, 2009 12:16 PM EDT up reply actions 0 recs
contact rate stabilizes faster
so the miss rate should do the same.
THIS STORY ONLY ENDS ONE WAY
by colintj on Jul 21, 2009 3:39 PM EDT up reply actions 0 recs
Since today is Jeff's turn I'd like to focus on swinging strikes and Mr. Niemann
These are his current components:
ClStr: 16.4%
Foul: 19.3%
InPly: 19.7%
InZSwStr: 1.8%
OZSwStr: 3.68%
Basically he gets close to no swinging strikes. If you take a look at the link he is definitely near the bottom in swinging strikes for the league. If you take a look at the formula swinging strikes are extremely important. If he just boosts his swinging strikes by 1% he could increase his K rate by quite a bit.
For example, if he can turn 1.5% of his balls into out-of-zone swinging strikes, his expected K rate would increase by 2.25%. He would then be expected to strike out nearly 16.5% of batters, which is much more respectable.
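The arithmetic behind that what-if can be sketched by plugging the components listed above into the rounded formula from the post. The function name is a hypothetical helper, and the scenario simply adds 1.5 points to out-of-zone swinging strikes while leaving the other five terms alone (balls carry no coefficient):

```python
# Rounded coefficients from the post's formula.
COEFS = {"ClStr": 0.9, "Foul": 0.5, "InPly": -0.9, "InZSwStr": 1.1, "OZSwStr": 1.5}

# Niemann's 2009 components as listed in this comment (percent of pitches).
niemann = {"ClStr": 16.4, "Foul": 19.3, "InPly": 19.7, "InZSwStr": 1.8, "OZSwStr": 3.68}


def expected_k_pct(rates):
    """Expected K% from pitch-result rates, all in percent of total pitches."""
    return sum(COEFS[name] * rates[name] for name in COEFS)


current = expected_k_pct(niemann)

# What-if: 1.5 points of balls become out-of-zone swinging strikes.
what_if = dict(niemann, OZSwStr=niemann["OZSwStr"] + 1.5)
boosted = expected_k_pct(what_if)

print(round(current, 2), round(boosted, 2))  # 14.18 16.43
```

The first number matches the 14.18% eK% shown for Niemann in the table in the post, and the second is the "nearly 16.5%" figure.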
Then the question is how to get that type of swinging strike. Well, a good first step would be pitch selection. As RJ has hammered home for a while: curveball. I’m just guessing, but more curveballs could very well do the trick.
by matthan on Jul 21, 2009 12:13 PM EDT reply actions 0 recs
Great point.
Do you have the league averages handy? Maybe how each Rays starter compares to league average?
by rglass44 on Jul 21, 2009 12:15 PM EDT up reply actions 0 recs
On the link I have all pitchers with 30+ IP for 2009
We could average them out to determine the “average” pitcher. The problem though is it wouldn’t be adjusted based upon innings pitched.
I just did this on the link and this is what I got for the average pitcher in 2009 (30+ IP):
ClStr: 17.7%
Foul: 17.4%
InPly: 18.9%
InZSwStr: 2.7%
OZSwStr: 4.9%
Avg K%: 18.1%
Avg eK%: 17.9%
by matthan on Jul 21, 2009 12:18 PM EDT up reply actions 0 recs
A few notes:
Very interesting that the variable with the most weight is out-of-zone swinging strike %. Just goes to show that getting guys to chase bad pitches is key. This is why the Price/Kaz sliders are so key to their success.
I’m surprised how heavily called strike % affects it, since I would guess this variable is relatively similar from pitcher to pitcher.
by rglass44 on Jul 21, 2009 12:13 PM EDT reply actions 0 recs
Called strikes are often an indicator of working ahead in the count, which can lead to more out-of-zone swinging strikes
/Navi’d
by FreeZorilla on Jul 21, 2009 12:22 PM EDT up reply actions 0 recs
I seem to remember reading that called strike % was relatively stable across pitchers.
That could be wrong, though.
by rglass44 on Jul 21, 2009 12:24 PM EDT up reply actions 0 recs
yea that could make sense
I guess I was referring to first pitch called strikes
by FreeZorilla on Jul 21, 2009 12:25 PM EDT up reply actions 0 recs
Here are the standard deviations for 2009
ClStr: 1.8%
Foul: 2.2%
InPly: 2.3%
InZSwStr: 1%
OZSwStr: 1.8%
The differences across pitchers aren’t large, but small changes definitely do have a significant impact on K rates.
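Combining these standard deviations with the rounded coefficients from the post gives a rough sense of how far a one-standard-deviation change in each component moves expected K%. This is only a sketch, and it assumes each component can be moved on its own, which is not strictly true since all pitch results sum to 100%:

```python
# Rounded coefficients from the post's formula.
COEFS = {"ClStr": 0.9, "Foul": 0.5, "InPly": -0.9, "InZSwStr": 1.1, "OZSwStr": 1.5}

# 2009 standard deviations across pitchers, from the list above (in points).
SD_2009 = {"ClStr": 1.8, "Foul": 2.2, "InPly": 2.3, "InZSwStr": 1.0, "OZSwStr": 1.8}

# Effect on eK% of a one-standard-deviation increase in each component.
effects = {name: COEFS[name] * SD_2009[name] for name in COEFS}

# Print the largest movers first.
for name, effect in sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:9s} {effect:+.2f}")
```

Read this way, out-of-zone swinging strikes are still the biggest lever, but called strikes and balls in play matter about as much per standard deviation because pitchers differ more on them than on in-zone whiffs.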
by matthan on Jul 21, 2009 1:03 PM EDT up reply actions 0 recs
At first glance it seems that turning your foul balls into swinging strikes makes a huge difference
Price, Kaz, and Niemann please take notice
I can't help that I make some things look easier than they really are.
by Sandy Kazmir on Jul 21, 2009 1:10 PM EDT up reply actions 0 recs
The question is: how?
Is it location? Pitch selection? Tougher batter quality?
by R.J. Anderson on Jul 21, 2009 1:12 PM EDT up reply actions 0 recs
Pitch selection would seem the biggest changer to me.
You can move location in and out of zone, but if you go out you run the risk of a guy taking and if you don’t command properly you run the risk of it getting hit hard. Foul balls happen when a guy is expecting a certain pitch, but the pitcher puts it in a good spot. For example, a guy is sitting fastball and gets it, but it’s low and away so he can only foul it off, or a good diving curve that he only gets a piece. I base this thought on the fact that with most foul balls the batter has the timing down. If a pitcher were to mix speeds better I think he could decrease foul balls and convert those into swinging strikes. On a related note, Price’s change was gorgeous last night, he made batters look foolish when he finally had the guts to go to it.
by Sandy Kazmir on Jul 21, 2009 1:24 PM EDT up reply actions 0 recs
Good points.
Changing speeds, and trying to get guys to chase. To do that you have to be ahead in the count. You also have to have a swing-and-miss pitch.
by rglass44 on Jul 21, 2009 1:26 PM EDT up reply actions 0 recs
Great work Matt
by Sandy Kazmir on Jul 21, 2009 1:04 PM EDT reply actions 0 recs
If interested here are the qualified Rays pitchers from 2003-2008
| Last | First | Year | K% | eK% | Difference |
| Sonnanstine | Andy | 2008 | 15.14% | 14.88% | -0.26% |
| Jackson | Edwin | 2008 | 13.64% | 14.96% | 1.32% |
| Shields | James | 2008 | 18.24% | 14.41% | -3.83% |
| Garza | Matt | 2008 | 16.58% | 13.62% | -2.96% |
| Kazmir | Scott | 2008 | 20.06% | 22.24% | 2.18% |
| Shields | James | 2007 | 21.05% | 15.82% | -5.23% |
| Kazmir | Scott | 2007 | 26.94% | 19.59% | -7.36% |
| Fossum | Casey | 2005 | 17.66% | 23.90% | 6.24% |
| Hendrickson | Mark | 2005 | 11.18% | 10.25% | -0.93% |
| Kazmir | Scott | 2005 | 21.27% | 28.01% | 6.74% |
| Hendrickson | Mark | 2004 | 10.83% | 15.05% | 4.22% |
| Zambrano | Victor | 2003 | 15.79% | 16.89% | 1.10% |
by matthan on Jul 21, 2009 1:23 PM EDT reply actions 0 recs
A look at Rays pitchers with more than 1 qualified year
Garza 08-09
Kazmir 05-09 (no 06)
Shields 06-09
Sonnanstine 08-09
The average K% for those pitchers is 18.90%. The average eK% is 17.5%.
According to this, our pitchers have out-performed their underlying numbers.
The standard deviation for K% is 3.7%, whereas for eK% it is 3.4%.
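These summary figures can be reproduced from the per-pitcher seasons posted in the replies in this thread. The sketch below assumes the 11 seasons listed there (Shields appears for 2007-2009 only), uses the 24.21% figure shown for Kazmir's 2005 eK% in his reply, and uses the sample standard deviation:

```python
from statistics import mean, stdev

# (K%, eK%) for each season listed in the per-pitcher replies.
SEASONS = {
    ("Garza", 2009): (21.17, 18.89),
    ("Garza", 2008): (16.58, 13.62),
    ("Kazmir", 2009): (16.89, 17.88),
    ("Kazmir", 2008): (20.06, 22.24),
    ("Kazmir", 2007): (26.94, 19.59),
    ("Kazmir", 2005): (21.27, 24.21),
    ("Shields", 2009): (16.79, 15.27),
    ("Shields", 2008): (18.24, 14.41),
    ("Shields", 2007): (21.05, 15.82),
    ("Sonnanstine", 2009): (13.81, 15.45),
    ("Sonnanstine", 2008): (15.14, 14.88),
}

k = [season[0] for season in SEASONS.values()]
ek = [season[1] for season in SEASONS.values()]

# statistics.stdev is the sample (n - 1) standard deviation.
print(f"K%:  mean {mean(k):.2f}, stdev {stdev(k):.1f}")
print(f"eK%: mean {mean(ek):.2f}, stdev {stdev(ek):.1f}")
```

With these inputs the means and standard deviations line up with the 18.90% / 17.5% and 3.7% / 3.4% figures quoted here. Swapping in the 28.01% value from the 2003-2008 table for Kazmir's 2005 would raise the eK% average.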
by matthan on Jul 21, 2009 1:33 PM EDT reply actions 0 recs
Sonnanstine
K% eK%
2009 13.81% 15.45%
2008 15.14% 14.88%
by matthan on Jul 21, 2009 1:34 PM EDT up reply actions 0 recs
Shields
K% eK%
2009 16.79% 15.27%
2008 18.24% 14.41%
2007 21.05% 15.82%
by matthan on Jul 21, 2009 1:34 PM EDT up reply actions 0 recs
This is pretty surprising to me
Based on the model Shields is expected to strike out around 15-15.5% of batters. He has been pretty consistent on that front. However in reality he has been far better than that, but has been declining every single year. This could be a sign of serious regression. Perhaps Shields is more of the 16-17% K guy than the 18-21% he has shown in the past?
by matthan on Jul 21, 2009 1:41 PM EDT up reply actions 0 recs
Kazmir
K% eK%
2009 16.89% 17.88%
2008 20.06% 22.24%
2007 26.94% 19.59%
2005 21.27% 24.21%
by matthan on Jul 21, 2009 1:35 PM EDT up reply actions 0 recs
Actually the 2005 eK rate is quite a bit higher
28%, not sure what happened
Either way, both the actual K% and the eK% have been all over the map for Kaz
by matthan on Jul 21, 2009 1:46 PM EDT up reply actions 0 recs
Garza
K% eK%
2009 21.17% 18.89%
2008 16.58% 13.62%
by matthan on Jul 21, 2009 1:35 PM EDT up reply actions 0 recs
Aside from Mark Hendrickson, Garza's 2008 was the worst in terms of inducing strikeouts for any qualified Rays SP since 2003
He has no doubt improved big time this year.
by matthan on Jul 21, 2009 1:43 PM EDT up reply actions 0 recs
You've done a lot of good work
And these Rays specific comments are interesting. Maybe a separate post of Rays analysis when you get the chance? You could leave the science out at this point and just link to this with Rays analysis.
by FreeZorilla on Jul 21, 2009 1:51 PM EDT up reply actions 0 recs
Yeah that sounds like a good idea
I may tweak the formula a bit here and there (by using a slightly different sample), but nothing substantial. So I think the next step would be to take a look at the Rays pitchers.
by matthan on Jul 21, 2009 1:58 PM EDT up reply actions 0 recs
One issue.
So you were using strictly percentages and not total strikes, swinging strikes, etc.? Might it be a bit more illuminating if you used the actual amounts, so a pitcher with 32 IP doesn’t weigh the same as one with 235?
by rglass44 on Jul 21, 2009 2:05 PM EDT reply actions 0 recs
I don't think this is helpful.
We’re talking about a rate metric here. Yes, the guys with 235 innings have more stable rates, but people should know this.
by R.J. Anderson on Jul 21, 2009 2:12 PM EDT up reply actions 0 recs
In building the model I only used qualified pitchers from 2003-2008
For 2009 I just used the model based upon the qualified pitchers. So the disparity of IP doesn’t really factor in. I’m sure some guys at low IP will have a higher error than guys with tons of IP and stable rates. Although just from looking at how it applies to 2009 it doesn’t appear the increase in error due to low amount of innings is that great.
by matthan on Jul 21, 2009 2:16 PM EDT up reply actions 0 recs
Do you not think it would get a better picture though?
It wouldn’t be hard to include, I wouldn’t think.
Not doubting the validity, but just thinking of ways to make it better.
by rglass44 on Jul 21, 2009 2:37 PM EDT up reply actions 0 recs
Sandy Kazmir mentioned something about Howell
If you look here, JP has an expected K rate over 5% higher than anyone else on the Rays, Yanks or Red Sox. He is really, really good and is doing exactly as expected.
by matthan on Jul 21, 2009 10:49 PM EDT reply actions 0 recs
Yeah almost 30% is good
by Sandy Kazmir on Jul 21, 2009 10:55 PM EDT up reply actions 0 recs
Have you looked at which pitchers' eK% is furthest ahead of their actual K%?
by FreeZorilla on Jul 22, 2009 9:48 AM EDT up reply actions 0 recs
Biggest expected decliners
Aardsma: 29.4% to 22.8%…drop of about 6.6%
Rafael Soriano: 34.7% to 28.8% drop of about 5.9%
Mariano Rivera: 30% to 24.75% about a 5.3% drop
Greinke: 25.5% to 20.3% about a 5.1% drop
by matthan on Jul 22, 2009 4:36 PM EDT up reply actions 0 recs
How awesome that JP is sustainable while aardsma and rivera are not
by FreeZorilla on Jul 22, 2009 4:47 PM EDT up reply actions 0 recs
Biggest expected gainers
Mark Difelice: 22.8 to 30.7 about a 7.9% increase
Cla Meredith: 12.2 to 17.8 about a 5.6% increase
Ramon Ramirez: 17.5 to 22.3
Tommy Hanson: 14.2 to 18.4
by matthan on Jul 22, 2009 4:38 PM EDT up reply actions 0 recs
Top overall eK%
1. Broxton 37.5%
2. Wuertz 35.1%
3. Mark Difelice 30.7%
4. Joe Nathan 30.16%
5. JP Howell 29.85%
..
..
..
9. Javy Vazquez 26.5%
10. Verlander 26.35%
by matthan on Jul 22, 2009 4:41 PM EDT reply actions 0 recs
Are we supposed to guess 6-8?
by FreeZorilla on Jul 22, 2009 4:48 PM EDT up reply actions 0 recs
How significant were each of the independent variables?
Sweet, by the way.
Beyond the Boxscore
by Sky Kalkman on Jul 22, 2009 5:05 PM EDT reply actions 0 recs
