There is a timeless debate in the hockey analytics community regarding the use of singular metrics (GF%, xGF%, DTMAboutHeart’s WAR, Emmanuel Perry’s K or WAR, Dom Luszczyszyn's GameScore, etc.) to quantify a player’s performance in one all-encompassing number. There is an equally extensive debate on how to evaluate these measurements while acknowledging the environmental factors that influence a player’s on-ice results. Hockey has a notoriously high number of inputs, each carrying a high degree of complexity. How should we address a scenario where two players achieve the same end result under drastically different circumstances?
## Methodology
I contend that we should give a slight advantage to those who were able to produce in a more difficult environment. To illustrate, let’s examine two players producing the same result with drastically different environmental factors:
As shown above, Mikhail Sergachev, defenseman for the Tampa Bay Lightning, and Roman Polak, defenseman for the Toronto Maple Leafs, are having almost identically exceptional years so far as measured by Expected Goals. However, Sergachev sees far more zone starts in the offensive zone than Polak (70% compared to 45% ZSR%), has a higher Quality of Teammates than Polak (53.44 compared to 50.52 xGF% QoT), and faces weaker competition than Polak (49.39 compared to 50.10 xGF% QoC). Simply put, Sergachev plays easier minutes than Polak, but they achieve the same result. I deem it rational to give players like Polak a bit more credit than players like Sergachev for producing equal results in a tougher environment, hence the decision to create this adjustment method.
Now to be clear, I am far from the first to attempt adjustment methods. There are plenty of examples of approaches found on hockeygraphs, Corsica, hockeyviz, and others, and I encourage you to read those works as well. However, my rendition is slightly different and produces different weightings and results. First and foremost, let’s define some terms I’ll be using quite frequently throughout this article. If you’re familiar with hockey advanced stats and basic statistical procedures, feel free to skip over this section.

- TOI (Time on Ice): The sample size of a player’s outputs, measured in minutes played.
- xGF% (Expected Goals For Percentage): The share of total Expected Goals by either team that belongs to the given player’s team while he is on the ice. Expected Goals is a concept created by Emmanuel Perry that considers shot type, quality, and some circumstantial data to determine the likelihood of a goal. More can be found here on this concept. I will be using this stat as the dependent variable representing a player’s performance.
- ZSR% (Zone Start Rate Percentage): The percentage of a player’s zone starts that are in the Offensive Zone, with Offensive Zone starts being more conducive to success than Defensive Zone starts. (I think you would agree it is easier to put pucks in the opponent’s net and keep them out of your own when you’re standing in front of the opponent’s goalie.) This stat is one of the three contextual independent variables we will be considering.
- xGF% QoT (Expected Goals For Quality of Teammates): Uses xGF% to measure the quality of a player’s teammates. This stat is the second contextual independent variable we will be considering.
- xGF% QoC (Expected Goals For Quality of Competition): Uses xGF% to measure the quality of a player’s opponents. This stat is the third and final contextual independent variable we will be considering.
- Multiple Linear Regression: A statistical technique that uses several explanatory variables to predict the outcome of a response variable (Investopedia).
- Weighted Average: A mean in which each quantity to be averaged is assigned a weight, and these weightings determine the relative importance of each quantity on the average (Investopedia).

---

Welcome back, for some of you. Now that we’re all on the same page, let’s dive into the statistics. The sample I used for this analysis was 1,775 observations from the past 3 seasons (2015-16 through 2017-18), each with a minimum of 300 minutes of 5v5 time on ice. With this sample, I ran a multiple linear regression to estimate how much of the variation in a player’s xGF% (dependent result variable) can be explained by their ZSR%, xGF% QoT, and xGF% QoC (independent input variables). This provides a quantitative measure of how much influence each contextual stat has on a player’s performance. The results are summarized below:
A Quick Interpretation:
**Multiple R:** Shows the correlation coefficient of our regression. A perfect score is 1.0, which is highly unlikely to be found in non-theoretical data. .43 is a strong score.

**Adjusted R Square:** This is the coefficient of determination, adjusted for regressions with more than one independent variable. It can be interpreted to say that 18% of the variation in xGF% can be explained by the three components we’ve selected. 1.0 is a perfect score. This is a fair measure.

**P-Value:** Used to measure the statistical significance of each input variable. Each of our inputs is statistically significant because the p-value for each of our independent variables is below .05 (the threshold for a 95% confidence level).

**Coefficients:** The measure of how much xGF% changes with a one-unit increase in each input variable. For example, for every one unit increase in ZSR%, you can expect a player’s xGF% to increase by .09 points. By examining the readout of the coefficients, we’re able to learn a few things:

- More offensive zone starts lead to a higher xGF% (Shown by the positive coefficient for ZSR)
- Better teammates lead to a higher xGF% (Shown by the positive coefficient for xGF QoT)
- Better competition leads to a lower xGF% (Shown by the negative coefficient for xGF QoC)
- The quality of competition is the most impactful contextual metric, followed by quality of teammates, and then zone start percentage. (Shown by the absolute value of each coefficient)

With those findings in mind, the adjustment has 3 cardinal objectives:

- Reward players who have fewer offensive zone starts than average
- Reward players who have a lower quality of teammate than average
- Reward players who have a higher quality of competition than average
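
For readers who want to try a regression like the one interpreted above, here is a minimal sketch using NumPy's least-squares solver. It is not the tool I used for the article. The function name `fit_ols`, the synthetic data, and the "true" coefficients in the demo are all made up for illustration, and p-values are left out because they need an additional statistics library.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    X: (n, k) array of inputs (e.g. ZSR%, xGF% QoT, xGF% QoC).
    y: (n,) array of the dependent variable (e.g. xGF%).
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])       # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    ss_res = float(np.sum((y - fitted) ** 2))       # unexplained variation
    ss_tot = float(np.sum((y - y.mean()) ** 2))     # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return {
        "intercept": float(beta[0]),
        "coefs": beta[1:],
        "multiple_r": float(np.sqrt(r2)),
        "r2": r2,
        "adj_r2": adj_r2,
    }

# Demo on SYNTHETIC data (not the article's sample); coefficients are invented.
rng = np.random.default_rng(0)
n_obs = 500
X_demo = np.column_stack([
    rng.normal(50, 8, n_obs),    # stand-in for ZSR%
    rng.normal(50, 2, n_obs),    # stand-in for xGF% QoT
    rng.normal(50, 0.5, n_obs),  # stand-in for xGF% QoC
])
made_up_coefs = np.array([0.1, 0.5, -1.0])
y_demo = 50 + (X_demo - 50) @ made_up_coefs + rng.normal(0, 3, n_obs)

demo_fit = fit_ols(X_demo, y_demo)
print(demo_fit["coefs"], demo_fit["multiple_r"], demo_fit["adj_r2"])
```

Swapping the synthetic arrays for real player data (one row per observation) would give coefficient signs and fit statistics to compare against the readout discussed above.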
I have created a “Percent of Change” column that isn’t normally provided in regression results. This shows the percentage of the overall change in xGF% that each metric would contribute if each were increased by one unit. In other words, ZSR% would be responsible for 5% of the change in xGF%, while xGF% QoT is responsible for 31% and xGF% QoC is responsible for 63%. These percentages are also the weights that we will give each metric. Our weighted average formula works in 3 steps:
**Step 1:** Take the difference from the mean for each stat (let’s refer to this as *n*) in a way that is consistent with our 3 cardinal objectives. For behaviors we want to reward, we must calculate the difference so that a positive difference is a positive outcome. For example, the average ZSR% in our sample is 49.9%. Say Player X has a ZSR% of 40%. Player X is offensively utilized almost 10 percentage points less than average, so we will help him out a bit. 49.9 - 40 = 9.9. For Player X's ZSR component, *n* = 9.9.

**Step 2:** Discount or inflate the player’s xGF% by 1 + *n*, with *n* expressed as a decimal (9.9 becomes .099). We add 1 to our *n* because this will inflate a player's independent component if it is rewarded in step 1, or deflate it if it is punished in step 1. In Player X's case, we'll see 1 + .099 = 1.099. Essentially, we'll be boosting his xGF% by about 10% because of this theoretical player's disadvantageous zone deployment.

**Step 3:** Multiply by that stat’s Percent of Change value. Since we know ZSR% is responsible for 5% of all change in xGF% if each input was increased by one unit, we will assign a weight of 5% to ZSR% in our weighted average. In Player X's case, we'll end up calculating: (Original xGF%) x (1 + .099 ZSR% adjustment towards mean) x (.05 weight of ZSR% input). However, we have three inputs, so we'll be doing the same calculation for ZSR%, xGF% QoC, and xGF% QoT and adding the three results together. The result will be our adjusted xGF%.
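
The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the weights (.05, .31, .63) and the 49.9 average ZSR% come from this article, but the 50.0 averages for xGF% QoT and xGF% QoC are placeholders, since the sample means for those stats are not listed here. The names `adjusted_xgf`, `WEIGHTS`, `MEANS`, and `REWARD_BELOW_MEAN` are hypothetical, and dividing by the weight total is my own choice so that the rounded weights (which add to .99) do not shrink every player's number.

```python
# Weights are the "Percent of Change" values quoted in the article.
WEIGHTS = {"zsr": 0.05, "qot": 0.31, "qoc": 0.63}

# 49.9 is the article's sample mean for ZSR%.
# The QoT and QoC means are PLACEHOLDERS, not values from the article.
MEANS = {"zsr": 49.9, "qot": 50.0, "qoc": 50.0}

# True: being below average is rewarded (fewer OZ starts, weaker teammates).
# False: being above average is rewarded (tougher competition).
REWARD_BELOW_MEAN = {"zsr": True, "qot": True, "qoc": False}

def adjusted_xgf(xgf, context, means=MEANS, weights=WEIGHTS):
    """Apply the 3-step weighted adjustment to a raw xGF% value."""
    total_weight = sum(weights.values())
    adjusted = 0.0
    for stat, weight in weights.items():
        # Step 1: difference from the mean, signed so positive = reward.
        if REWARD_BELOW_MEAN[stat]:
            diff = means[stat] - context[stat]
        else:
            diff = context[stat] - means[stat]
        n = diff / 100.0                      # 9.9 points -> .099
        # Step 2: inflate or discount the raw xGF% by 1 + n.
        component = xgf * (1.0 + n)
        # Step 3: weight the component and add it to the running total.
        adjusted += component * (weight / total_weight)
    return adjusted

# Player X from the text: ZSR% of 40, otherwise average context,
# with a made-up raw xGF% of 50.
player_x = {"zsr": 40.0, "qot": 50.0, "qoc": 50.0}
print(adjusted_xgf(50.0, player_x))
```

Because the QoT and QoC means are placeholders, this sketch should not be expected to reproduce the adjusted figures quoted for real players in this article.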
So how does this adjustment affect the way we evaluate Sergachev relative to Polak? Sergachev’s clear advantages brought his adjusted xGF% down from 54.92 to 53.53 (-1.39), while Polak gets a slight boost of .09 to 54.99.
## Results
I've been sitting on this article for about two weeks, unsure of whether I should post it and too preoccupied at my day job to address some of the concerns I have with this piece. My main concern is that I can't yet prove whether this adjustment method produces a metric that is more predictive, and therefore more accurate in evaluating a player's true performance ability, than xGF% itself. In order to test this adjustment method, I would need split-season data so as to keep as many variables equal as possible, and as of 1/27/18, Corsica's custom query functionality is not back yet.
There are too many variables unaccounted for from season to season for me to feel comfortable evaluating validity across previous seasons. My methodology accounts for just 3 of the countless inputs, most of which we do not yet quantify. As I said, hockey is notoriously difficult to quantify and predict. I decided to post this and use the data I gathered now as my first-half population, then gather a sample at the end of this season to test split-season validity. Only then will I test the validity of this adjustment method. I'll keep you posted. For now, I’ve included a dashboard below allowing you to sift through the data by player, team, position, and season to see how different players have been adjusted:
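
One way such a split-season test could look is sketched below. It rests on an assumption the article does not settle: that "more predictive" means correlating more strongly with a player's raw xGF% in the second half of the season. The function names are hypothetical and the numbers in the usage lines are toy values, not player data.

```python
import numpy as np

def predictive_r(first_half_metric, second_half_outcome):
    """Pearson correlation between a first-half metric and a second-half outcome."""
    return float(np.corrcoef(first_half_metric, second_half_outcome)[0, 1])

def compare_predictiveness(raw_first, adjusted_first, raw_second):
    """Compare how well raw and adjusted first-half xGF% track second-half xGF%.

    Each argument is a sequence with one value per player, in the same order.
    """
    return {
        "raw": predictive_r(raw_first, raw_second),
        "adjusted": predictive_r(adjusted_first, raw_second),
    }

# Toy values only, to show the call shape.
toy = compare_predictiveness(
    raw_first=[48.0, 51.0, 53.0, 55.0],
    adjusted_first=[49.0, 50.5, 52.0, 56.0],
    raw_second=[47.5, 52.0, 51.0, 54.0],
)
print(toy)
```

If the "adjusted" correlation came out consistently higher than the "raw" one on real split-season data, that would be evidence in favor of the adjustment.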