Using Multiple Linear Regression to Create a Weighted Average Adjusted xGF% Considering Contextual Factors
There is a timeless debate in the hockey analytics community regarding the use of singular metrics (GF%, xGF%, DTMAboutHeart’s WAR, Emmanuel Perry’s K or WAR, Dom Luszczyszyn’s GameScore, etc.) to quantify a player’s performance in one all-encompassing number. There is an equally extensive debate on how to evaluate these measurements while acknowledging the environmental factors that influence a player’s on-ice results. Hockey has a notoriously high number of inputs, each carrying a high degree of complexity. How should we address a scenario where two players achieve the same end result under drastically different circumstances?
I contend that we should give a slight advantage to those who were able to produce in a more difficult environment. To illustrate, let’s examine two players producing the same result with drastically different environmental factors:
As shown above, Mikhail Sergachev, defenseman for the Tampa Bay Lightning, and Roman Polak, defenseman for the Toronto Maple Leafs, are both having almost identically exceptional years so far as measured by Expected Goals. However, Sergachev sees far more zone starts in the offensive zone than Polak (70% compared to 45% ZSR%), has a higher Quality of Teammates than Polak (53.44 compared to 50.52 xGF% QoT), and faces weaker competition than Polak (49.39 compared to 50.10 xGF% QoC). Simply put, Sergachev plays easier minutes than Polak, but they achieve the same result. I deem it rational to give players like Polak a bit more credit than players like Sergachev for producing equal results in a tougher environment, hence the decision to create this adjustment method.
Now to be clear, I am far from the first to attempt an adjustment method. There are plenty of approaches to be found on hockeygraphs, Corsica, hockeyviz, and others, and I encourage you to read those works as well. However, my rendition is slightly different and produces different weightings and results.
First and foremost, let’s define some terms I’ll be using quite frequently throughout this article. If you’re familiar with hockey advanced stats and basic statistical procedures, feel free to skip over this section.
TOI : Time on Ice – The sample size of the players’ outputs measured by minutes played.
xGF% : Expected Goals For Percentage – the share of total Expected Goals by either team that are scored by the given player’s team while he is on the ice. Expected Goals is a concept created by Emmanuel Perry considering shot type, quality, and some circumstantial data to determine the likelihood of a goal. More can be found here on this concept. I will be using this stat as the dependent variable representing a player’s performance.
ZSR% : Zone Start Rate Percentage – The percentage of a player’s zone starts that are in the Offensive Zone, with Offensive Zone starts being more conducive to success than Defensive Zone starts. (I think you would agree it is easier to put pucks in the opponent’s net and keep them out of your own when you’re standing in front of the opponent’s goalie). This stat is one of the three contextual independent variables we will be considering.
xGF% QoT : Expected Goals For Quality of Teammates – Using xGF% to measure the quality of a player’s teammates. This stat is the second contextual independent variable we will be considering.
xGF% QoC : Expected Goals For Quality of Competition – Using xGF% to measure the quality of a player’s opponents. This stat is the third and final contextual independent variable we will be considering.
Multiple Linear Regression : A statistical technique that uses several explanatory variables to predict the outcome of a response variable (Investopedia).
Weighted Average : A mean in which each quantity to be averaged is assigned a weight, and these weightings determine the relative importance of each quantity in the average (Investopedia).
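For readers who prefer to see the weighted average definition as arithmetic, here is a minimal sketch. The function name and the numbers are made up purely for illustration and have nothing to do with any player's stats.

```python
def weighted_average(values, weights):
    """Mean of `values` where each value counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Illustrative numbers only: the middle value counts twice as much as the others.
result = weighted_average([10, 20, 30], [1, 2, 1])
print(result)
```

With equal weights this collapses to the ordinary mean; the larger a weight, the harder that quantity pulls the average toward itself.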
Welcome back, for some of you. Now that we’re all on the same page, let’s dive into the statistics. The sample I used for this analysis was 1,775 observations from the past 3 seasons (2015-16 through 2017-18), each with a minimum of 300 minutes of 5v5 time on ice. With this sample, I ran a multiple linear regression to estimate how much of the variation in a player’s xGF% (the dependent result variable) can be explained by their ZSR%, xGF% QoT, and xGF% QoC (the independent input variables). This provides a quantitative measure of how much influence each contextual stat has on a player’s performance. The results are summarized below:
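For anyone who wants to try a regression of this shape themselves, here is a rough sketch using NumPy's least-squares solver. It is not the analysis above: the data are synthetic stand-ins generated on the spot, the "true" coefficients used to generate them are hypothetical, and the variable names are my own. Only the sample size (1,775) and the choice of three inputs come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1775  # matches the article's sample size; the values themselves are synthetic

# Synthetic stand-ins for the three contextual inputs (NOT real player data).
zsr = rng.normal(50, 8, n)    # zone start rate, percent
qot = rng.normal(50, 2, n)    # quality of teammates, xGF%
qoc = rng.normal(50, 0.6, n)  # quality of competition, xGF%

# Hypothetical relationship, used only to generate a toy response variable.
xgf = (50 + 0.09 * (zsr - 50) + 0.5 * (qot - 50) - 1.0 * (qoc - 50)
       + rng.normal(0, 2.5, n))

# Design matrix: a column of ones for the intercept, then the three inputs.
X = np.column_stack([np.ones(n), zsr, qot, qoc])
coef, *_ = np.linalg.lstsq(X, xgf, rcond=None)
intercept, b_zsr, b_qot, b_qoc = coef

# R squared: share of the variation in xGF% explained by the fitted model.
residuals = xgf - X @ coef
r_squared = 1 - residuals.var() / xgf.var()

print("coefficients (ZSR%, QoT, QoC):", b_zsr, b_qot, b_qoc)
print("R squared:", r_squared)
```

Swapping the synthetic arrays for real columns (one row per observation) is the only change needed to run this on actual data; a package such as statsmodels would additionally report p-values.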
A Quick Interpretation:
Multiple R: Shows the Correlation Coefficient of our regression. A perfect score is 1.0, which is highly unlikely to be found in non-theoretical data. A score of .43 indicates a moderate relationship.
Adjusted R Square: This is the Coefficient of Determination, adjusted for the number of independent variables in the regression. It can be interpreted to say that 18% of the variation in xGF% can be explained by the three components we’ve selected. 1.0 is a perfect score. This is a fair measure.
P-Value: Used to measure the statistical significance of each input variable. Each of our inputs is statistically significant because the p-value for each of our independent variables is below .05 (assuming a 95% confidence level).
Coefficients: The measure of how much xGF% changes with a marginal increase in each input variable, holding the other inputs constant. For example, for every one-unit increase in ZSR%, you can expect a player’s xGF% to increase by .09 points.
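As a quick sanity check on how Multiple R and Adjusted R Square relate, here is a small sketch. It assumes the standard ordinary-least-squares definitions (Multiple R is the square root of the unadjusted R Square) and that the .43 above comes from the same fit with 1,775 observations and 3 predictors. The function name is my own.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


multiple_r = 0.43            # reported above
r_squared = multiple_r ** 2  # unadjusted R squared under the OLS definition

adj = adjusted_r_squared(r_squared, n_obs=1775, n_predictors=3)
print("R squared:", round(r_squared, 2))
print("Adjusted R squared:", round(adj, 2))
```

With a sample this large and only three inputs, the adjustment barely moves the number, which is why squaring .43 lands so close to the 18% figure.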
By examining the readout of the coefficients, we’re able to learn a few things:
I have created a “Percent of Change” column that isn’t normally provided in regression results. This just shows the percentage of the overall change in xGF% that each metric would contribute if each were increased by one unit. In other words, ZSR% would be responsible for 5% of the change in xGF%, while xGF% QoT is responsible for 31% and xGF% QoC is responsible for 63%. Each of these measures also represents the weight that we will give each metric. Our weighted average formula works in 3 steps:
So how does this adjustment affect the way we evaluate Sergachev relative to Polak? Sergachev’s clear advantages brought his adjusted xGF% down from 54.92 to 53.53 (-1.39), while Polak gets a slight boost of .09, to 54.99.
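Before those steps, a note on the weights themselves. The sketch below shows one way a “Percent of Change” column like this could be derived: divide the size of each coefficient by the sum of all three. This covers only the weight derivation, not the three-step adjustment. Only the .09 coefficient for ZSR% is taken from the regression readout; the other two coefficients are hypothetical placeholders picked so the shares land near the reported ones, and using absolute values is my assumption.

```python
def percent_of_change(coefficients):
    """Share of the total one-unit effect attributable to each input."""
    total = sum(abs(c) for c in coefficients.values())
    return {name: abs(c) / total for name, c in coefficients.items()}


coefficients = {
    "ZSR%": 0.09,      # from the regression readout discussed above
    "xGF% QoT": 0.56,  # hypothetical placeholder, not a reported value
    "xGF% QoC": 1.13,  # hypothetical placeholder, not a reported value
}

weights = percent_of_change(coefficients)
for name, share in weights.items():
    print(f"{name}: {share:.0%}")
```

Because the shares are normalized, they always sum to 100% before rounding, so they can be used directly as the weights in a weighted average.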
I've been sitting on this article for about two weeks, unsure whether I should post it and too preoccupied at my day job to address some of the concerns I have with it. My main concern is that I can't yet prove whether this adjustment method produces a metric that is more predictive, and therefore more accurate in evaluating a player's true ability, than xGF% itself. To test the method, I would need split-season data so as to keep as many variables equal as possible, and as of 1/27/18, Corsica's custom query functionality is not back yet.
Too many variables are unaccounted for from season to season for me to feel comfortable evaluating validity across previous seasons. My methodology accounts for just 3 of the countless inputs, most of which we do not yet quantify. As I said, hockey is notoriously difficult to quantify and predict. I decided to post this now, use the data I have gathered as my first-half population, and gather a second sample at the end of this season to test split-season validity. Only then will I test the validity of this adjustment method. I'll keep you posted.
For now, I’ve included a dashboard below allowing you to sift through the data by player, team, position, and season to see how different players have been adjusted: