Daryl Morey is one of the coolest people in basketball. The general manager of the Houston Rockets, he never played basketball even at the college level. Instead, he’s known for co-founding the MIT Sloan Sports Analytics Conference and pioneering the use of models and analytics in basketball. In this report, we will look at many facets of the NBA, including how offense and defense are correlated, how the game has sped up from 2000 to 2019, how the conferences differ from each other, and how we can try to predict the number of wins an NBA team will have based on advanced statistics about the team.

Let’s first set the years we want data from.

BEGIN = 2000
END = 2019

We will be getting most of our data from Basketball-Reference, which is already formatted well. All of the data is downloaded as .csv files, so each season can simply be read in with read.csv(). There won’t be any holes in the data, as all of these “advanced statistics” are tracked back to 1984. Only data from the 1999-2000 season onwards will be used, as the pace of the game and the use of the 3-point line have changed significantly since the 80s and 90s.

# Get basic team data from each season: loop through every year of saved Basketball-Reference data, keep the most useful columns, and clean up the team names
results <- data.frame()
for (i in BEGIN:END) {
  index = i - (BEGIN - 1)
  temp <- read.csv(sprintf("%d_data.csv", i))
  # Remove the last row, which is league average
  temp <- head(temp, -1)
  
  # Remove the asterisk at the end of playoff team names
  temp$Team <- gsub("\\*$", "", temp$Team)
  temp$Year <- as.numeric(rep(i, nrow(temp)))

  # Merge dataframe
  results <- rbind(results, temp)
}
dim(results)
## [1] 595  29
names(results)
##  [1] "Rk"        "Team"      "Age"       "W"         "L"        
##  [6] "PW"        "PL"        "MOV"       "SOS"       "SRS"      
## [11] "ORtg"      "DRtg"      "NRtg"      "Pace"      "FTr"      
## [16] "X3PAr"     "TS."       "eFG."      "TOV."      "ORB."     
## [21] "FT.FGA"    "eFG..1"    "TOV..1"    "DRB."      "FT.FGA.1" 
## [26] "Arena"     "Attend."   "Attend..G" "Year"
attach(results)

We end up with 595 rows of data, each representing a season of a team, which should be more than enough to run some interesting analysis.

This year, many teams that have strong offenses also have strong defenses. To see how well this holds up across seasons, we can look at the correlation between offensive rating (an estimate of points scored per 100 possessions) and defensive rating (an estimate of points given up per 100 possessions). A higher offensive rating is better, and a lower defensive rating is better. We will run a standard correlation test as well as a bootstrapped correlation. First, we make a scatterplot.

# Scatterplot of the two
plot(ORtg, DRtg,
     main="Defensive Rating vs. Offensive Rating",
     xlab="Offensive Rating",
     ylab="Defensive Rating",
     pch=19,
     col="blue")

From the scatterplot, it seems like there is minimal correlation between offensive and defensive rating, although there might be a slight positive slope between the two. We can run the correlation test now.

# Correlation test
cor.test(ORtg, DRtg)
## 
##  Pearson's product-moment correlation
## 
## data:  ORtg and DRtg
## t = 3.8655, df = 593, p-value = 0.0001231
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07736796 0.23420220
## sample estimates:
##       cor 
## 0.1567733

The p-value of 0.0001 is significant at an alpha of .05. We get a correlation of approximately 0.156, which seems to show that as offense gets better, defense gets slightly worse (and vice versa). This could be explained by the fact that teams that exert a lot of energy on offense have less left in the tank for defense. A prime example of this phenomenon is the 1983-84 Denver Nuggets. They scored a league-leading 123.7 points per game (which would still lead the league by 5 ppg last season), but gave up a league-worst 124.8 points per game.

Adding the line yields the following scatterplot:

lm1 <- lm(DRtg ~ ORtg)
plot(ORtg, DRtg, 
     main="DRtg vs. ORtg with Regression Line",
     xlab="Offensive Rating",
     ylab="Defensive Rating",
     pch = 19, 
     col="blue")

# abline() accepts a vector of (intercept, slope), which is exactly what lm1$coef contains
abline(lm1$coef, col="red", lwd=3)

A bootstrapped correlation should give a similar result:

# Get number of rows
N <- length(DRtg)
n_samp <- 10000

corResults <- rep(NA, n_samp)
for (i in 1:n_samp) {
  s <- sample(1:N, N, replace=T)
  sVals <- as.numeric(names(table(s)))
  sCounts <- as.vector(table(s))
  
  fakeD <- rep(DRtg[sVals], sCounts)
  fakeO <- rep(ORtg[sVals], sCounts)
  
  cor1 <- cor(fakeD, fakeO)
  corResults[i] <- cor1
}

# Graph the correlations
ci_r <- quantile(corResults, c(.025, .975))
hist(corResults, col = "blue", main = "Bootstrapped Correlations", xlab = "Sample Correlation", breaks = 50)
abline(v = ci_r, lwd = 3, col = "red")
abline(v = cor.test(DRtg, ORtg)$conf.int,lwd = 3, col = "green", lty = 2)

The red lines show the 95% confidence interval for the bootstrap, while the green dotted lines show the 95% confidence interval for the correlation test. In this case, the bootstrapped confidence interval also doesn’t include 0, so it also found a statistically significant correlation between offense and defensive rating. Generally, the bootstrap is more flexible, and seeing as computers are getting more and more powerful, these tests are becoming more and more common.
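For a quick numeric comparison of the two intervals (a small sketch; the bootstrap values will differ slightly on every run), we can print them side by side:

# Percentile bootstrap CI vs. the parametric CI from cor.test()
round(ci_r, 3)
round(cor.test(ORtg, DRtg)$conf.int, 3)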

Another feature to investigate is the pace of the game, which is roughly how fast possessions change. People have long argued that the game has sped up since 2000. It’s easy to check with a t-test on pace, which is an estimate of the number of possessions per 48 minutes.

pace_2000 <- results[results$Year == 2000, "Pace"]
pace_2019 <- results[results$Year == 2019, "Pace"]

(mean(pace_2019) - mean(pace_2000))
## [1] 6.947011

A difference of almost 7 possessions per game seems substantial, since each possession can take up to 24 seconds and the entire game is only 48 minutes. We can do a t-test to find out whether the difference is statistically significant. First, we can take a look at the distribution of pace in both years.

hist(pace_2000,
     main="Histogram of Team Paces in 2000",
     xlab="Pace for Teams",
     breaks=10,
     col="red")

hist(pace_2019,
     main="Histogram of Team Paces in 2019",
     xlab="Pace for Teams",
     breaks=10,
     col="red")

With a sizeable sample of roughly 30 teams per season, it seems like the assumptions for the t-test can be generally met (the Central Limit Theorem may work its magic a little). The distributions are generally symmetric and seem somewhat clustered around the center. We then run the t-test.

t.test(pace_2019, pace_2000)
## 
##  Welch Two Sample t-test
## 
## data:  pace_2019 and pace_2000
## t = 11.69, df = 55.569, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.756352 8.137671
## sample estimates:
## mean of x mean of y 
## 100.03667  93.08966

The t-test result has a p-value of almost 0, which strongly suggests that the pace of the game has sped up from 2000 to 2019. We could also run a non-parametric Kruskal-Wallis test, which does not require any assumption of normality for the underlying data.

kruskal.test(list(pace_2000, pace_2019))
## 
##  Kruskal-Wallis rank sum test
## 
## data:  list(pace_2000, pace_2019)
## Kruskal-Wallis chi-squared = 40.567, df = 1, p-value = 1.9e-10

In this case as well, we can safely conclude that there is strong evidence that the pace has increased from 2000 to 2019.
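In the same spirit as the bootstrapped correlation earlier, we could also bootstrap the difference in mean pace directly; a minimal sketch (the interval bounds will vary slightly from run to run):

# Bootstrap the difference in mean pace between 2019 and 2000
n_samp <- 10000
pace_diffs <- rep(NA, n_samp)
for (i in 1:n_samp) {
  pace_diffs[i] <- mean(sample(pace_2019, length(pace_2019), replace=TRUE)) -
    mean(sample(pace_2000, length(pace_2000), replace=TRUE))
}
# 95% percentile interval for the difference in means
quantile(pace_diffs, c(.025, .975))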

Next, we turn to predicting the number of wins a team earns in a season from its advanced statistics. We can start off with a simple backwards stepwise regression. Beginning with all of the factors, we remove the least significant one at each step until only significant factors remain. The process is done manually, and we end up with a far simpler model than the 19 original factors.
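As an aside, R can automate a similar search with step(), which drops terms by AIC rather than by individual p-values; a quick sketch of that alternative, not the manual procedure used below:

# Automated backwards selection by AIC (an alternative to the manual p-value-based elimination below)
full <- lm(W ~ MOV + SOS + ORtg + DRtg + Pace + FTr + X3PAr + TS. + eFG. + TOV. +
             ORB. + FT.FGA + eFG..1 + TOV..1 + DRB. + FT.FGA.1 + Attend. + Age, data=results)
backwards_aic <- step(full, direction="backward", trace=0)
formula(backwards_aic)

The manual, p-value-based elimination follows.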

# Create a temporary dataframe for the multiple regression that keeps only the 19 advanced-stat predictors (the response W comes from the attached results dataframe) and reorders the columns
results1 <- results[, c(-1,-2,-4, -5, -6, -7, -13, -26, -28, -29)]
results1 <- results1[, c(2:19, 1)]

# Original model with all predictors
lm1 <- lm(W ~ MOV + SOS + ORtg + DRtg + Pace + FTr + X3PAr + TS. + eFG. + TOV. + ORB. + FT.FGA + eFG..1 + TOV..1 + DRB. + FT.FGA.1 + Attend. + Age, data=results1)
summary(lm1)
## 
## Call:
## lm(formula = W ~ MOV + SOS + ORtg + DRtg + Pace + FTr + X3PAr + 
##     TS. + eFG. + TOV. + ORB. + FT.FGA + eFG..1 + TOV..1 + DRB. + 
##     FT.FGA.1 + Attend. + Age, data = results1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8814  -2.0088   0.1278   2.2419   7.1765 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.171e+02  4.956e+01  -2.362  0.01849 *  
## MOV          1.539e+00  8.045e-01   1.913  0.05623 .  
## SOS         -1.313e-02  3.805e-01  -0.035  0.97248    
## ORtg         9.985e-01  1.032e+00   0.967  0.33387    
## DRtg         5.542e-01  9.252e-01   0.599  0.54941    
## Pace         4.224e-02  5.714e-02   0.739  0.46010    
## FTr          1.144e+02  5.890e+01   1.941  0.05270 .  
## X3PAr       -4.793e+00  5.023e+00  -0.954  0.34033    
## TS.          4.494e+02  2.863e+02   1.570  0.11701    
## eFG.        -4.123e+02  2.399e+02  -1.719  0.08622 .  
## TOV.         1.548e-02  1.045e+00   0.015  0.98819    
## ORB.        -8.725e-02  3.991e-01  -0.219  0.82701    
## FT.FGA      -2.210e+02  1.216e+02  -1.817  0.06968 .  
## eFG..1      -2.179e+02  7.590e+01  -2.871  0.00425 ** 
## TOV..1       1.752e+00  6.785e-01   2.582  0.01006 *  
## DRB.         6.001e-01  2.473e-01   2.426  0.01557 *  
## FT.FGA.1    -3.948e+01  1.578e+01  -2.502  0.01262 *  
## Attend.      1.128e-05  1.668e-06   6.760 3.40e-11 ***
## Age          4.129e-01  9.714e-02   4.250 2.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.226 on 576 degrees of freedom
## Multiple R-squared:  0.9347, Adjusted R-squared:  0.9327 
## F-statistic: 458.3 on 18 and 576 DF,  p-value: < 2.2e-16
# After backwards stepwise regression
backwards <- lm(W ~ MOV + ORtg + eFG..1 + TOV..1 + DRB. + FT.FGA.1 + Attend. + Age, data=results1)
summary(backwards)
## 
## Call:
## lm(formula = W ~ MOV + ORtg + eFG..1 + TOV..1 + DRB. + FT.FGA.1 + 
##     Attend. + Age, data = results1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3756  -2.0226   0.1157   2.2785   7.3638 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.922e+01  2.670e+01  -2.593 0.009759 ** 
## MOV          1.163e+00  3.803e-01   3.059 0.002322 ** 
## ORtg         1.265e+00  3.493e-01   3.622 0.000317 ***
## eFG..1      -1.866e+02  5.571e+01  -3.349 0.000863 ***
## TOV..1       1.492e+00  4.997e-01   2.986 0.002943 ** 
## DRB.         4.773e-01  1.725e-01   2.767 0.005831 ** 
## FT.FGA.1    -3.121e+01  1.158e+01  -2.695 0.007235 ** 
## Attend.      1.135e-05  1.656e-06   6.855 1.81e-11 ***
## Age          4.187e-01  9.059e-02   4.622 4.69e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.22 on 586 degrees of freedom
## Multiple R-squared:  0.9339, Adjusted R-squared:  0.933 
## F-statistic:  1034 on 8 and 586 DF,  p-value: < 2.2e-16

The backwards stepwise regression gives a model with 8 predictors with an R^2 of about 0.93, which is quite high. We can try to use best subsets regression to find a better model. For this, we go up to a maximum model size of 12 predictors.

library(leaps)

# Create a temporary dataframe for best subsets regression that keeps only the 19 advanced-stat predictors (the response W comes from the attached results dataframe) and reorders the columns
results1 <- results[, c(-1,-2,-4, -5, -6, -7, -13, -26, -28, -29)]
results1 <- results1[, c(2:19, 1)]

# Get best subsets results - max 12 variables
mod1 <- regsubsets(W ~ ., data=results1, nvmax=12)
(mod1sum <- summary(mod1))
## Subset selection object
## 19 Variables  (and intercept)
##          Forced in Forced out
## MOV          FALSE      FALSE
## SOS          FALSE      FALSE
## SRS          FALSE      FALSE
## ORtg         FALSE      FALSE
## DRtg         FALSE      FALSE
## Pace         FALSE      FALSE
## FTr          FALSE      FALSE
## X3PAr        FALSE      FALSE
## TS.          FALSE      FALSE
## eFG.         FALSE      FALSE
## TOV.         FALSE      FALSE
## ORB.         FALSE      FALSE
## FT.FGA       FALSE      FALSE
## eFG..1       FALSE      FALSE
## TOV..1       FALSE      FALSE
## DRB.         FALSE      FALSE
## FT.FGA.1     FALSE      FALSE
## Attend.      FALSE      FALSE
## Age          FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           MOV SOS SRS ORtg DRtg Pace FTr X3PAr TS. eFG. TOV. ORB. FT.FGA
## 1  ( 1 )  "*" " " " " " "  " "  " "  " " " "   " " " "  " "  " "  " "   
## 2  ( 1 )  "*" " " " " " "  " "  " "  " " " "   " " " "  " "  " "  " "   
## 3  ( 1 )  "*" " " " " " "  " "  " "  " " " "   " " " "  " "  " "  " "   
## 4  ( 1 )  "*" " " " " "*"  " "  " "  " " " "   " " " "  " "  " "  " "   
## 5  ( 1 )  "*" " " " " "*"  " "  " "  " " " "   " " " "  " "  " "  " "   
## 6  ( 1 )  "*" " " " " "*"  "*"  " "  " " " "   " " " "  " "  " "  " "   
## 7  ( 1 )  "*" " " " " " "  "*"  " "  " " " "   "*" " "  "*"  "*"  " "   
## 8  ( 1 )  "*" " " " " "*"  " "  " "  " " " "   " " " "  " "  " "  " "   
## 9  ( 1 )  "*" " " " " "*"  " "  " "  "*" " "   " " " "  " "  " "  " "   
## 10  ( 1 ) "*" " " " " "*"  " "  " "  "*" "*"   " " " "  " "  " "  " "   
## 11  ( 1 ) "*" " " " " "*"  " "  "*"  "*" "*"   " " " "  " "  " "  " "   
## 12  ( 1 ) "*" " " " " "*"  " "  " "  "*" " "   "*" "*"  " "  " "  "*"   
##           eFG..1 TOV..1 DRB. FT.FGA.1 Attend. Age
## 1  ( 1 )  " "    " "    " "  " "      " "     " "
## 2  ( 1 )  " "    " "    " "  " "      "*"     " "
## 3  ( 1 )  " "    " "    " "  " "      "*"     "*"
## 4  ( 1 )  " "    " "    " "  " "      "*"     "*"
## 5  ( 1 )  "*"    " "    " "  " "      "*"     "*"
## 6  ( 1 )  "*"    " "    " "  " "      "*"     "*"
## 7  ( 1 )  " "    " "    " "  " "      "*"     "*"
## 8  ( 1 )  "*"    "*"    "*"  "*"      "*"     "*"
## 9  ( 1 )  "*"    "*"    "*"  "*"      "*"     "*"
## 10  ( 1 ) "*"    "*"    "*"  "*"      "*"     "*"
## 11  ( 1 ) "*"    "*"    "*"  "*"      "*"     "*"
## 12  ( 1 ) "*"    "*"    "*"  "*"      "*"     "*"

Now we can determine which one of these is the best model with respect to measures including the Adjusted R-Squared and Bayesian information criterion (BIC).
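Before fitting those models, plotting the two criteria against model size gives a quick sense of where each one levels off; a small sketch using the summary object computed above:

# Adjusted R-squared and BIC as a function of the number of predictors
par(mfrow=c(1, 2))
plot(mod1sum$adjr2, type="b", xlab="Number of Predictors", ylab="Adjusted R-Squared")
plot(mod1sum$bic, type="b", xlab="Number of Predictors", ylab="BIC")
par(mfrow=c(1, 1))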

First, with the Adjusted R-Squared.

#Fit this model and show results
AR <- results1[,mod1sum$which[which.max(mod1sum$adjr2),][-1]]
summary(lm(W ~ .,data=AR))
## 
## Call:
## lm(formula = W ~ ., data = AR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4455  -1.9822   0.0797   2.2874   7.0068 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.048e+02  3.164e+01  -3.312 0.000984 ***
## MOV          1.105e+00  3.823e-01   2.891 0.003987 ** 
## ORtg         1.325e+00  3.608e-01   3.672 0.000262 ***
## FTr          1.140e+02  5.746e+01   1.984 0.047757 *  
## TS.          4.898e+02  2.668e+02   1.836 0.066909 .  
## eFG.        -4.370e+02  2.367e+02  -1.846 0.065376 .  
## FT.FGA      -2.256e+02  1.193e+02  -1.892 0.059040 .  
## eFG..1      -1.936e+02  5.589e+01  -3.463 0.000572 ***
## TOV..1       1.551e+00  5.017e-01   3.091 0.002091 ** 
## DRB.         5.185e-01  1.759e-01   2.947 0.003332 ** 
## FT.FGA.1    -3.514e+01  1.189e+01  -2.955 0.003255 ** 
## Attend.      1.125e-05  1.656e-06   6.794 2.69e-11 ***
## Age          4.164e-01  9.159e-02   4.546 6.65e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.216 on 582 degrees of freedom
## Multiple R-squared:  0.9344, Adjusted R-squared:  0.9331 
## F-statistic: 691.4 on 12 and 582 DF,  p-value: < 2.2e-16

Even though the adjusted R-squared result is adjusted for the number of predictors in the model, it seems like it still chose the subset of regressors that was the largest, which in this case was 12. We can try the BIC instead, and see if it selects a model that is smaller and easier to interpret.

#Fit the model with the best bic
AR <- results1[, mod1sum$which[which.min(mod1sum$bic),][-1]]
mod2 <- lm(W ~ ., data=AR)
summary(mod2)
## 
## Call:
## lm(formula = W ~ ., data = AR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.8061  -2.0611   0.0566   2.3441   7.2249 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.166e+01  2.487e+00   8.709  < 2e-16 ***
## MOV         2.473e+00  3.571e-02  69.252  < 2e-16 ***
## Attend.     1.183e-05  1.603e-06   7.379 5.43e-13 ***
## Age         3.946e-01  8.891e-02   4.438 1.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.255 on 591 degrees of freedom
## Multiple R-squared:  0.9318, Adjusted R-squared:  0.9315 
## F-statistic:  2692 on 3 and 591 DF,  p-value: < 2.2e-16

The three variables that remain have p-values of almost 0, and the model accounts for about 93% of the variance of wins around the mean. The three variables themselves are interesting:

* Attendance can be seen as a proxy variable for the size of the market a team plays in. Top free agents have long tended to head toward the largest markets, where they can increase their marketability, and this regression seems to confirm that bias.
* Age being positive is also unsurprising. It is rare for young teams to perform well, as most of their players are inexperienced (save for a few exceptions like OKC 7-8 years ago). Teams usually need some sort of veteran presence (33+ years old) to perform well, and most championship teams have had their fair share of older players.
* MOV stands for margin of victory, the average amount a team beat its opponents by over the season. This is a good measure of how strong a team is in general; it probably already captures other variables like offensive rating, defensive rating, true shooting percentage, turnover percentage, etc. It’s not surprising at all that this statistic is in there.

If we cared less about inference and more about prediction, we could use ridge regression and the lasso to regularize the coefficients so that the model generalizes better to new datasets - for example, the 2019-2020 season. I will only use some of the possible factors for ridge regression; the code is more or less a proof of concept.

# Used for ridge regression
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
# Generate a list of hyperparameters for L2 regularization
lambdas <- 10^seq(3, -2, by = -.1)

# cv.glmnet() finds the best lambda through checking all of the values we provided via cross-validation
fit <- cv.glmnet(as.matrix(results1[, c(1, 4, 10, 15, 16, 17, 18, 19)]), W, alpha=0, lambda=lambdas)

# This gives a collection of all of the fit models
summary(fit$glmnet.fit)
##           Length Class     Mode   
## a0         51    -none-    numeric
## beta      408    dgCMatrix S4     
## df         51    -none-    numeric
## dim         2    -none-    numeric
## lambda     51    -none-    numeric
## dev.ratio  51    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        5    -none-    call   
## nobs        1    -none-    numeric
# Find the optimal lambda, then compute R^2 on the training data
(optimal <- fit$lambda.min)
## [1] 0.01
y <- predict(fit, s=optimal, newx=as.matrix(results1[, c(1, 4, 10, 15, 16, 17, 18, 19)]))
sst <- sum((y - mean(y))^2)
sse <- sum((W - y)^2)
(rsq <- 1 - sse / sst)
## [1] 0.9276295
# Get the coefficients
(coeffs <- coef(fit, s=optimal))
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  1.336907e+01
## MOV          2.416514e+00
## ORtg         1.362855e-01
## eFG.        -4.375503e+00
## TOV..1      -8.462039e-02
## DRB.        -5.611834e-02
## FT.FGA.1     2.464881e+00
## Attend.      1.137122e-05
## Age          4.331551e-01

In this case, it seems that the R^2 value on the training data has not really changed, mostly because the optimal regularization lambda was so close to 0. In other contexts with a larger lambda, more shrinkage of the parameters would probably decrease the training R^2 somewhat. Looking at the coefficients, since the optimal lambda is so low, they aren’t that much lower than what they were for the backwards stepwise regression (both of these models used the same factors).
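To see how the ridge penalty would shrink the coefficients as lambda grows, we can plot the coefficient paths of the underlying glmnet fit; a quick sketch:

# Ridge coefficient paths as a function of log(lambda); the dashed line marks the lambda chosen by cross-validation
plot(fit$glmnet.fit, xvar="lambda", label=TRUE)
abline(v=log(fit$lambda.min), lty=2)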

Now, we will try running the lasso, which uses L1 regularization. With the lasso, it’s possible for certain betas to be shrunk exactly to zero, essentially taking factors out of the model. Most of the code is extremely similar to before, except that we set alpha=1, which puts all of the weight on the L1 penalty term in glmnet().

# Generate a list of hyperparameters for L1 regularization
lambdas <- 10^seq(3, -2, by = -.1)

# cv.glmnet() finds the best lambda through checking all of the values we provided via cross-validation
fit <- cv.glmnet(as.matrix(results1), W, alpha=1, lambda=lambdas)

# This gives a collection of all of the fit models
summary(fit$glmnet.fit)
##           Length Class     Mode   
## a0         51    -none-    numeric
## beta      969    dgCMatrix S4     
## df         51    -none-    numeric
## dim         2    -none-    numeric
## lambda     51    -none-    numeric
## dev.ratio  51    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        5    -none-    call   
## nobs        1    -none-    numeric
# Find the optimal lambda, then compute R^2 on the training data
(optimal <- fit$lambda.min)
## [1] 0.1
y <- predict(fit, s=optimal, newx=as.matrix(results1))
sst <- sum((y - mean(y))^2)
sse <- sum((W - y)^2)
(rsq <- 1 - sse / sst)
## [1] 0.9264147
# Get the coefficients
(coeffs <- coef(fit, s=optimal))
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  1.135528e+01
## MOV          2.377203e+00
## SOS          .           
## SRS          .           
## ORtg         1.232150e-01
## DRtg         .           
## Pace         .           
## FTr          2.249367e+00
## X3PAr        .           
## TS.          .           
## eFG.         .           
## TOV.         .           
## ORB.         .           
## FT.FGA       3.955437e-01
## eFG..1      -5.281736e+00
## TOV..1       .           
## DRB.         .           
## FT.FGA.1     .           
## Attend.      1.063569e-05
## Age          3.917975e-01

We can see that in this case, the lasso selected the following factors: MOV, ORtg, FTr, FT.FGA, eFG..1, Attend., and Age. Recall that the best model from best subsets regression contained MOV, Attend., and Age; all three were also selected by the lasso. For good predictive power, this model would probably be the one used on test data.
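As a purely hypothetical sketch of that idea, the fitted lasso could score a new season’s table, assuming a file such as "2020_data.csv" downloaded from Basketball-Reference and formatted exactly like the training files (the file name and its layout are assumptions here, not part of the data above):

# Hypothetical: predict wins for a new season with the fitted lasso
new_season <- read.csv("2020_data.csv")
new_season <- head(new_season, -1)                      # drop the league-average row, as before
new_season$Team <- gsub("\\*$", "", new_season$Team)
new_season$Year <- 2020                                 # add the Year column the same way the training loop did
new1 <- new_season[, c(-1, -2, -4, -5, -6, -7, -13, -26, -28, -29)]  # same columns dropped as for results1
new1 <- new1[, c(2:19, 1)]
predicted_wins <- predict(fit, s=optimal, newx=as.matrix(new1))
head(cbind(new_season$Team, round(predicted_wins, 1)))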

Another regression method closely related to the lasso is least-angle regression (LAR). The algorithm begins with all betas set to 0 and, at each step, increases the coefficients most correlated with the current residuals, until all predictors are included in the model.

library(lars)
## Loaded lars 1.2
lars_res <- lars(as.matrix(results1), W, type="lar")
summary(lars_res)
## LARS/LAR
## Call: lars(x = as.matrix(results1), y = W, type = "lar")
##    Df   Rss       Cp
## 0   1 91845 8241.038
## 1   2  8839  259.142
## 2   3  7674  149.089
## 3   4  6715   58.840
## 4   5  6658   55.356
## 5   6  6271   20.201
## 6   7  6227   17.968
## 7   8  6216   18.914
## 8   9  6182   17.572
## 9  10  6181   19.465
## 10 11  6163   19.795
## 11 12  6161   21.584
## 12 13  6134   20.983
## 13 14  6133   22.901
## 14 15  6126   24.182
## 15 16  6032   17.198
## 16 17  6031   19.106
## 17 18  6014   19.437
## 18 19  6007   20.764
## 19 20  5978   20.000
(lars_res$R2)
##         0         1         2         3         4         5         6 
## 0.0000000 0.9037651 0.9164494 0.9268919 0.9275127 0.9317186 0.9321977 
##         7         8         9        10        11        12        13 
## 0.9323171 0.9326954 0.9327075 0.9328965 0.9329203 0.9332147 0.9332241 
##        14        15        16        17        18        19 
## 0.9333055 0.9343225 0.9343329 0.9345218 0.9345980 0.9349109
(lars_res$beta)
##          MOV      SOS           SRS       ORtg      DRtg       Pace
## 0   0.000000  0.00000   0.000000000 0.00000000 0.0000000 0.00000000
## 1   2.275677  0.00000   0.000000000 0.00000000 0.0000000 0.00000000
## 2   2.351397  0.00000   0.000000000 0.00000000 0.0000000 0.00000000
## 3   2.404304  0.00000   0.000000000 0.00000000 0.0000000 0.00000000
## 4   2.399787  0.00000   0.009247839 0.00000000 0.0000000 0.00000000
## 5   2.413614  0.00000  -0.004190222 0.06839784 0.0000000 0.00000000
## 6   2.414713  0.00000  -0.007938758 0.08146705 0.0000000 0.00000000
## 7   2.415251  0.00000  -0.009762722 0.08662582 0.0000000 0.00000000
## 8   2.395381  0.00000  -0.037981331 0.14540661 0.0000000 0.00000000
## 9   2.395481  0.00000  -0.039484628 0.14749358 0.0000000 0.00000000
## 10  2.389438  0.00000  -0.056245003 0.17629131 0.0000000 0.00000000
## 11  2.387099  0.00000  -0.056394315 0.17471652 0.0000000 0.00000000
## 12  2.376129  0.00000  -0.077669007 0.16589039 0.0000000 0.02616903
## 13  2.375870  0.00000  -0.079062086 0.16583187 0.0000000 0.02718247
## 14  2.353046  0.00000  -0.086048143 0.18159998 0.0000000 0.03180036
## 15  1.383168  0.00000  -0.059110004 1.04366158 0.0000000 0.04357883
## 16  1.351380  0.00000  -0.057675533 1.04834657 0.0000000 0.04392255
## 17  1.291465  0.00000  -0.048594252 1.11351596 0.0000000 0.04389211
## 18  1.343398  0.00000  -0.043144554 1.09138788 0.1074586 0.04359463
## 19 31.313163 29.78207 -29.827505144 1.06715022 0.5252641 0.04122264
##            FTr       X3PAr         TS.      eFG.        TOV.         ORB.
## 0    0.0000000  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 1    0.0000000  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 2    0.0000000  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 3    0.0000000  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 4    0.0000000  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 5    0.0000000  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 6    0.0000000  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 7    0.9515168  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 8    3.2692198  0.00000000   0.0000000    0.0000  0.00000000  0.000000000
## 9    3.3600133 -0.03623664   0.0000000    0.0000  0.00000000  0.000000000
## 10   5.1203236 -0.54800829   0.0000000    0.0000 -0.03279755  0.000000000
## 11   5.4262191 -0.65263144   0.8974498    0.0000 -0.04210829  0.000000000
## 12   8.8150794 -2.28664468   9.3195131    0.0000 -0.14430750  0.000000000
## 13   8.9570875 -2.35789597   9.6516710    0.0000 -0.14811193  0.000000000
## 14   9.7588550 -2.57992011  12.0266465    0.0000 -0.17451506  0.000000000
## 15  13.9756141 -4.00446530  13.3126754    0.0000 -0.20285408  0.000000000
## 16  13.9669954 -3.94222052  17.2159712    0.0000 -0.23587049  0.012472295
## 17  44.5076530 -4.00703369 158.0899318 -127.8779 -0.21043451  0.003418569
## 18  58.0584613 -4.16131821 214.5824400 -183.0619 -0.16637821 -0.014209118
## 19 116.4416227 -5.10042342 455.4635569 -420.1211  0.03803125 -0.096340893
##         FT.FGA      eFG..1     TOV..1      DRB.     FT.FGA.1      Attend.
## 0     0.000000    0.000000 0.00000000 0.0000000   0.00000000 0.000000e+00
## 1     0.000000    0.000000 0.00000000 0.0000000   0.00000000 0.000000e+00
## 2     0.000000    0.000000 0.00000000 0.0000000   0.00000000 3.695926e-06
## 3     0.000000    0.000000 0.00000000 0.0000000   0.00000000 7.223517e-06
## 4     0.000000    0.000000 0.00000000 0.0000000   0.00000000 7.523463e-06
## 5     0.000000    0.000000 0.00000000 0.0000000   0.00000000 9.859434e-06
## 6     2.214310    0.000000 0.00000000 0.0000000   0.00000000 1.029147e-05
## 7     1.650658    0.000000 0.00000000 0.0000000   0.00000000 1.041917e-05
## 8    -1.031976   -8.244214 0.00000000 0.0000000   0.00000000 1.068495e-05
## 9    -1.167341   -8.456929 0.00000000 0.0000000   0.00000000 1.069502e-05
## 10   -3.234818  -12.152363 0.00000000 0.0000000   0.00000000 1.084348e-05
## 11   -3.599360  -12.601009 0.00000000 0.0000000   0.00000000 1.085321e-05
## 12   -7.414763  -19.145940 0.00000000 0.0000000   0.00000000 1.099257e-05
## 13   -7.554053  -19.450482 0.00000000 0.0000000  -0.08021262 1.100023e-05
## 14   -8.393962  -24.019157 0.04027553 0.0000000  -1.33386034 1.110116e-05
## 15  -12.585509 -161.700501 1.26792055 0.4235655 -28.41060547 1.124560e-05
## 16  -12.523441 -166.061112 1.30704400 0.4365487 -29.26208083 1.125285e-05
## 17  -76.529161 -173.807009 1.37052499 0.4603023 -30.78545398 1.124041e-05
## 18 -104.551798 -182.352345 1.44444345 0.4873861 -32.47276771 1.124753e-05
## 19 -225.510445 -221.184212 1.80217773 0.6106848 -40.79306874 1.131474e-05
##          Age
## 0  0.0000000
## 1  0.0000000
## 2  0.0000000
## 3  0.1712203
## 4  0.1857816
## 5  0.3428698
## 6  0.3704174
## 7  0.3793758
## 8  0.3965817
## 9  0.3971469
## 10 0.4048872
## 11 0.4046308
## 12 0.4115422
## 13 0.4117050
## 14 0.4103095
## 15 0.4169288
## 16 0.4171108
## 17 0.4151222
## 18 0.4147042
## 19 0.4146079
## attr(,"scaled:scale")
##  [1] 1.099136e+02 9.552171e+00 1.070545e+02 9.235252e+01 8.814477e+01
##  [6] 8.361948e+01 9.358254e-01 1.711328e+00 5.164501e-01 5.718057e-01
## [11] 2.565571e+01 7.581284e+01 6.922957e-01 5.316770e-01 2.678100e+01
## [16] 6.509472e+01 7.011162e-01 2.251848e+06 4.245210e+01

From the R^2 values and betas, we see that MOV is a very strong predictor of wins (as it should be). With the lasso and best subsets, we could only tell that the best models came from a combination of MOV, Age, and Attendance; here, we can see that MOV alone immediately explains about 90% of the variation in wins.
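The coefficient paths make this easy to see visually; a quick sketch using the plot method from the lars package:

# Coefficient paths for the LAR fit; MOV is the first variable to enter the model
plot(lars_res)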

For another way to visualize the data, we can first do a Principal Components Analysis (PCA). In essence, PCA applies a linear transformation to the data, forming an orthogonal basis in which each successive principal component captures as much of the remaining variance as possible. For physics people, think of the principal axis theorem. To do so, we drop the non-numerical data and other redundant columns.

pca_df <- results[, c(-1, -2, -5, -6, -7, -8, -26, -28, -29)]
pca <- prcomp(pca_df, center=TRUE, scale=TRUE)
summary(pca)
## Importance of components:
##                          PC1   PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.538 2.227 1.4244 1.06346 0.97783 0.94579 0.89123
## Proportion of Variance 0.322 0.248 0.1014 0.05655 0.04781 0.04473 0.03971
## Cumulative Proportion  0.322 0.570 0.6714 0.72798 0.77579 0.82051 0.86023
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.84012 0.71768 0.69726 0.62684 0.57550 0.50243
## Proportion of Variance 0.03529 0.02575 0.02431 0.01965 0.01656 0.01262
## Cumulative Proportion  0.89552 0.92127 0.94558 0.96522 0.98178 0.99441
##                           PC14    PC15    PC16    PC17    PC18    PC19
## Standard deviation     0.23334 0.22229 0.07668 0.03244 0.02931 0.01469
## Proportion of Variance 0.00272 0.00247 0.00029 0.00005 0.00004 0.00001
## Cumulative Proportion  0.99713 0.99960 0.99989 0.99995 0.99999 1.00000
##                             PC20
## Standard deviation     1.696e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

We can plot the two largest principal components and see where the points land.

library(devtools)
## Loading required package: usethis
install_github("vqv/ggbiplot")
## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
##   Use `force = TRUE` to force installation
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
ggbiplot(pca, ellipse=TRUE, group=results[, "Team"])

It’s difficult to see much on the biplot itself, but recall from the summary above that approximately 6-7 principal components cover over 80% of the variance.
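A scree plot makes that cumulative-variance behavior easier to see; a minimal sketch:

# Scree plot of the component variances (the cumulative proportions were printed by summary(pca) above)
screeplot(pca, npcs=19, type="lines", main="Scree Plot of Team Statistics PCA")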

Using the principal components, we can do principal components regression, where we use some of the principal components as regressors. In this way, we can achieve dimensionality reduction, avoid multicollinearity, and (theoretically) reduce overfitting.

require(pls)
## Loading required package: pls
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcr_model <- pcr(W ~ ., data=results1, scale=TRUE, validation="CV")
summary(pcr_model)
## Data:    X dimension: 595 19 
##  Y dimension: 595 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           12.45    11.13    3.737    3.721    3.683    3.464    3.452
## adjCV        12.45    11.12    3.734    3.719    3.680    3.451    3.449
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       3.412    3.261    3.258     3.263     3.272     3.272     3.276
## adjCV    3.409    3.257    3.253     3.259     3.268     3.268     3.272
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        3.280     3.281     3.277     3.282     3.283     3.281
## adjCV     3.276     3.276     3.272     3.277     3.278     3.275
## 
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X    32.22    55.12    65.79    71.73    76.72    81.43    85.60    89.29
## W    20.63    91.05    91.20    91.48    92.56    92.57    92.77    93.33
##    9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X    92.00     94.56     96.62     98.37     99.69     99.96     99.99
## W    93.35     93.35     93.36     93.38     93.39     93.39     93.41
##    16 comps  17 comps  18 comps  19 comps
## X    100.00    100.00    100.00    100.00
## W     93.44     93.44     93.47     93.49
# Plot cross validation MSE and R^2
validationplot(pcr_model, val.type="MSEP")

validationplot(pcr_model, val.type="R2")

In this case, we can see that we really only need about 3 principal components before the cross-validated error levels off; adding more principal components improves the fit very little at the expense of a far more complex model.
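As a rough sanity check, we can see how well a 3-component fit reproduces the training wins; a small sketch (this is an in-sample RMSE, so it will look a bit rosier than the cross-validated RMSEP above):

# Fitted wins using only the first 3 principal components, and the corresponding training RMSE
pcr_pred <- predict(pcr_model, ncomp=3)
sqrt(mean((W - pcr_pred)^2))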

We could also try some non-parametric regressions to maximize our predictive power. In this case, we will take a look at spline methods. We will use just one predictor, Age, which was one of the most significant predictors in the best subsets regression. We first fit a smoothing spline, which means that we won’t need to select the locations of the knots ourselves, as we would with a regression spline such as a cubic spline basis.

library(splines)
spline_res <- smooth.spline(Age, W, cv=TRUE)
## Warning in smooth.spline(Age, W, cv = TRUE): cross-validation with non-
## unique 'x' values seems doubtful
spline_res
## Call:
## smooth.spline(x = Age, y = W, cv = TRUE)
## 
## Smoothing Parameter  spar= 0.8294247  lambda= 0.02881927 (12 iterations)
## Equivalent Degrees of Freedom (Df): 4.732098
## Penalized Criterion (RSS): 8141.522
## PRESS(l.o.o. CV): 111.8497
# It selected lambda of about 0.029 with roughly 4.7 effective degrees of freedom; plot it
plot(Age, W, col="grey")
lines(spline_res, lwd=2, col="purple")
legend("topright", ("Smoothing Splines with 6.78 df selected by CV"), col="purple", lwd=2)

We see that when regressing wins on Age, the spline finds a non-linear fit that tapers off at the higher ages. This could be a little better than a simple linear regression, but the improvement in this case shouldn’t be much.
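For comparison with a regression spline where we do have to pick the flexibility ourselves, here is a small sketch using bs() from the splines package (the choice of df=4 is arbitrary, just for illustration):

# Cubic B-spline regression with a fixed df, overlaid on the smoothing spline fit
bs_fit <- lm(W ~ bs(Age, df=4))
age_grid <- seq(min(Age), max(Age), length.out=100)
plot(Age, W, col="grey")
lines(spline_res, lwd=2, col="purple")
lines(age_grid, predict(bs_fit, newdata=data.frame(Age=age_grid)), lwd=2, col="darkgreen")
legend("topright", c("Smoothing spline (CV)", "Cubic B-spline, df=4"), col=c("purple", "darkgreen"), lwd=2)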