Daryl Morey is one of the coolest people in basketball. As general manager of the Houston Rockets, he never played basketball even at the college level. Instead, he's known for co-founding the MIT Sloan Sports Analytics Conference and pioneering the use of models and analytics in basketball. In this report, we will look at many facets of the NBA, including how offense and defense are correlated, how the game has sped up from 2000 to 2019, how the conferences differ from each other, and how we can try to predict the number of wins an NBA team will have based on advanced statistics about the team.
Let’s first set the years we want data from.
BEGIN = 2000
END = 2019
We will be getting most of our data from Basketball-Reference, which is already formatted well. All of the data is downloaded as .csv files, so the dataframes can simply be read in with read.csv(). There won't be any holes in the data, as all of these "advanced statistics" are tracked back to 1984. Only data from the 1999-2000 season onwards will be used, as the pace of the game and the use of the 3-point line have changed significantly since the 80s and 90s.
# Get basic team data from each season: loop through every year of saved Basketball-Reference data and clean it up
results <- data.frame()
for (i in BEGIN:END) {
  temp <- read.csv(sprintf("%d_data.csv", i))
  # Remove the last row, which is the league average
  temp <- head(temp, -1)
  # Remove the asterisk at the end of playoff team names
  temp$Team <- gsub("\\*$", "", temp$Team)
  # Tag each row with its season
  temp$Year <- as.numeric(rep(i, nrow(temp)))
  # Append this season's rows to the combined dataframe
  results <- rbind(results, temp)
}
dim(results)
## [1] 595 29
names(results)
## [1] "Rk" "Team" "Age" "W" "L"
## [6] "PW" "PL" "MOV" "SOS" "SRS"
## [11] "ORtg" "DRtg" "NRtg" "Pace" "FTr"
## [16] "X3PAr" "TS." "eFG." "TOV." "ORB."
## [21] "FT.FGA" "eFG..1" "TOV..1" "DRB." "FT.FGA.1"
## [26] "Arena" "Attend." "Attend..G" "Year"
attach(results)
We end up with 595 rows of data, each representing a season of a team, which should be more than enough to run some interesting analysis.
This year, many teams that have strong offenses also have strong defenses. To see how this holds up, we can do a correlation between offensive rating (based on points per possession) and defensive rating (based on points given up per possession). A higher offensive rating is better, and a lower defensive rating is better. We will do a correlation between the two and a bootstrapped correlation as well. First, we make a scatterplot.
# Scatterplot of the two
plot(ORtg, DRtg,
main="Defensive Rating vs. Offensive Rating",
xlab="Offensive Rating",
ylab="Defensive Rating",
pch=19,
col="blue")
From the scatterplot, it seems like there is minimal correlation between offensive and defensive rating, although there might be some sort of a positive slope between the two. We can do the correlation test now.
# Correlation test
cor.test(ORtg, DRtg)
##
## Pearson's product-moment correlation
##
## data: ORtg and DRtg
## t = 3.8655, df = 593, p-value = 0.0001231
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07736796 0.23420220
## sample estimates:
## cor
## 0.1567733
The p-value of 0.0001 is significant at an alpha of .05. We get a correlation of approximately 0.157, which suggests that as offense gets better, defense gets slightly worse (and vice versa). This could be explained by the fact that teams that exert a lot of energy on offense have less left in the tank for defense. A prime example of this phenomenon is the 1983-84 Denver Nuggets. They scored a league-leading 123.7 points per game (which would still lead the league by 5 ppg last season), but gave up a league-worst 124.8 points per game.
Adding the line yields the following scatterplot:
lm1 <- lm(DRtg ~ ORtg)
plot(ORtg, DRtg,
main="DRtg vs. ORtg with Regression Line",
xlab="Offensive Rating",
ylab="Defensive Rating",
pch = 19,
col="blue")
# abline() takes an intercept and a slope, which is exactly what lm1$coef contains (intercept first, then slope)
abline(lm1$coef, col="red", lwd=3)
A bootstrapped correlation should give a similar result:
# Get number of rows
N <- length(DRtg)
n_samp <- 10000
corResults <- rep(NA, n_samp)
for (i in 1:n_samp) {
  # Resample row indices with replacement
  s <- sample(1:N, N, replace=T)
  # Rebuild the bootstrap sample from the unique indices and their counts
  sVals <- as.numeric(names(table(s)))
  sCounts <- as.vector(table(s))
  fakeD <- rep(DRtg[sVals], sCounts)
  fakeO <- rep(ORtg[sVals], sCounts)
  # Store the correlation for this bootstrap sample
  corResults[i] <- cor(fakeD, fakeO)
}
# Graph the correlations
ci_r <- quantile(corResults, c(.025, .975))
hist(corResults, col = "blue", main = "Bootstrapped Correlations", xlab = "Sample Correlation", breaks = 50)
abline(v = ci_r, lwd = 3, col = "red")
abline(v = cor.test(DRtg, ORtg)$conf.int,lwd = 3, col = "green", lty = 2)
The red lines show the 95% confidence interval from the bootstrap, while the green dotted lines show the 95% confidence interval from the correlation test. In this case, the bootstrapped confidence interval also doesn't include 0, so it too finds a statistically significant correlation between offensive and defensive rating. The bootstrap is generally more flexible because it makes fewer distributional assumptions, and as computers get more and more powerful, resampling tests like this are becoming more and more common.
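To make the comparison concrete, we can print both intervals side by side (a small sketch using the objects computed above).
# Bootstrap percentile interval vs. normal-theory interval
round(ci_r, 3)
round(cor.test(ORtg, DRtg)$conf.int, 3)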
Another feature to investigate is the pace of the game, which is roughly how fast possessions change. People have long argued that the game has sped up since 2000. It’s easy to check with a t-test on pace, which is an estimate of the number of possessions per 48 minutes.
pace_2000 <- results[results$Year == 2000, "Pace"]
pace_2019 <- results[results$Year == 2019, "Pace"]
(mean(pace_2019) - mean(pace_2000))
## [1] 6.947011
A difference of almost 7 possessions per game looks substantial, since a single possession can last up to 24 seconds and the entire game is only 48 minutes. We can do a t-test to find out whether the difference is statistically significant. First, we can take a look at the distribution of the pace in both years.
hist(pace_2000,
main="Histogram of Team Paces in 2000",
xlab="Pace for Teams",
breaks=10,
col="red")
hist(pace_2019,
main="Histogram of Team Paces in 2019",
xlab="Pace for Teams",
breaks=10,
col="red")
With a sizeable sample of roughly 30 teams per season, it seems like the assumptions for the t-test can generally be met (the Central Limit Theorem may work its magic a little). The distributions are generally symmetric and somewhat clustered around their centers. We then run the t-test.
t.test(pace_2019, pace_2000)
##
## Welch Two Sample t-test
##
## data: pace_2019 and pace_2000
## t = 11.69, df = 55.569, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.756352 8.137671
## sample estimates:
## mean of x mean of y
## 100.03667 93.08966
The t-test result has a p-value of essentially 0, which strongly suggests that the pace of the game has sped up from 2000 to 2019. We could also run a non-parametric Kruskal-Wallis test, which makes no normality assumption about the underlying distribution of the data.
kruskal.test(list(pace_2000, pace_2019))
##
## Kruskal-Wallis rank sum test
##
## data: list(pace_2000, pace_2019)
## Kruskal-Wallis chi-squared = 40.567, df = 1, p-value = 1.9e-10
In this case as well, we can safely conclude that there is strong evidence that the pace has increased from 2000 to 2019.
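To visualize the same trend across every season in the dataset (a sketch that just illustrates what the tests above quantified), we can plot the league-average pace by year.
# Average pace by season across the whole 2000-2019 window
pace_by_year <- aggregate(Pace ~ Year, data=results, FUN=mean)
plot(pace_by_year$Year, pace_by_year$Pace, type="b", pch=19, col="blue",
     xlab="Season", ylab="Average Pace", main="League-Average Pace by Season")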
Next, we turn to predicting wins from team-level advanced statistics. We can start off with a simple backwards stepwise regression: beginning with all of the factors, we remove the least significant one at each step until only significant factors remain. The process is done manually, and we end up with a far simpler model than the original full set of factors.
# Create a temporary dataframe of predictors for the multiple regression, dropping identifier and redundant columns
results1 <- results[, c(-1,-2,-4, -5, -6, -7, -13, -26, -28, -29)]
results1 <- results1[, c(2:19, 1)]
# Original model with all predictors
lm1 <- lm(W ~ MOV + SOS + ORtg + DRtg + Pace + FTr + X3PAr + TS. + eFG. + TOV. + ORB. + FT.FGA + eFG..1 + TOV..1 + DRB. + FT.FGA.1 + Attend. + Age, data=results1)
summary(lm1)
##
## Call:
## lm(formula = W ~ MOV + SOS + ORtg + DRtg + Pace + FTr + X3PAr +
## TS. + eFG. + TOV. + ORB. + FT.FGA + eFG..1 + TOV..1 + DRB. +
## FT.FGA.1 + Attend. + Age, data = results1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8814 -2.0088 0.1278 2.2419 7.1765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.171e+02 4.956e+01 -2.362 0.01849 *
## MOV 1.539e+00 8.045e-01 1.913 0.05623 .
## SOS -1.313e-02 3.805e-01 -0.035 0.97248
## ORtg 9.985e-01 1.032e+00 0.967 0.33387
## DRtg 5.542e-01 9.252e-01 0.599 0.54941
## Pace 4.224e-02 5.714e-02 0.739 0.46010
## FTr 1.144e+02 5.890e+01 1.941 0.05270 .
## X3PAr -4.793e+00 5.023e+00 -0.954 0.34033
## TS. 4.494e+02 2.863e+02 1.570 0.11701
## eFG. -4.123e+02 2.399e+02 -1.719 0.08622 .
## TOV. 1.548e-02 1.045e+00 0.015 0.98819
## ORB. -8.725e-02 3.991e-01 -0.219 0.82701
## FT.FGA -2.210e+02 1.216e+02 -1.817 0.06968 .
## eFG..1 -2.179e+02 7.590e+01 -2.871 0.00425 **
## TOV..1 1.752e+00 6.785e-01 2.582 0.01006 *
## DRB. 6.001e-01 2.473e-01 2.426 0.01557 *
## FT.FGA.1 -3.948e+01 1.578e+01 -2.502 0.01262 *
## Attend. 1.128e-05 1.668e-06 6.760 3.40e-11 ***
## Age 4.129e-01 9.714e-02 4.250 2.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.226 on 576 degrees of freedom
## Multiple R-squared: 0.9347, Adjusted R-squared: 0.9327
## F-statistic: 458.3 on 18 and 576 DF, p-value: < 2.2e-16
# After backwards stepwise regression
backwards <- lm(W ~ MOV + ORtg + eFG..1 + TOV..1 + DRB. + FT.FGA.1 + Attend. + Age, data=results1)
summary(backwards)
##
## Call:
## lm(formula = W ~ MOV + ORtg + eFG..1 + TOV..1 + DRB. + FT.FGA.1 +
## Attend. + Age, data = results1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3756 -2.0226 0.1157 2.2785 7.3638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.922e+01 2.670e+01 -2.593 0.009759 **
## MOV 1.163e+00 3.803e-01 3.059 0.002322 **
## ORtg 1.265e+00 3.493e-01 3.622 0.000317 ***
## eFG..1 -1.866e+02 5.571e+01 -3.349 0.000863 ***
## TOV..1 1.492e+00 4.997e-01 2.986 0.002943 **
## DRB. 4.773e-01 1.725e-01 2.767 0.005831 **
## FT.FGA.1 -3.121e+01 1.158e+01 -2.695 0.007235 **
## Attend. 1.135e-05 1.656e-06 6.855 1.81e-11 ***
## Age 4.187e-01 9.059e-02 4.622 4.69e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.22 on 586 degrees of freedom
## Multiple R-squared: 0.9339, Adjusted R-squared: 0.933
## F-statistic: 1034 on 8 and 586 DF, p-value: < 2.2e-16
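As an aside, the manual elimination could also be automated with step(), which drops terms by AIC rather than by p-values, so it may settle on a slightly different model. A minimal sketch using the full model lm1 from above:
# Automated backwards elimination by AIC
auto_back <- step(lm1, direction="backward", trace=0)
summary(auto_back)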
The backwards stepwise regression gives a model with 8 predictors and an R^2 of about 0.93, which is quite high. We can try best subsets regression to see whether it finds a better model. For this, we go up to a maximum model size of 12 predictors.
library(leaps)
# Create a temporary dataframe of predictors for best subsets regression, dropping identifier and redundant columns
results1 <- results[, c(-1,-2,-4, -5, -6, -7, -13, -26, -28, -29)]
results1 <- results1[, c(2:19, 1)]
# Get best subsets results - max 12 variables
mod1 <- regsubsets(W ~ ., data=results1, nvmax=12)
(mod1sum <- summary(mod1))
## Subset selection object
## 19 Variables (and intercept)
## Forced in Forced out
## MOV FALSE FALSE
## SOS FALSE FALSE
## SRS FALSE FALSE
## ORtg FALSE FALSE
## DRtg FALSE FALSE
## Pace FALSE FALSE
## FTr FALSE FALSE
## X3PAr FALSE FALSE
## TS. FALSE FALSE
## eFG. FALSE FALSE
## TOV. FALSE FALSE
## ORB. FALSE FALSE
## FT.FGA FALSE FALSE
## eFG..1 FALSE FALSE
## TOV..1 FALSE FALSE
## DRB. FALSE FALSE
## FT.FGA.1 FALSE FALSE
## Attend. FALSE FALSE
## Age FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
## MOV SOS SRS ORtg DRtg Pace FTr X3PAr TS. eFG. TOV. ORB. FT.FGA
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " "*" " " " " " " "*" " " "*" "*" " "
## 8 ( 1 ) "*" " " " " "*" " " " " " " " " " " " " " " " " " "
## 9 ( 1 ) "*" " " " " "*" " " " " "*" " " " " " " " " " " " "
## 10 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " " " " " " "
## 11 ( 1 ) "*" " " " " "*" " " "*" "*" "*" " " " " " " " " " "
## 12 ( 1 ) "*" " " " " "*" " " " " "*" " " "*" "*" " " " " "*"
## eFG..1 TOV..1 DRB. FT.FGA.1 Attend. Age
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " "*" "*"
## 4 ( 1 ) " " " " " " " " "*" "*"
## 5 ( 1 ) "*" " " " " " " "*" "*"
## 6 ( 1 ) "*" " " " " " " "*" "*"
## 7 ( 1 ) " " " " " " " " "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*"
Now we can determine which one of these is the best model with respect to measures including the Adjusted R-Squared and Bayesian information criterion (BIC).
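A quick way to compare the candidate subset sizes (a sketch using the regsubsets summary from above) is to plot both criteria against the number of predictors.
# Adjusted R^2 and BIC for each subset size
par(mfrow=c(1, 2))
plot(mod1sum$adjr2, type="b", xlab="Number of Predictors", ylab="Adjusted R-squared")
plot(mod1sum$bic, type="b", xlab="Number of Predictors", ylab="BIC")
par(mfrow=c(1, 1))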
First, with the Adjusted R-Squared.
#Fit this model and show results
AR <- results1[,mod1sum$which[which.max(mod1sum$adjr2),][-1]]
summary(lm(W ~ .,data=AR))
##
## Call:
## lm(formula = W ~ ., data = AR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4455 -1.9822 0.0797 2.2874 7.0068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.048e+02 3.164e+01 -3.312 0.000984 ***
## MOV 1.105e+00 3.823e-01 2.891 0.003987 **
## ORtg 1.325e+00 3.608e-01 3.672 0.000262 ***
## FTr 1.140e+02 5.746e+01 1.984 0.047757 *
## TS. 4.898e+02 2.668e+02 1.836 0.066909 .
## eFG. -4.370e+02 2.367e+02 -1.846 0.065376 .
## FT.FGA -2.256e+02 1.193e+02 -1.892 0.059040 .
## eFG..1 -1.936e+02 5.589e+01 -3.463 0.000572 ***
## TOV..1 1.551e+00 5.017e-01 3.091 0.002091 **
## DRB. 5.185e-01 1.759e-01 2.947 0.003332 **
## FT.FGA.1 -3.514e+01 1.189e+01 -2.955 0.003255 **
## Attend. 1.125e-05 1.656e-06 6.794 2.69e-11 ***
## Age 4.164e-01 9.159e-02 4.546 6.65e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.216 on 582 degrees of freedom
## Multiple R-squared: 0.9344, Adjusted R-squared: 0.9331
## F-statistic: 691.4 on 12 and 582 DF, p-value: < 2.2e-16
Even though the adjusted R-squared penalizes the number of predictors in the model, it still chose the largest subset of regressors considered, which in this case was 12. We can try the BIC instead and see if it selects a model that is smaller and easier to interpret.
#Fit the model with the best bic
AR <- results1[, mod1sum$which[which.min(mod1sum$bic),][-1]]
mod2 <- lm(W ~ ., data=AR)
summary(mod2)
##
## Call:
## lm(formula = W ~ ., data = AR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8061 -2.0611 0.0566 2.3441 7.2249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.166e+01 2.487e+00 8.709 < 2e-16 ***
## MOV 2.473e+00 3.571e-02 69.252 < 2e-16 ***
## Attend. 1.183e-05 1.603e-06 7.379 5.43e-13 ***
## Age 3.946e-01 8.891e-02 4.438 1.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.255 on 591 degrees of freedom
## Multiple R-squared: 0.9318, Adjusted R-squared: 0.9315
## F-statistic: 2692 on 3 and 591 DF, p-value: < 2.2e-16
The three variables that remain have p-values of almost 0, and the model accounts for 93% of the variance of wins around the mean. The three variables themselves are interesting.

* Attendance can be seen as a proxy for the size of the market that the team plays in. It has long been a trend that top free agents head toward the largest markets, as they increase their marketability there. This regression seems to confirm such a bias.
* Age being positive is also unsurprising. It is rare for young teams to perform well, as most of the players are inexperienced (save for a few exceptions like OKC 7-8 years ago). Teams usually need some sort of veteran presence (33+ years old) to perform well, and most championship teams have had their fair share of older players.
* MOV stands for margin of victory, which is the average amount a team beat its opponents by over a season. This is a good general measure of how strong a team is; it implicitly captures other variables like offensive rating, defensive rating, true shooting percentage, turnover percentage, and so on. It's not surprising at all that this statistic made the cut.
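As a quick sanity check on MOV's importance, a one-variable regression (just a sketch) shows how much of the variance in wins it explains on its own; the least-angle regression later in this report makes the same point.
# How much does margin of victory explain by itself?
summary(lm(W ~ MOV, data=results))$r.squared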
If we didn't care as much about inference, we could use ridge regression and the lasso to regularize the coefficients so that the model generalizes better to new data. We could then see how the model does on the 2019-2020 season. I will only use some of the possible factors for ridge regression - the code is more or less a proof of concept.
# Used for ridge regression
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
# Generate a list of hyperparameters for L2 regularization
lambdas <- 10^seq(3, -2, by = -.1)
# cv.glmnet() finds the best lambda through checking all of the values we provided via cross-validation
fit <- cv.glmnet(as.matrix(results1[, c(1, 4, 10, 15, 16, 17, 18, 19)]), W, alpha=0, lambda=lambdas)
# This gives a collection of all of the fit models
summary(fit$glmnet.fit)
## Length Class Mode
## a0 51 -none- numeric
## beta 408 dgCMatrix S4
## df 51 -none- numeric
## dim 2 -none- numeric
## lambda 51 -none- numeric
## dev.ratio 51 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
# Get the optimal lambda chosen by cross-validation, then compute R^2
(optimal <- fit$lambda.min)
## [1] 0.01
y <- predict(fit, s=optimal, newx=as.matrix(results1[, c(1, 4, 10, 15, 16, 17, 18, 19)]))
sst <- sum((y - mean(y))^2)
sse <- sum((W - y)^2)
(rsq <- 1 - sse / sst)
## [1] 0.9276295
# Get the coefficients
(coeffs <- coef(fit, s=optimal))
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.336907e+01
## MOV 2.416514e+00
## ORtg 1.362855e-01
## eFG. -4.375503e+00
## TOV..1 -8.462039e-02
## DRB. -5.611834e-02
## FT.FGA.1 2.464881e+00
## Attend. 1.137122e-05
## Age 4.331551e-01
In this case, it seems that the R^2 value on the training data has not really changed, mostly because the optimal regularization lambda was so close to 0. In other contexts with a larger lambda, more shrinkage of the parameters would probably decrease R^2 somewhat. Looking at the coefficients, since the optimal lambda is so low, they aren't much smaller than those from the backwards stepwise regression (the two models used nearly the same set of factors).
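If we had the 2019-20 advanced statistics in the same format, prediction would be a short exercise with predict(); the sketch below assumes a hypothetical file "2020_data.csv" formatted like the Basketball-Reference files used above.
# Hypothetical sketch: "2020_data.csv" is assumed to exist and is not part of this analysis
new_season <- read.csv("2020_data.csv")
new_season <- head(new_season, -1)                    # drop the league-average row
new_season$Team <- gsub("\\*$", "", new_season$Team)
# The columns must appear in the same order as the training matrix
new_x <- as.matrix(new_season[, c("MOV", "ORtg", "eFG.", "TOV..1",
                                  "DRB.", "FT.FGA.1", "Attend.", "Age")])
predict(fit, s=optimal, newx=new_x)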
Now, we will try running the lasso, which uses L1 regularization. With the lasso, it's possible for certain betas to be shrunk exactly to zero, essentially removing factors from the model. Most of the code is extremely similar to the ridge code above, except that we set alpha=1, which corresponds to the L1 penalty in glmnet().
# Generate a list of hyperparameters for L1 regularization
lambdas <- 10^seq(3, -2, by = -.1)
# cv.glmnet() finds the best lambda through checking all of the values we provided via cross-validation
fit <- cv.glmnet(as.matrix(results1), W, alpha=1, lambda=lambdas)
# This gives a collection of all of the fit models
summary(fit$glmnet.fit)
## Length Class Mode
## a0 51 -none- numeric
## beta 969 dgCMatrix S4
## df 51 -none- numeric
## dim 2 -none- numeric
## lambda 51 -none- numeric
## dev.ratio 51 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
# Get the optimal lambda chosen by cross-validation, then compute R^2
(optimal <- fit$lambda.min)
## [1] 0.1
y <- predict(fit, s=optimal, newx=as.matrix(results1))
sst <- sum((y - mean(y))^2)
sse <- sum((W - y)^2)
(rsq <- 1 - sse / sst)
## [1] 0.9264147
# Get the coefficients
(coeffs <- coef(fit, s=optimal))
## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.135528e+01
## MOV 2.377203e+00
## SOS .
## SRS .
## ORtg 1.232150e-01
## DRtg .
## Pace .
## FTr 2.249367e+00
## X3PAr .
## TS. .
## eFG. .
## TOV. .
## ORB. .
## FT.FGA 3.955437e-01
## eFG..1 -5.281736e+00
## TOV..1 .
## DRB. .
## FT.FGA.1 .
## Attend. 1.063569e-05
## Age 3.917975e-01
We can see that in this case, the lasso selected the following factors: MOV, ORtg, FTr, FT.FGA, eFG..1, Attend., and Age. Recall that the best model from best subsets regression (by BIC) contained MOV, Attend., and Age; all three were also selected by the lasso. For pure predictive power, this lasso model would probably be the one used on test data.
Another regression method extremely similar to the lasso is least-angle regression. The algorithm begins with all betas set to 0; at each stage, the coefficient of the predictor most correlated with the current residuals is gradually increased, until all predictors are included in the model.
library(lars)
## Loaded lars 1.2
lars_res <- lars(as.matrix(results1), W, type="lar")
summary(lars_res)
## LARS/LAR
## Call: lars(x = as.matrix(results1), y = W, type = "lar")
## Df Rss Cp
## 0 1 91845 8241.038
## 1 2 8839 259.142
## 2 3 7674 149.089
## 3 4 6715 58.840
## 4 5 6658 55.356
## 5 6 6271 20.201
## 6 7 6227 17.968
## 7 8 6216 18.914
## 8 9 6182 17.572
## 9 10 6181 19.465
## 10 11 6163 19.795
## 11 12 6161 21.584
## 12 13 6134 20.983
## 13 14 6133 22.901
## 14 15 6126 24.182
## 15 16 6032 17.198
## 16 17 6031 19.106
## 17 18 6014 19.437
## 18 19 6007 20.764
## 19 20 5978 20.000
(lars_res$R2)
## 0 1 2 3 4 5 6
## 0.0000000 0.9037651 0.9164494 0.9268919 0.9275127 0.9317186 0.9321977
## 7 8 9 10 11 12 13
## 0.9323171 0.9326954 0.9327075 0.9328965 0.9329203 0.9332147 0.9332241
## 14 15 16 17 18 19
## 0.9333055 0.9343225 0.9343329 0.9345218 0.9345980 0.9349109
(lars_res$beta)
## MOV SOS SRS ORtg DRtg Pace
## 0 0.000000 0.00000 0.000000000 0.00000000 0.0000000 0.00000000
## 1 2.275677 0.00000 0.000000000 0.00000000 0.0000000 0.00000000
## 2 2.351397 0.00000 0.000000000 0.00000000 0.0000000 0.00000000
## 3 2.404304 0.00000 0.000000000 0.00000000 0.0000000 0.00000000
## 4 2.399787 0.00000 0.009247839 0.00000000 0.0000000 0.00000000
## 5 2.413614 0.00000 -0.004190222 0.06839784 0.0000000 0.00000000
## 6 2.414713 0.00000 -0.007938758 0.08146705 0.0000000 0.00000000
## 7 2.415251 0.00000 -0.009762722 0.08662582 0.0000000 0.00000000
## 8 2.395381 0.00000 -0.037981331 0.14540661 0.0000000 0.00000000
## 9 2.395481 0.00000 -0.039484628 0.14749358 0.0000000 0.00000000
## 10 2.389438 0.00000 -0.056245003 0.17629131 0.0000000 0.00000000
## 11 2.387099 0.00000 -0.056394315 0.17471652 0.0000000 0.00000000
## 12 2.376129 0.00000 -0.077669007 0.16589039 0.0000000 0.02616903
## 13 2.375870 0.00000 -0.079062086 0.16583187 0.0000000 0.02718247
## 14 2.353046 0.00000 -0.086048143 0.18159998 0.0000000 0.03180036
## 15 1.383168 0.00000 -0.059110004 1.04366158 0.0000000 0.04357883
## 16 1.351380 0.00000 -0.057675533 1.04834657 0.0000000 0.04392255
## 17 1.291465 0.00000 -0.048594252 1.11351596 0.0000000 0.04389211
## 18 1.343398 0.00000 -0.043144554 1.09138788 0.1074586 0.04359463
## 19 31.313163 29.78207 -29.827505144 1.06715022 0.5252641 0.04122264
## FTr X3PAr TS. eFG. TOV. ORB.
## 0 0.0000000 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 1 0.0000000 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 2 0.0000000 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 3 0.0000000 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 4 0.0000000 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 5 0.0000000 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 6 0.0000000 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 7 0.9515168 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 8 3.2692198 0.00000000 0.0000000 0.0000 0.00000000 0.000000000
## 9 3.3600133 -0.03623664 0.0000000 0.0000 0.00000000 0.000000000
## 10 5.1203236 -0.54800829 0.0000000 0.0000 -0.03279755 0.000000000
## 11 5.4262191 -0.65263144 0.8974498 0.0000 -0.04210829 0.000000000
## 12 8.8150794 -2.28664468 9.3195131 0.0000 -0.14430750 0.000000000
## 13 8.9570875 -2.35789597 9.6516710 0.0000 -0.14811193 0.000000000
## 14 9.7588550 -2.57992011 12.0266465 0.0000 -0.17451506 0.000000000
## 15 13.9756141 -4.00446530 13.3126754 0.0000 -0.20285408 0.000000000
## 16 13.9669954 -3.94222052 17.2159712 0.0000 -0.23587049 0.012472295
## 17 44.5076530 -4.00703369 158.0899318 -127.8779 -0.21043451 0.003418569
## 18 58.0584613 -4.16131821 214.5824400 -183.0619 -0.16637821 -0.014209118
## 19 116.4416227 -5.10042342 455.4635569 -420.1211 0.03803125 -0.096340893
## FT.FGA eFG..1 TOV..1 DRB. FT.FGA.1 Attend.
## 0 0.000000 0.000000 0.00000000 0.0000000 0.00000000 0.000000e+00
## 1 0.000000 0.000000 0.00000000 0.0000000 0.00000000 0.000000e+00
## 2 0.000000 0.000000 0.00000000 0.0000000 0.00000000 3.695926e-06
## 3 0.000000 0.000000 0.00000000 0.0000000 0.00000000 7.223517e-06
## 4 0.000000 0.000000 0.00000000 0.0000000 0.00000000 7.523463e-06
## 5 0.000000 0.000000 0.00000000 0.0000000 0.00000000 9.859434e-06
## 6 2.214310 0.000000 0.00000000 0.0000000 0.00000000 1.029147e-05
## 7 1.650658 0.000000 0.00000000 0.0000000 0.00000000 1.041917e-05
## 8 -1.031976 -8.244214 0.00000000 0.0000000 0.00000000 1.068495e-05
## 9 -1.167341 -8.456929 0.00000000 0.0000000 0.00000000 1.069502e-05
## 10 -3.234818 -12.152363 0.00000000 0.0000000 0.00000000 1.084348e-05
## 11 -3.599360 -12.601009 0.00000000 0.0000000 0.00000000 1.085321e-05
## 12 -7.414763 -19.145940 0.00000000 0.0000000 0.00000000 1.099257e-05
## 13 -7.554053 -19.450482 0.00000000 0.0000000 -0.08021262 1.100023e-05
## 14 -8.393962 -24.019157 0.04027553 0.0000000 -1.33386034 1.110116e-05
## 15 -12.585509 -161.700501 1.26792055 0.4235655 -28.41060547 1.124560e-05
## 16 -12.523441 -166.061112 1.30704400 0.4365487 -29.26208083 1.125285e-05
## 17 -76.529161 -173.807009 1.37052499 0.4603023 -30.78545398 1.124041e-05
## 18 -104.551798 -182.352345 1.44444345 0.4873861 -32.47276771 1.124753e-05
## 19 -225.510445 -221.184212 1.80217773 0.6106848 -40.79306874 1.131474e-05
## Age
## 0 0.0000000
## 1 0.0000000
## 2 0.0000000
## 3 0.1712203
## 4 0.1857816
## 5 0.3428698
## 6 0.3704174
## 7 0.3793758
## 8 0.3965817
## 9 0.3971469
## 10 0.4048872
## 11 0.4046308
## 12 0.4115422
## 13 0.4117050
## 14 0.4103095
## 15 0.4169288
## 16 0.4171108
## 17 0.4151222
## 18 0.4147042
## 19 0.4146079
## attr(,"scaled:scale")
## [1] 1.099136e+02 9.552171e+00 1.070545e+02 9.235252e+01 8.814477e+01
## [6] 8.361948e+01 9.358254e-01 1.711328e+00 5.164501e-01 5.718057e-01
## [11] 2.565571e+01 7.581284e+01 6.922957e-01 5.316770e-01 2.678100e+01
## [16] 6.509472e+01 7.011162e-01 2.251848e+06 4.245210e+01
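The coefficient paths can also be plotted directly, which is often easier to read than the raw beta matrix above (a small sketch using the fitted lars object).
# Each curve traces one predictor's coefficient as the LAR algorithm proceeds
plot(lars_res)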
From the R^2 values and betas, we see that MOV is a very strong predictor of wins (as it should be). The lasso and best subsets only told us which combinations of variables worked well, with MOV, Age, and Attendance featuring prominently; here we can see that MOV by itself immediately explains about 90% of the variation in wins.
For another way to visualize the data, we can first do a Principal Components Analysis (PCA). In essence, PCA applies a linear transformation to the data, forming an orthogonal basis in which each successive principal component captures as much of the remaining variance in the data as possible. For physics people, think of the principal axis theorem. To do so, we drop the non-numerical data and other redundant columns.
pca_df <- results[, c(-1, -2, -5, -6, -7, -8, -26, -28, -29)]
pca <- prcomp(pca_df, center=TRUE, scale=TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.538 2.227 1.4244 1.06346 0.97783 0.94579 0.89123
## Proportion of Variance 0.322 0.248 0.1014 0.05655 0.04781 0.04473 0.03971
## Cumulative Proportion 0.322 0.570 0.6714 0.72798 0.77579 0.82051 0.86023
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.84012 0.71768 0.69726 0.62684 0.57550 0.50243
## Proportion of Variance 0.03529 0.02575 0.02431 0.01965 0.01656 0.01262
## Cumulative Proportion 0.89552 0.92127 0.94558 0.96522 0.98178 0.99441
## PC14 PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.23334 0.22229 0.07668 0.03244 0.02931 0.01469
## Proportion of Variance 0.00272 0.00247 0.00029 0.00005 0.00004 0.00001
## Cumulative Proportion 0.99713 0.99960 0.99989 0.99995 0.99999 1.00000
## PC20
## Standard deviation 1.696e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
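To get a sense of what the first two components represent, we can inspect their loadings (a small sketch).
# Large absolute loadings show which statistics drive each component
round(pca$rotation[, 1:2], 2)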
We can plot the two largest principal components and see where the points land.
library(devtools)
## Loading required package: usethis
install_github("vqv/ggbiplot")
## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
## Use `force = TRUE` to force installation
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
ggbiplot(pca, ellipse=TRUE, group=results[, "Team"])
It's difficult to see much on the biplot, but recall from the summary above that approximately 6-7 principal components cover over 80% of the variance.
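A scree plot (a quick sketch using the prcomp object) makes that drop-off in variance easier to see than the biplot.
# Variance explained by each principal component
screeplot(pca, npcs=20, type="lines", main="Scree Plot of Team-Season PCA")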
Using the principal components, we can do principal components regression, where we use some of the principal components as regressors. In this way, we can achieve dimensionality reduction, avoid multicollinearity, and (theoretically) reduce overfitting.
require(pls)
## Loading required package: pls
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr_model <- pcr(W ~ ., data=results1, scale=TRUE, validation="CV")
summary(pcr_model)
## Data: X dimension: 595 19
## Y dimension: 595 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 12.45 11.13 3.737 3.721 3.683 3.464 3.452
## adjCV 12.45 11.12 3.734 3.719 3.680 3.451 3.449
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 3.412 3.261 3.258 3.263 3.272 3.272 3.276
## adjCV 3.409 3.257 3.253 3.259 3.268 3.268 3.272
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 3.280 3.281 3.277 3.282 3.283 3.281
## adjCV 3.276 3.276 3.272 3.277 3.278 3.275
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.22 55.12 65.79 71.73 76.72 81.43 85.60 89.29
## W 20.63 91.05 91.20 91.48 92.56 92.57 92.77 93.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 92.00 94.56 96.62 98.37 99.69 99.96 99.99
## W 93.35 93.35 93.36 93.38 93.39 93.39 93.41
## 16 comps 17 comps 18 comps 19 comps
## X 100.00 100.00 100.00 100.00
## W 93.44 93.44 93.47 93.49
# Plot cross validation MSE and R^2
validationplot(pcr_model, val.type="MSEP")
validationplot(pcr_model, val.type="R2")
In this case, we can see that only about 3 principal components are needed to explain most of the variance in wins, and that adding more components gains little at the expense of a far more complex model.
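To make that concrete (a sketch using the fitted pcr_model), we can take the fitted values using only the first 3 components and compute an in-sample R^2 for comparison with the models above.
# In-sample R^2 using only the first 3 principal components
pred3 <- drop(predict(pcr_model, ncomp=3))
1 - sum((W - pred3)^2) / sum((W - mean(W))^2)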
We could also try to run some non-parametric regressions to maximize our predictive power. In this case, we will take a look at spline methods. We will just use one predictor, Age, which was one of the most significant predictors in the best subsets regression. We first fit a smoothing spline, which means that we won't need to select the locations of the knots as we would with a regression spline such as a cubic spline.
library(splines)
spline_res <- smooth.spline(Age, W, cv=TRUE)
## Warning in smooth.spline(Age, W, cv = TRUE): cross-validation with non-
## unique 'x' values seems doubtful
spline_res
## Call:
## smooth.spline(x = Age, y = W, cv = TRUE)
##
## Smoothing Parameter spar= 0.8294247 lambda= 0.02881927 (12 iterations)
## Equivalent Degrees of Freedom (Df): 4.732098
## Penalized Criterion (RSS): 8141.522
## PRESS(l.o.o. CV): 111.8497
# CV selected lambda of about 0.029 with roughly 4.7 effective degrees of freedom; plot the fit
plot(Age, W, col="grey")
lines(spline_res, lwd=2, col="purple")
legend("topright", ("Smoothing Splines with 6.78 df selected by CV"), col="purple", lwd=2)
We see that when regressing wins on Age, the spline finds a non-linear fit that tapers off at the end. This could be a little better than a simple linear regression, but the improvement in this case shouldn't be much.
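For contrast, a cubic regression spline requires choosing knot locations by hand; the sketch below picks knots at ages 25 and 28 purely for illustration.
# Cubic regression spline with manually chosen knots (ages 25 and 28 are arbitrary choices)
cubic_fit <- lm(W ~ bs(Age, knots=c(25, 28), degree=3))
age_grid <- seq(min(Age), max(Age), length.out=100)
plot(Age, W, col="grey")
lines(age_grid, predict(cubic_fit, newdata=data.frame(Age=age_grid)), lwd=2, col="darkgreen")
legend("topright", ("Cubic Spline with Knots at Ages 25 and 28"), col="darkgreen", lwd=2)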