Chapter 7 Model Selection
- Model selection (subset selection, feature selection, variable selection) in the linear regression model
\[ Y=\beta_0+\beta_1X_1 +\cdots +\beta_pX_p +\epsilon \]
typically refers to selecting, from the \(p\) explanatory variables, the subset that best predicts and captures the variability in \(Y\).
Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance.
Hence we discuss some ways in which the simple linear model can be improved, by replacing ordinary least squares fitting with some alternative fitting procedures.
Why consider alternatives to least squares?
- Prediction accuracy: especially when \(p>n\), to control the variance.
- Model interpretability: By removing irrelevant features - that is, by setting the corresponding coefficient estimates to zero - we can obtain a model that is more easily interpreted.
- We will present some approaches for automatically performing feature selection.
Three classes of methods
- Subset selection
- We identify a subset of the \(p\) predictors that we believe to be related to the response.
- We then fit a model using least squares on the reduced set of variables.
- Shrinkage
- We fit a model involving all \(p\) predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates.
- This shrinkage (also known as regularization or penalization) has the effect of reducing variance and can also perform variable selection.
- Dimension reduction
- We project the \(p\) predictors into a \(M\)-dimensional subspace, where \(M<p\).
- This is achieved by computing \(M\) different linear combinations, or projections, of the variables.
- Then these \(M\) projections are used as predictors to fit a linear regression model by least squares.
7.1 Variable Selection
- Best subset and stepwise model selection procedures.
7.1.1 Best Subset Selection
- Let \(M_0\) denote the null model, which contains no predictors.
- This model simply predicts the sample mean for each observation.
- For \(k=1,2,\ldots,p\):
(a) Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors.
(b) Pick the best among these \(\binom{p}{k}\) models, and call it \(M_k\).
- Here best is defined as having the smallest RSS, or equivalently largest \(R^2\).
- Select a single best model from among \(M_0,\ldots, M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).
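A minimal brute-force sketch of this algorithm follows, for illustration only (later we use the leaps package). It assumes a data frame dat whose first column is the response; both dat and best_subset are hypothetical names.
## Brute-force sketch of best subset selection (illustration only).
best_subset <- function(dat) {
  yname <- names(dat)[1]
  xs    <- names(dat)[-1]
  best  <- list()                                  # best model M_k for each size k
  for (k in seq_along(xs)) {
    combos <- combn(xs, k, simplify = FALSE)       # all choose(p, k) subsets of size k
    fits   <- lapply(combos, function(v) lm(reformulate(v, response = yname), data = dat))
    rss    <- sapply(fits, function(f) sum(resid(f)^2))
    best[[k]] <- fits[[which.min(rss)]]            # smallest RSS defines M_k
  }
  best                                             # compare M_1, ..., M_p by CV, Cp, BIC, ...
}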

- For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and \(R^2\) are displayed.
- The red frontier tracks the best model for a given number of predictors, according to RSS and \(R^2\).
- Though the data set contains only ten predictors, the \(x\)-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
- Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression.
- The deviance - negative two times the maximized log-likelihood - plays the role of RSS for a broader class of models.
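For instance, for Bernoulli (0/1) data fitted with glm() the deviance equals \(-2\) times the maximized log-likelihood; a small sketch using the built-in mtcars data (the model itself is purely illustrative):
## Sketch: deviance versus -2 * log-likelihood for a logistic regression.
glm.fit <- glm(am ~ wt, data = mtcars, family = binomial)
deviance(glm.fit)                 # residual deviance
-2 * as.numeric(logLik(glm.fit))  # the same value for 0/1 (Bernoulli) data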
Stepwise selection
For computational reasons, best subset selection cannot be applied with very large \(p\). Why not? It requires fitting all \(2^p\) possible models.
Best subset selection may also suffer from statistical problems when \(p\) is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data.
Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.
For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
7.1.2 Forward Stepwise Selection
Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.
- Let \(M_0\) denote the null model, which contains no predictors.
- For \(k=0,\ldots,p-1\):
(a) Consider all \(p-k\) models that augment the predictors in \(M_k\) with one additional predictor.
(b) Choose the best among these \(p-k\) models, and call it \(M_{k+1}\).
- Here best is defined as having smallest RSS or highest \(R^2\).
- Select a single best model from among \(M_0, \ldots, M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).
The computational advantage over best subset selection is clear: forward stepwise searches through only \(1+p(p+1)/2\) models instead of all \(2^p\).
It is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the \(p\) predictors.
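In R, forward stepwise selection can be carried out with regsubsets() from the leaps package by setting method = "forward"; a brief sketch using the Hitters data introduced in Section 7.1.5:
library(leaps)
library(ISLR2)
Hitters <- na.omit(Hitters)                      # drop rows with missing values
fit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(fit.fwd)$which[1:4, ]                    # variables included in M_1, ..., M_4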

- The first four selected models for best subset selection and forward stepwise selection on the Credit data set.
- The first three models are identical but the fourth models differ.
7.1.3 Backward Stepwise Selection
Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection.
However, unlike forward stepwise selection, it begins with the full least squares model containing all \(p\) predictors, and then iteratively removes the least useful predictor, one-at-a-time.
- Let \(M_p\) denote the full model, which contains all \(p\) predictors.
- For \(k=p, p-1, \ldots, 1\):
(a) Consider all \(k\) models that contain all but one of the predictors in \(M_k\), for a total of \(k-1\) predictors.
(b) Choose the best among these \(k\) models, and call it \(M_{k-1}\).
- Here best is defined as having smallest RSS or highest \(R^2\).
- Select a single best model from among \(M_0,\ldots,M_p\) using cross-validated prediction error, \(C_p\) (AIC) , BIC, or adjusted \(R^2\).
Like forward stepwise selection, the backward selection approach searches through only \(1+p(p+1)/2\) models, and so can be applied in settings where \(p\) is too large to apply best subset selection.
Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the \(p\) predictors.
Backward selection requires that the number of samples \(n\) is larger than the number of variables \(p\) (so that the full model can be fit).
- In contrast, forward stepwise can be used even when \(n<p\), and so is the only viable subset method when \(p\) is very large.
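Analogously, a sketch of backward stepwise selection with regsubsets(), again on the Hitters data:
library(leaps)
library(ISLR2)
Hitters <- na.omit(Hitters)
fit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(fit.bwd)$which[1:4, ]                    # compare with forward stepwise and best subset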
7.1.4 Choosing the Optimal Model
The model containing all of the predictors will always have the smallest RSS and the largest \(R^2\), since these quantities are related to the training error.
We wish to choose a model with low test error, not a model with low training error.
- Recall that training error is usually a poor estimate of test error.
Therefore, RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors.
Estimating test error: two approaches
We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
We can directly estimate the test error, using either a validation set approach or a cross-validation approach.
\(C_p\), AIC, BIC, and Adjusted \(R^2\)
These techniques adjust the training error for the model size and can be used to select among a set of models with different numbers of variables.
The next figure displays \(C_p\), BIC and adjusted \(R^2\) for the best model of each size produced by best subset selection on the Credit data set.

- Mallow’s \(C_p\):
\[ C_p = \frac{1}{n}(RSS+2d\hat{\sigma}^2) \]
where \(d\) is the total number of parameters used and \(\hat{\sigma}^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement.
- The AIC criterion is defined for a large class of models fit by maximum likelihood:
\[ AIC=-2\log L+2d \]
where \(L\) is the maximized value of the likelihood function for the estimated model.
In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and \(C_p\) and AIC are equivalent.
BIC:
\[ BIC=\frac{1}{n}(RSS+\log(n)\,d\hat{\sigma}^2) \]
Like \(C_p\), the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.
Notice that BIC replaces the \(2d\hat{\sigma}^2\) used by \(C_p\) with a \(\log(n)d\hat{\sigma}^2\) term, where \(n\) is the number of observations.
Since \(\log n>2\) for any \(n>7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\).
For a least squares model with \(d\) variables, the adjusted \(R^2\) statistic is calculated as
\[ \text{Adjusted } R^2 =1-\frac{RSS/(n-d-1)}{TSS/(n-1)} \]
where TSS is the total sum of squares.
Unlike \(C_p\), AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted \(R^2\) indicates a model with a small test error.
Maximizing the adjusted \(R^2\) is equivalent to minimizing \(\frac{RSS}{n-d-1}\).
- While RSS always decreases as the number of variables in the model increases, \(\frac{RSS}{n-d-1}\) may increase or decrease, due to the presence of \(d\) in the denominator.
Unlike the \(R^2\) statistic, the adjusted \(R^2\) statistic pays a price for the inclusion of unnecessary variables in the model.
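As a sketch, the adjusted criteria above can be computed by hand from a fitted lm model. The helper below is illustrative (it is not from any package), with \(d\) counted as the number of predictors and \(\hat{\sigma}^2\) supplied by the user, e.g. from the full model; full.fit is a hypothetical name.
## Illustrative helper: Cp, BIC (in the RSS form above), and adjusted R^2 for an lm fit.
## sigma2.hat is an estimate of Var(eps), e.g. sigma2.hat <- summary(full.fit)$sigma^2.
adj.criteria <- function(fit, sigma2.hat) {
  y   <- model.response(model.frame(fit))
  n   <- length(y)
  d   <- length(coef(fit)) - 1              # number of predictors (intercept excluded)
  rss <- sum(resid(fit)^2)
  tss <- sum((y - mean(y))^2)
  c(Cp    = (rss + 2 * d * sigma2.hat) / n,
    BIC   = (rss + log(n) * d * sigma2.hat) / n,
    adjR2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1)))
}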
Validation and Cross-validation
Each of the procedures returns a sequence of models \(M_k\) indexed by model size \(k=0,1,2,\ldots\).
- Our job here is to select \(\hat{k}\).
- Once selected, we will return model \(M_{\hat{k}}\)
We compute the validation set error or the cross-validation error for each model \(M_k\) under consideration, and then select the \(k\) for which the resulting estimated test error is smallest.
This procedure has an advantage relative to AIC, BIC, \(C_p\), and adjusted \(R^2\), in that it provides a direct estimate of the test error, and doesn’t require an estimate of the error variance \(\sigma^2\).
It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance \(\sigma^2\).

The validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set.
The cross-validation errors were computed using \(k=10\) folds.
- In this case, the validation and cross-validation methods both result in a six-variable model.
However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.
In this setting, we can select a model using the one-standard-error rule.
- We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
- What is the rationale for this? If a set of models appears to be more or less equally good, we might as well choose the simplest one, i.e., the model with the fewest predictors.
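A sketch of the one-standard-error rule, assuming (hypothetical) vectors cv.err and cv.se that hold the estimated test MSE and its standard error for each model size:
## One-standard-error rule (sketch): cv.err[k] and cv.se[k] are the CV-estimated
## test MSE and its standard error for the model with k predictors.
one.se.choice <- function(cv.err, cv.se) {
  k.min  <- which.min(cv.err)               # lowest point on the estimated test error curve
  cutoff <- cv.err[k.min] + cv.se[k.min]    # one standard error above the minimum
  min(which(cv.err <= cutoff))              # smallest model within one SE of the minimum
}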
7.1.5 Example
Consider the Hitters data set in the ISLR2 package which contains data on baseball players.
We predict a player’s Salary on the basis of available background data.
library(glmnet) # package for ridge and lasso (and elastic net, i.e., part lasso part ridge)
library(leaps)
library(ISLR2)
names(Hitters) # variables in the data set## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
#help(Hitters) # more detailed description of the data set including variable descriptions
head(Hitters) # first few rows of the Hitters data## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson 293 66 1 30 29 14 1 293 66 1
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson 30 29 14 A E 446 33 20
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## Salary NewLeague
## -Andy Allanson NA A
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
dim(Hitters) # nobs n vars## [1] 322 20
str(Hitters) # structure (shows types of variables)## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
Consider first selecting the best subset with respect to a given criterion.
Earlier we utilized the car package; here we utilize the leaps package.
Function regsubsets(), which is part of the package, can be used to identify the best subset in terms of the residual sum of squares, RSS.
summary(Hitters) # basic summary statistics## AtBat Hits HmRun Runs
## Min. : 16.0 Min. : 1 Min. : 0.00 Min. : 0.00
## 1st Qu.:255.2 1st Qu.: 64 1st Qu.: 4.00 1st Qu.: 30.25
## Median :379.5 Median : 96 Median : 8.00 Median : 48.00
## Mean :380.9 Mean :101 Mean :10.77 Mean : 50.91
## 3rd Qu.:512.0 3rd Qu.:137 3rd Qu.:16.00 3rd Qu.: 69.00
## Max. :687.0 Max. :238 Max. :40.00 Max. :130.00
##
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 28.00 1st Qu.: 22.00 1st Qu.: 4.000 1st Qu.: 816.8
## Median : 44.00 Median : 35.00 Median : 6.000 Median : 1928.0
## Mean : 48.03 Mean : 38.74 Mean : 7.444 Mean : 2648.7
## 3rd Qu.: 64.75 3rd Qu.: 53.00 3rd Qu.:11.000 3rd Qu.: 3924.2
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
##
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 209.0 1st Qu.: 14.00 1st Qu.: 100.2 1st Qu.: 88.75
## Median : 508.0 Median : 37.50 Median : 247.0 Median : 220.50
## Mean : 717.6 Mean : 69.49 Mean : 358.8 Mean : 330.12
## 3rd Qu.:1059.2 3rd Qu.: 90.00 3rd Qu.: 526.2 3rd Qu.: 426.25
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.00
##
## CWalks League Division PutOuts Assists
## Min. : 0.00 A:175 E:157 Min. : 0.0 Min. : 0.0
## 1st Qu.: 67.25 N:147 W:165 1st Qu.: 109.2 1st Qu.: 7.0
## Median : 170.50 Median : 212.0 Median : 39.5
## Mean : 260.24 Mean : 288.9 Mean :106.9
## 3rd Qu.: 339.25 3rd Qu.: 325.0 3rd Qu.:166.0
## Max. :1566.00 Max. :1378.0 Max. :492.0
##
## Errors Salary NewLeague
## Min. : 0.00 Min. : 67.5 A:176
## 1st Qu.: 3.00 1st Qu.: 190.0 N:146
## Median : 6.00 Median : 425.0
## Mean : 8.04 Mean : 535.9
## 3rd Qu.:11.00 3rd Qu.: 750.0
## Max. :32.00 Max. :2460.0
## NA's :59
summary(lm(Salary ~ ., data = Hitters)) # full regression with all variables##
## Call:
## lm(formula = Salary ~ ., data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -907.62 -178.35 -31.11 139.09 1877.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.10359 90.77854 1.797 0.073622 .
## AtBat -1.97987 0.63398 -3.123 0.002008 **
## Hits 7.50077 2.37753 3.155 0.001808 **
## HmRun 4.33088 6.20145 0.698 0.485616
## Runs -2.37621 2.98076 -0.797 0.426122
## RBI -1.04496 2.60088 -0.402 0.688204
## Walks 6.23129 1.82850 3.408 0.000766 ***
## Years -3.48905 12.41219 -0.281 0.778874
## CAtBat -0.17134 0.13524 -1.267 0.206380
## CHits 0.13399 0.67455 0.199 0.842713
## CHmRun -0.17286 1.61724 -0.107 0.914967
## CRuns 1.45430 0.75046 1.938 0.053795 .
## CRBI 0.80771 0.69262 1.166 0.244691
## CWalks -0.81157 0.32808 -2.474 0.014057 *
## LeagueN 62.59942 79.26140 0.790 0.430424
## DivisionW -116.84925 40.36695 -2.895 0.004141 **
## PutOuts 0.28189 0.07744 3.640 0.000333 ***
## Assists 0.37107 0.22120 1.678 0.094723 .
## Errors -3.36076 4.39163 -0.765 0.444857
## NewLeagueN -24.76233 79.00263 -0.313 0.754218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 315.6 on 243 degrees of freedom
## (59 observations deleted due to missingness)
## Multiple R-squared: 0.5461, Adjusted R-squared: 0.5106
## F-statistic: 15.39 on 19 and 243 DF, p-value: < 2.2e-16
- Significant \(t\)-values suggest that only AtBat, Hits, Walks, CWalks, Division, and PutOuts (and possibly CRuns and Assists, which are significant at the 10% level) have explanatory power.
fit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19) # all subsets
(sm.full <- summary(fit.full)) # with smallest RSS(k), k = 1, ..., p, indicate variables included## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
- Asterisk indicates inclusion of a variable in a model.
names(sm.full)## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
- Printing, for example, the \(R^2\) values shows the highest \(R^2\) attained by the best combination of \(k\) explanatory variables, \(k=1,\ldots,p\) (\(p=19\)).
round(sm.full$rsq, digits = 3) # show R-squares (in 3 decimals)## [1] 0.321 0.425 0.451 0.475 0.491 0.509 0.514 0.529 0.535 0.540 0.543 0.544
## [13] 0.544 0.545 0.545 0.546 0.546 0.546 0.546
Thus, for example the highest \(R^2\) with one explanatory variable is \(32.1\%\) when CRBI is in the regression.
Plotting \(R^2\), adjusted \(R^2\), \(C_p\) (here equivalent to AIC), and BIC for all subset sizes can be used to decide the final model.
par(mfrow = c(2, 2))
plot(x = 1:19, y = sm.full$rsq, type = "l", col = "steel blue", main = "R-squares",
xlab = "N of Variables", ylab = "R-squared")
plot(x = 1:19, y = sm.full$adjr2, type = "l", col = "steel blue", main = "Adjusted R-squares",
xlab = "N of Variables", ylab = "Adjusted R-squared")
(k.best <- which.max(sm.full$adjr2)) # model with best adj R-square## [1] 11
points(k.best, sm.full$adjr2[k.best], col = "red", cex = 2, pch = 20) # show the maximum
plot(x = 1:19, y = sm.full$cp, type = "l", col = "steel blue", main = "Cp",
xlab = "N of Variables", ylab = "Cp")
(k.best <- which.min(sm.full$cp)) # model with the smallest Cp## [1] 10
points(k.best, sm.full$cp[k.best], col = "red", cex = 2, pch = 20) # show the minimum
plot(x = 1:19, y = sm.full$bic, type = "l", col = "steel blue", main = "BIC",
xlab = "N of Variables", ylab = "BIC")
(k.best <- which.min(sm.full$bic)) # model with the smallest BIC## [1] 6
points(k.best, sm.full$bic[k.best], col = "red", cex = 2, pch = 20)
- For example the six variables selected by BIC and the corresponding fitted model are:
names(coef(fit.full, id = 6, vcov = TRUE))## [1] "(Intercept)" "AtBat" "Hits" "Walks" "CRBI"
## [6] "DivisionW" "PutOuts"
summary(fit6 <- lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters))##
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CRBI + Division +
## PutOuts, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -873.11 -181.72 -25.91 141.77 2040.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.51180 65.00006 1.408 0.160382
## AtBat -1.86859 0.52742 -3.543 0.000470 ***
## Hits 7.60440 1.66254 4.574 7.46e-06 ***
## Walks 3.69765 1.21036 3.055 0.002488 **
## CRBI 0.64302 0.06443 9.979 < 2e-16 ***
## DivisionW -122.95153 39.82029 -3.088 0.002239 **
## PutOuts 0.26431 0.07477 3.535 0.000484 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 319.9 on 256 degrees of freedom
## (59 observations deleted due to missingness)
## Multiple R-squared: 0.5087, Adjusted R-squared: 0.4972
## F-statistic: 44.18 on 6 and 256 DF, p-value: < 2.2e-16
There are differences compared with the variables that were significant in the full model (e.g., CRBI, which was not significant there, is now included, while the significant CWalks is not).
In particular, if our objective is to use the model for prediction, we can use a validation set to identify the best predictors.
First split the sample into a training set and a validation (test) set, find the best explanatory variable subsets of sizes \(1,2,\ldots,p\) using the training set, and then locate the subset size with the minimum test set MSE.
sum(is.na(Hitters$Salary)) # number of missing## [1] 59
Hitters <- na.omit(Hitters) # drop all rows with missing values
sum(is.na(Hitters))## [1] 0
set.seed(1) # initialize random seed for exact replication
## a random vector of TRUEs and FALSEs with length equaling the rows in Hitters
## and about one half of the values are TRUE
train <- sample(x = c(TRUE, FALSE), size = nrow(Hitters), replace = TRUE) #
## in train about one half are TRUE values and one half FALSE values
mean(train) # proportion of TRUEs## [1] 0.5095057
## Hitter[train, ] pick lines with train equaling TRUE, which we used for estimation
test <- !train # complement set to identify the test set
mean(test) # fraction of observations in the test set## [1] 0.4904943
fit.best <- regsubsets(Salary ~ ., Hitters[train, ], nvmax = 19) # best fitting models
## next compute test set MSE for each best fitting explanatory variable subset
## for the purpose we first generate the test set into an appropriate format
## with model.matrix() function (see help(model.matrix))
test.mat <- model.matrix(Salary ~ ., data = Hitters[test, ]) # generate model matrix
head(test.mat) # a few first lines## (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat
## -Alvin Davis 1 479 130 18 66 72 76 3 1624
## -Alfredo Griffin 1 594 169 4 74 51 35 11 4408
## -Andre Thornton 1 401 92 17 49 66 65 13 5206
## -Alan Trammell 1 574 159 21 107 75 59 10 4631
## -Buddy Biancalana 1 190 46 2 24 8 15 5 479
## -Bruce Bochy 1 127 32 8 16 22 14 8 727
## CHits CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
## -Alvin Davis 457 63 224 266 263 0 1 880
## -Alfredo Griffin 1133 19 501 336 194 0 1 282
## -Andre Thornton 1332 253 784 890 866 0 0 0
## -Alan Trammell 1300 90 702 504 488 0 0 238
## -Buddy Biancalana 102 5 65 23 39 0 1 102
## -Bruce Bochy 180 24 67 82 56 1 1 202
## Assists Errors NewLeagueN
## -Alvin Davis 82 14 0
## -Alfredo Griffin 421 25 0
## -Andre Thornton 0 0 0
## -Alan Trammell 445 22 0
## -Buddy Biancalana 177 16 0
## -Bruce Bochy 22 2 1
- The model.matrix() function generates a constant vector for the intercept and transforms factor variables into 0/1 dummy vectors, the variable name indicating which class is coded as 1.
test.mse <- double(19) # vector of length 19 for validation set MSEs
for (i in 1:length(test.mse)) {
betai <- coef(object = fit.best, id = i) # extract coefficients of the model with k x-vars
pred.salary <- test.mat[, names(betai)] %*% betai # pred y = X beta
test.mse[i] <- mean((Hitters$Salary[test] - pred.salary)^2)
} # end for
test.mse # print results## [1] 164377.3 144405.5 152175.7 145198.4 137902.1 139175.7 126849.0 136191.4
## [9] 132889.6 135434.9 136963.3 140694.9 140690.9 141951.2 141508.2 142164.4
## [17] 141767.4 142339.6 142238.2
which.min(test.mse) # find the minimum## [1] 7
coef(fit.best, id = 7) # slope coefficients of the best fitting model## (Intercept) AtBat Hits Walks CRuns CWalks
## 67.1085369 -2.1462987 7.0149547 8.0716640 1.2425113 -0.8337844
## DivisionW PutOuts
## -118.4364998 0.2526925
coef(fit.full, id = 7) # slope coefficients of the best set of 7 variables from the full data set## (Intercept) Hits Walks CAtBat CHits CHmRun
## 79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
## DivisionW PutOuts
## -129.9866432 0.2366813
Thus, the best model according to the validation set MSE is the one with 7 predictors.
These 7 predictors are all included in the best 10-predictor model from the full data, which is the model the \(C_p\) criterion selected earlier.
The results, however, can differ for different training and test sets; with a different random seed, a slightly different set of predictors may be selected.
Remark
- The practice is that the size \(k\) of the best set of predictors (here \(k=7\)) is selected on the basis of the validation approach, while the final best \(k\) predictors are selected from the full sample and the corresponding regression is estimated (again from the full sample).
- Thus, predictors in the final model may differ from those of the validation best predictors; only the number is the same (above, the best 7-predictor model from the full data indeed contains a different set of variables than the best 7 predictors found from the training data).
Cross-validation approach
In the same manner as in the validation set approach, we can identify the size of the best set of predictors on the basis of cross-validation.
We demonstrate here the \(k\)-fold CV with \(k=10\) (here \(k\) refers to the number of folds, not the number of predictors).
First create a vector that indicates which of the 10 folds each observation belongs to.
## k-fold cross-validation approach
set.seed(1) # for exact replication
k <- 10 # n of folds
folds <- sample(x = 1:k, size = nrow(Hitters), replace = TRUE) # randomly formed k groups
head(folds)## [1] 9 4 7 1 2 7
str(fit.best) # structure of regsubsets object## List of 28
## $ np : int 20
## $ nrbar : int 190
## $ d : num [1:20] 1.34e+02 8.80e+06 3.32e+01 8.89e+03 8.80e+02 ...
## $ rbar : num [1:190] 265.94 0.478 12.067 7.44 341.619 ...
## $ thetab : num [1:20] 567.316 0.985 33.726 6.76 5.944 ...
## $ first : int 2
## $ last : int 20
## $ vorder : int [1:20] 1 14 20 4 8 13 16 9 15 6 ...
## $ tol : num [1:20] 5.79e-09 5.71e-06 9.85e-09 1.47e-07 8.81e-08 ...
## $ rss : num [1:20] 26702538 18174675 18136864 17730750 17699645 ...
## $ bound : num [1:20] 26702538 15786734 12857104 11254676 10720451 ...
## $ nvmax : int 20
## $ ress : num [1:20, 1] 26702538 15786734 12857104 11254676 10720451 ...
## $ ir : int 20
## $ nbest : int 1
## $ lopt : int [1:210, 1] 1 1 12 1 12 3 1 7 10 9 ...
## $ il : int 210
## $ ier : int 0
## $ xnames : chr [1:20] "(Intercept)" "AtBat" "Hits" "HmRun" ...
## $ method : chr "exhaustive"
## $ force.in : Named logi [1:20] TRUE FALSE FALSE FALSE FALSE FALSE ...
## ..- attr(*, "names")= chr [1:20] "" "AtBat" "Hits" "HmRun" ...
## $ force.out: Named logi [1:20] FALSE FALSE FALSE FALSE FALSE FALSE ...
## ..- attr(*, "names")= chr [1:20] "" "AtBat" "Hits" "HmRun" ...
## $ sserr : num 8924127
## $ intercept: logi TRUE
## $ lindep : logi [1:20] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ nullrss : num 26702538
## $ nn : int 134
## $ call : language regsubsets.formula(Salary ~ ., Hitters[train, ], nvmax = 19)
## - attr(*, "class")= chr "regsubsets"
- Thus, here the first observation falls into fold 9, the second into 4, etc.
- The following function produces predictions.
predict.regsubsets <- function(obj, # object produced by regsubset()
newdata, # out of sample data
id, # id of the model
... # potential additional arguments if needed
) { # Source: James et al. (2013) ISL
if (!inherits(obj, "regsubsets")) stop("obj must be produced by the regsubsets() function!")
fmla <- as.formula(obj$call[[2]]) # extract formula from the obj object
beta <- coef(object = obj, id = id) # coefficients corresponding to model id
xmat <- model.matrix(fmla, newdata) # data matrix for prediction computations
return(xmat[, names(beta)] %*% beta) # return predictions
} # predict.regsubsets
- Next we loop over the \(k\) folds and, within each fold, over the best models of each size to compute the MSEs.
cv.mse <- matrix(nrow = k, ncol = 19, dimnames = list(1:k, 1:19)) # matrix to store MSE-values
## for loops to produce MSEs
for (i in 1:k) { # over validation sets
best.fits <- regsubsets(Salary ~ ., data = Hitters[folds != i, ], nvmax = 19)
y <- Hitters$Salary[folds == i] # new y values
for (j in 1:19) { # MSEs over n of predictors
ypred <- predict.regsubsets(best.fits, newdata = Hitters[folds == i, ], id = j) # predictions
cv.mse[i, j] <- mean((y - ypred)^2) # store MSEs into cv.mse matrix
} # for j
}
- Mean CV MSEs for the \(j\)-predictor models, \(j=1,\ldots,19\):
(mean.cv.mse <- apply(cv.mse, 2, mean)) # mean mse values, surrounding with parentheses prints the results## 1 2 3 4 5 6 7 8
## 149821.1 130922.0 139127.0 131028.8 131050.2 119538.6 124286.1 113580.0
## 9 10 11 12 13 14 15 16
## 115556.5 112216.7 113251.2 115755.9 117820.8 119481.2 120121.6 120074.3
## 17 18 19
## 120084.8 120085.8 120403.5
- Models with 8, 10 and 11 predictors are close to each other; however, the 10-predictor model is slightly better than the other two, as also shown by the following figure, so a model with 10 predictors would be our choice.
par(mfrow = c(1, 1))
plot(x = 1:19, y = mean.cv.mse, type = "b", col = "red", xlab = "N of Predictors", ylab = "MSE",
main = "Best 10-fold Cross-Validation MSEs\nfor Different Number of Predictors")
## dev.print(pdf, "../lectures/figures/ex51b.pdf")
coef(regsubsets(Salary ~ ., data = Hitters, nvmax = 19), id = 10)## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
7.2 Shrinkage Methods
- Ridge regression and Lasso
- The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
- As an alternative, we can fit a model containing all \(p\) predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
- It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
7.2.1 Ridge Regression
- Recall that the least squares fitting procedure estimates \(\beta_0,\beta_1,\ldots, \beta_p\) using the values that minimize
\[ RSS=\sum_{i=1}^n (y_i -\beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2 \]
- In contrast, the ridge regression coefficient estimates \(\hat{\beta}^R\) are the values that minimize
\[ \sum_{i=1}^n (y_i -\beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2+\lambda \sum_{j=1}^p \beta_j^2 = RSS +\lambda\sum_{j=1}^p \beta_j^2 \]
where \(\lambda\ge 0\) is a tuning parameter, to be determined separately.
As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
However, the second term, \(\lambda \sum_j\beta_j^2\), called a shrinkage penalty, is small when \(\beta_1, \ldots, \beta_p\) are close to zero, and so it has the effect of shrinking the estimates of \(\beta_j\) towards zero.
The tuning parameter \(\lambda\) serves to control the relative impact of these two terms on the regression coefficient estimates.
Selecting a good value for \(\lambda\) is critical; cross-validation is used for this.
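In R, ridge regression is available through glmnet() with alpha = 0; the following sketch applies it to the Hitters data, mirroring the lasso example of Section 7.2.3 (the \(\lambda\) grid is illustrative):
library(ISLR2)
library(glmnet)
Hitters <- na.omit(Hitters)                      # glmnet() does not accept missing values
x <- model.matrix(Salary ~ ., Hitters)[, -1]     # predictor matrix (intercept column dropped)
y <- Hitters$Salary
grid <- 10^seq(10, -2, length = 100)             # illustrative lambda grid
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 gives ridge regression
cv.ridge <- cv.glmnet(x, y, alpha = 0)           # 10-fold CV (default) to choose lambda
cv.ridge$lambda.min                              # lambda with the smallest CV error
drop(coef(ridge.mod, s = cv.ridge$lambda.min))   # ridge coefficients at that lambda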

In the left-hand panel, each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of \(\lambda\)
The right-hand panel displays the same ridge coefficient estimates as the left-hand panel, but instead of displaying \(\lambda\) on the \(x\)-axis, we now display \(||\hat{\beta}_{\lambda}^R||_2/||\hat{\beta}||_2\), where \(\hat{\beta}\) denotes the vector of least squares coefficient estimates.
The notation \(||\beta||_2\) denotes the \(l_2\) norm (pronounced “ell 2”) of a vector, and is defined as \(||\beta||_2=\sqrt{\sum_{j=1}^p \beta_j^2}\).
Scaling of predictors
The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant \(c\) simply leads to a scaling of the least squares coefficient estimates by a factor of \(1/c\).
- In other words, regardless of how the \(j\)th predictor is scaled, \(X_j\hat{\beta}_j\) will remain the same.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
\[ \tilde{x}_{ij}=\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij}-\bar{x}_j)^2}} \]
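In practice this is rarely done by hand: glmnet(), used below, standardizes the predictors internally by default (standardize = TRUE) and reports the coefficients on the original scale. A manual sketch of the formula above:
## Manual standardization sketch: divide each predictor by its standard deviation
## (1/n version, as in the formula above); glmnet() does this internally by default.
standardize <- function(X) {
  sds <- apply(X, 2, function(x) sqrt(mean((x - mean(x))^2)))
  sweep(X, 2, sds, "/")
}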
The Bias-variance tradeoff

- Simulated data with \(n=50\) observations, \(p=45\) predictors, all having nonzero coefficients.
- Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of \(\lambda\) and \(||\hat{\beta}_{\lambda}^R||_2/||\hat{\beta}||_2\).
- The horizontal dashed lines indicate the minimum possible MSE.
- The purple crosses indicate the ridge regression models for which the MSE is smallest.
7.2.2 Lasso Regression
- Ridge regression does have one obvious disadvantage:
- Unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all \(p\) predictors in the final model.
- The LASSO (least absolute shrinkage and selection operator) is a relatively recent alternative to ridge regression that overcomes this disadvantage.
- The lasso coefficients, \(\hat{\beta}_{\lambda}^L\), minimize the quantity
\[ \sum_{i=1}^n (y_i -\beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2+\lambda \sum_{j=1}^p |\beta_j| = RSS +\lambda\sum_{j=1}^p |\beta_j| \]
In statistical parlance, the lasso uses an \(l_1\) (pronounced “ell 1”) penalty instead of an \(l_2\) penalty.
- The \(l_1\) norm of a coefficient vector \(\beta\) is given by \(||\beta||_1=\sum_j|\beta_j|\)
As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
However, in the case of the lasso, the \(l_1\) penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large.
Hence, much like best subset selection, the lasso performs variable selection.
We say that the lasso yields sparse models - that is, models that involve only a subset of the variables.
As in ridge regression, selecting a good value of \(\lambda\) for the lasso is critical; cross-validation is again the method of choice.

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
One can show that the lasso and ridge regression coefficient estimates solve the problems
\[ \underset{\beta}{\text{minimize}}\sum_{i=1}^n (y_i -\beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j|\le s \]
and
\[ \underset{\beta}{\text{minimize}}\sum_{i=1}^n (y_i -\beta_0 - \sum_{j=1}^p \beta_j x_{ij} )^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2\le s \]
respectively.

Comparing the Lasso and Ridge regression

- Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on simulated data set.
- Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed).
- Both are plotted against their \(R^2\) on the training data, as a common form of indexing.
- The crosses in both plots indicate the lasso model for which the MSE is smallest.

- Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso.
- The simulated data is similar to that in the previous figure, except that now only two predictors are related to the response.
- Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed).
- Both are plotted against their \(R^2\) on the training data, as a common form of indexing.
- The crosses in both plots indicate the lasso model for which the MSE is smallest
Conclusions
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.
In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
However, the number of predictors that is related to the response is never known a priori for real data sets.
A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
Selecting the tuning parameter for ridge regression and lasso
As for subset selection, for ridge regression and lasso we require a method to determine which of the models under consideration is best.
That is, we require a method for selecting a value of the tuning parameter \(\lambda\) or, equivalently, of the constraint \(s\).
Cross-validation provides a simple way to tackle this problem.
- We choose a grid of \(\lambda\) values, and compute the cross-validation error rate for each value of \(\lambda\).
We then select the tuning parameter value for which the cross-validation error is smallest.
Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.

Left: Cross-validation errors that result from applying ridge regression to the Credit data set with various values of \(\lambda\).
Right: The coefficient estimates as a function of \(\lambda\).
- The vertical dashed lines indicate the value of \(\lambda\) selected by cross-validation.

Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from prior.
Right: The corresponding lasso coefficient estimates are displayed.
- The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
Remark
- A combination of ridge regression and the lasso, called elastic-net regularization, estimates the regression coefficients \(\beta_0, \beta_1, \beta_2, \ldots, \beta_p\) by minimizing
\[ \sum_{i=1}^n (y_i - \beta_0-x_i^T\beta)^2 + \lambda P_{\alpha}(\beta) \]
where \(x_i=(x_{i1},\ldots,x_{ip})^T\), \(\beta=(\beta_1,\ldots,\beta_p)^T\), and
\[ P_{\alpha}(\beta)=\sum_{j=1}^p (\frac{1}{2}(1-\alpha)\beta_j^2+\alpha|\beta_j|) \]
- Thus, \(\alpha=0\) implies the ridge regression and \(\alpha=1\) the lasso.
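In glmnet() the mixing parameter \(\alpha\) is exactly the alpha argument; for example alpha = 0.5 gives an equal ridge/lasso mix. A sketch, assuming the x and y objects constructed in the example of Section 7.2.3 below; note that alpha is fixed by the user, while cv.glmnet() cross-validates only over \(\lambda\):
## Elastic net sketch: alpha between 0 (ridge) and 1 (lasso); here an equal mix.
enet.mod <- glmnet(x, y, alpha = 0.5)        # glmnet chooses its own lambda sequence
cv.enet  <- cv.glmnet(x, y, alpha = 0.5)     # 10-fold CV over lambda for this alpha
drop(coef(cv.enet, s = "lambda.min"))        # coefficients at the CV-optimal lambda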
Another formulation for ridge regression and lasso
For example, the lasso restriction can be thought of as a budget constraint that limits how large \(\sum_{j=1}^p |\beta_j|\) can be.
If we instead replace the lasso or ridge constraint with
\[ \sum_{j=1}^p I(\beta_j \ne 0) \le s \]
where \(I(\beta_j \ne 0)\) is an indicator function equaling 1 if \(\beta_j \ne 0\) and zero otherwise, then RSS is minimized under the constraint that no more than \(s\) coefficients can be nonzero.
- Thus, the problem becomes a best subset selection problem.
7.2.3 Example
We utilize again the Hitters data set to demonstrate lasso.
R has the package glmnet, in which the main function to perform the lasso, ridge regression, and more generally elastic-net regularization estimation is glmnet().
The glmnet() function does not allow missing values, and R factor variables must first be transformed to 0/1 dummy variables.
library(ISLR2)
library(glmnet) # see library(help = glmnet)
library(help = glmnet) # basic information about the glmnet package
head(Hitters) # a few first lines## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## -Al Newman 185 37 1 23 8 21 2 214 42 1
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## -Al Newman 30 9 24 N E 76 127 7
## Salary NewLeague
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
## -Al Newman 70.0 A
Hitters <- na.omit(Hitters) # remove missing values
The glmnet() function requires its input as an \(X\) matrix (predictors) and a \(Y\) vector (dependent variable); the formula syntax \(y\sim x\) does not work.
The model.matrix() function generates the required \(x\)-matrix and transforms factor variables to dummy variables.
glmnet() performs the lasso when alpha = 1 (which is also the default) and, if no \(\lambda\) values are supplied, automatically selects a grid for \(\lambda\).
Below we instead supply our own grid of \(\lambda\) values.
## inputs matrix x and vector y for glmnet()
x <- model.matrix(Salary ~ ., Hitters)[, -1] # x-variables, drop the constant term vector
y <- Hitters$Salary
head(x)## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## -Al Newman 185 37 1 23 8 21 2 214 42 1
## CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors
## -Alan Ashby 321 414 375 1 1 632 43 10
## -Alvin Davis 224 266 263 0 1 880 82 14
## -Andre Dawson 828 838 354 1 0 200 11 3
## -Andres Galarraga 48 46 33 1 0 805 40 4
## -Alfredo Griffin 501 336 194 0 1 282 421 25
## -Al Newman 30 9 24 1 0 76 127 7
## NewLeagueN
## -Alan Ashby 1
## -Alvin Davis 0
## -Andre Dawson 1
## -Andres Galarraga 1
## -Alfredo Griffin 0
## -Al Newman 0
grid.lambda <- 10^seq(10, -2, length = 100) # grid of length 100 from 10^10 down to 1/100
head(grid.lambda)## [1] 10000000000 7564633276 5722367659 4328761281 3274549163 2477076356
tail(grid.lambda)## [1] 0.04037017 0.03053856 0.02310130 0.01747528 0.01321941 0.01000000
lasso.mod <- glmnet(x = x, y = y, alpha = 1, lambda = grid.lambda) # alpha = 1 performs lasso
## for a range of lambda values
str(lasso.mod) # structure of lasso.mod object produced by glmnet()## List of 12
## $ a0 : Named num [1:100] 536 536 536 536 536 ...
## ..- attr(*, "names")= chr [1:100] "s0" "s1" "s2" "s3" ...
## $ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:491] 11 1 10 11 1 5 10 11 1 5 ...
## .. ..@ p : int [1:101] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ Dim : int [1:2] 19 100
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:19] "AtBat" "Hits" "HmRun" "Runs" ...
## .. .. ..$ : chr [1:100] "s0" "s1" "s2" "s3" ...
## .. ..@ x : num [1:491] 0.0752 0.1023 0.0683 0.1802 0.7216 ...
## .. ..@ factors : list()
## $ df : int [1:100] 0 0 0 0 0 0 0 0 0 0 ...
## $ dim : int [1:2] 19 100
## $ lambda : num [1:100] 1.00e+10 7.56e+09 5.72e+09 4.33e+09 3.27e+09 ...
## $ dev.ratio: num [1:100] 0 0 0 0 0 0 0 0 0 0 ...
## $ nulldev : num 53319113
## $ npasses : int 2283
## $ jerr : int 0
## $ offset : logi FALSE
## $ call : language glmnet(x = x, y = y, alpha = 1, lambda = grid.lambda)
## $ nobs : int 263
## - attr(*, "class")= chr [1:2] "elnet" "glmnet"
- Below are shown the number of values in the \(\lambda\)-grid and some of its values.
length(lasso.mod$lambda) # number of values in the lambda range## [1] 100
c(min = min(lasso.mod$lambda), max = max(lasso.mod$lambda)) # range of lambdas## min max
## 1e-02 1e+10
head(lasso.mod$lambda) # a few first lambdas## [1] 10000000000 7564633276 5722367659 4328761281 3274549163 2477076356
tail(lasso.mod$lambda) # a few last lambdas## [1] 0.04037017 0.03053856 0.02310130 0.01747528 0.01321941 0.01000000
Thus, the regression coefficients are computed for 100 values of \(\lambda\).
The coefficient paths are stored in the beta matrix of the lasso.mod object (\(19\times 100\), without the intercept); coef() returns them together with the intercept as a \(20\times 100\) matrix, as shown below.
dim(coef(lasso.mod)) # dimension of the coefficient matrix (beta)## [1] 20 100
lasso.mod$lambda[70] # the 70th value of lambda## [1] 43.28761
coef(lasso.mod)[, 70] # the corresponding beta estimates## (Intercept) AtBat Hits HmRun Runs RBI
## 74.7044627 0.0000000 1.6432866 0.0000000 0.0000000 0.0000000
## Walks Years CAtBat CHits CHmRun CRuns
## 1.8986354 0.0000000 0.0000000 0.0000000 0.0000000 0.1837056
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.3743196 0.0000000 0.0000000 -55.4369609 0.1519453 0.0000000
## Errors NewLeagueN
## 0.0000000 0.0000000
sum(abs(coef(lasso.mod)[-1, 70])) # the corresponding L1-norm## [1] 59.68885
- The coef() function can be used to produce lasso estimates for any value of \(\lambda\) (the same can be done with the predict() function).
#help(coef.glmnet) # help for coef and predict
drop(coef(lasso.mod, s = 1)) # lasso estimates for any value of lambda (arg s)## (Intercept) AtBat Hits HmRun Runs
## 152.68965461 -1.92453638 6.77405213 1.25572117 -0.99184280
## RBI Walks Years CAtBat CHits
## 0.00000000 5.60436806 -7.67134685 -0.06548857 0.00000000
## CHmRun CRuns CRBI CWalks LeagueN
## 0.23105395 1.12172648 0.56793753 -0.72711075 46.55909441
## DivisionW PutOuts Assists Errors NewLeagueN
## -116.69918432 0.28193985 0.28808499 -2.85400785 -10.14452831
## the same and more can be done with predict()
drop(predict(lasso.mod, type = "coefficients", s = 1)) # see help(predict.glmnet)## (Intercept) AtBat Hits HmRun Runs
## 152.68965461 -1.92453638 6.77405213 1.25572117 -0.99184280
## RBI Walks Years CAtBat CHits
## 0.00000000 5.60436806 -7.67134685 -0.06548857 0.00000000
## CHmRun CRuns CRBI CWalks LeagueN
## 0.23105395 1.12172648 0.56793753 -0.72711075 46.55909441
## DivisionW PutOuts Assists Errors NewLeagueN
## -116.69918432 0.28193985 0.28808499 -2.85400785 -10.14452831
## coefficient plots for different lambda values
par(mfrow = c(1, 2))
plot(lasso.mod, xvar = "lambda", xlab = expression(log(lambda)))
plot(lasso.mod, xvar = "norm", xlab = expression(log(sum(abs(beta[j])))))## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

The coefficients are plotted against \(\log(\lambda)\) in the left panel and against the logarithm of the \(l_1\)-norm of the coefficients in the right panel.
Numbers on the top of the figure indicate the number of non-zero coefficients.
The figure shows that, depending on the choice of \(\lambda\), some of the coefficients will be exactly equal to zero.
Next we use cross-validation to identify the best \(\lambda\) in terms of the MSE.
The glmnet package provides the cv.glmnet() function for this purpose.
set.seed(1) # for replication purposes
train <- sample(1:nrow(x), nrow(x)/2) # random selection of about half of the observations for a training set
test <- -train # test set consists of the observations not in the training set
y.test <- y[test] # test set y-values
lasso.train <- glmnet(x = x[train, ], y = y[train], alpha = 1, lambda = grid.lambda)
## similar plots as for the whole data set above
par(mfrow = c(1, 2))
plot(lasso.train, xvar = "lambda", xlab = expression(log(lambda)))
plot(lasso.train, xvar = "norm", xlab = expression(log(sum(abs(beta[j])))))
## perform cross-validation in the training set to find lambda candidates
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1, lambda = grid.lambda) # by default performs 10-fold CV
par(mfrow = c(1, 1)) # reset full plotting window
plot(cv.out) # plots MSEs against lambdas
## dev.print(pdf, file = "ex52b.pdf") # pdf copy of the graph
(best.lam <- cv.out$lambda.min) # lambda for which CV MSE is at minimum## [1] 8.111308
lasso.pred <- predict(lasso.train, s = best.lam, newx = x[test, ]) # predicted values with coefficients corresponding best.lam
mean((lasso.pred - y.test)^2) # test MSE## [1] 143890.9
## for comparison, compute the test MSE for the OLS model estimated from the training set
ols.train <- lm(Salary ~ ., data = Hitters, subset = train) # OLS estimates from the training data
ols.pred <- predict(ols.train, newdata = Hitters[test, ]) # OLS prediction
mean((ols.pred - y.test)^2) # OLS test MSE## [1] 168593.3
Here lasso outperforms OLS in terms of test MSE.
Below is a scatter plot of test set realized and predicted salaries.
## scatterplots of predicted and realized salaries
par(mfrow = c(1, 1)) # full plot window
plot(x = ols.pred, y = y.test, col = "red", xlim = c(0, 2500), ylim = c(0, 2500), xlab = "Predicted", ylab = "Realized")
abline(lm(y.test ~ ols.pred), col = "red")
points(x = lasso.pred, y = y.test, col = "steel blue")
abline(lm(y.test ~ lasso.pred), col = "steel blue")
abline(a = 0, b = 1, lty = "dashed", col = "gray")
legend("topleft", legend = c("OLS", "Lasso"), col = c("Red", "Steel blue"),
pch = c(1, 1), bty = "n")
7.3 Dimension Reduction Methods
The methods that we have discussed so far in this chapter have involved fitting linear regression models, via least squares or a shrinkage approach, using the original predictors, \(X_1, X_2, \ldots, X_p\).
We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.
- We will refer to these techniques as dimension reduction methods.
Let \(Z_1, Z_2, \ldots, Z_M\) represent \(M<p\) linear combinations of our original \(p\) predictors.
\[ Z_m = \sum_{j=1}^p \phi_{mj}X_j \]
for some constants \(\phi_{m1},\ldots,\phi_{mp}\).
- We can then fit the linear regression model,
\[ y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im}+\epsilon_i, \,\,\, i=1,\ldots,n \]
using ordinary least squares.
- Note that in above model, the regression coefficients are given by \(\theta_0, \theta_1, \ldots, \theta_M\).
- If the constants \(\phi_{m1},\ldots, \phi_{mp}\) are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.
- Notice that from definition above,
\[ \sum_{m=1}^M \theta_m z_{im} = \sum_{m=1}^M \theta_m \sum_{j=1}^p \phi_{mj}x_{ij}=\sum_{m=1}^M \sum_{j=1}^p \theta_m \phi_{mj}x_{ij}=\sum_{j=1}^p \beta_j x_{ij} \]
where
\[ \beta_j=\sum_{m=1}^M \theta_m \phi_{mj} \]
Hence model above can be thought of as a special case of the original linear regression model.
Dimension reduction serves to constrain the estimated \(\beta_j\) coefficients, since now they must take the form \(\beta_j=\sum_{m=1}^M \theta_m \phi_{mj}\) given above.
This constraint introduces some bias but can substantially reduce variance, so we can win in the bias-variance trade-off.
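A small sketch of this identity on the Hitters data, using principal component directions (introduced in the next subsection) as the \(\phi\)'s and recovering the implied \(\beta_j=\sum_m \theta_m\phi_{mj}\); the choice \(M=5\) is illustrative:
## Sketch: regress y on M linear combinations Z = X phi and recover the implied beta's.
library(ISLR2)
Hitters <- na.omit(Hitters)
X <- scale(model.matrix(Salary ~ ., Hitters)[, -1])  # standardized predictors
y <- Hitters$Salary
M <- 5                                               # illustrative choice of M < p
phi <- prcomp(X)$rotation[, 1:M]                     # p x M matrix of loadings phi_mj
Z <- X %*% phi                                       # n x M matrix of components z_im
theta <- coef(lm(y ~ Z))                             # theta_0, theta_1, ..., theta_M
beta.implied <- phi %*% theta[-1]                    # beta_j = sum_m theta_m * phi_mj
head(beta.implied)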
7.3.1 Principal Components Regression
Here we apply principal components analysis (PCA) to define the linear combinations of the predictors, for use in our regression.
The first principal component is that (normalized) linear combination of the variables with the largest variance.
The second principal component has largest variance, subject to being uncorrelated with the first.
And so on.
Hence with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.

- The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles.
- The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.

A subset of the advertising data.
Left: The first principal component, chosen to minimize the sum of the squared perpendicular distances to each point, is shown in green.
- These distances are represented using the black dashed line segments.
Right: The left-hand panel has been rotated so that the first principal component lies on the \(x\)-axis.

- Plots of the first principal component scores \(z_{i1}\) versus pop and ad.
- The relationships are strong.

- Plots of the second principal component scores \(z_{i2}\) versus pop and ad.
- The relationships are weak.
Application to principal components regression

- PCR (principal components regression) was applied to two simulated data set.
- The black, green, and purple lines correspond to squared bias, variance, and test mean squared error, respectively.
- Left and right are from simulated data.

Left: PCR standardized coefficient estimates on the Credit data set for different values of M.
Right: The 10-fold cross validation MSE obtained using PCR, as a function of M.
7.3.2 Example
- Consider again the Hitters data set
library(ISLR2)
library(pls) # includes principal component regression and more
Hitters <- na.omit(Hitters) # remove missing values
#help(pcr)
set.seed(2) # for exact replication of the results
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, # standardize each predictor
validation = "CV") # use CV to select M PCs default 10 fold
#help(mvrCv) # info about internal CV used by pcr
summary(pcr.fit) # summary of the CV results RMSEP is the square root of CV MSE## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 351.9 353.2 355.0 352.8 348.4 343.6
## adjCV 452 351.6 352.7 354.4 352.1 347.6 342.7
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 345.5 347.7 349.6 351.4 352.1 353.5 358.2
## adjCV 344.7 346.7 348.5 350.1 350.7 352.0 356.5
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 349.7 349.4 339.9 341.6 339.2 339.6
## adjCV 348.0 347.7 338.2 339.7 337.2 337.6
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26 94.96
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69 46.75
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 96.28 97.26 97.98 98.65 99.15 99.47 99.75
## Salary 46.86 47.76 47.82 47.85 48.10 50.40 50.55
## 16 comps 17 comps 18 comps 19 comps
## X 99.89 99.97 99.99 100.00
## Salary 53.01 53.85 54.61 54.61
The RMSEP is the square root of the CV MSE; the smallest value occurs with 18 components, but six components already give nearly as small an error, as the plot of the RMSEPs also shows.
Six components explain 88.63% of the total variation among the predictors.
validationplot(pcr.fit, val.type = "RMSEP") # Plot RMSEPs
Next we demonstrate, in terms of test MSE, how PCR performs by using a test set.
First we determine the best number of components using the training set, estimate the corresponding regression, and compute the test MSE.
set.seed(1) # for exact replication
train <- sample(nrow(Hitters), size = nrow(Hitters) / 2) # training set
test <- -train # test set resulted by not including those in the train set
head(train) # examples of observations in the training set## [1] 167 129 187 85 79 213
pcr.train <- pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE,
validation = "CV") # training set
summary(pcr.train) # training results summary## Data: X dimension: 131 19
## Y dimension: 131 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 428.3 331.5 335.2 329.2 332.1 326.4 330.0
## adjCV 428.3 330.9 334.4 327.5 330.9 325.1 328.6
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 334.6 335.3 335.8 337.5 337.7 342.3 340.3
## adjCV 333.0 333.4 333.3 335.0 335.2 339.6 337.6
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 343.2 346.5 344.2 355.3 361.2 352.5
## adjCV 340.3 343.4 340.9 351.4 356.9 348.3
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 39.32 61.57 71.96 80.83 85.95 89.99 93.25 95.34
## Salary 43.87 43.93 47.36 47.37 49.52 49.55 49.63 50.98
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 96.55 97.61 98.28 98.85 99.22 99.53 99.79
## Salary 53.00 53.00 53.02 53.05 53.80 53.85 54.03
## 16 comps 17 comps 18 comps 19 comps
## X 99.91 99.97 99.99 100.00
## Salary 55.85 55.89 56.21 58.62
validationplot(pcr.train, val.type = "RMSEP")
- On the basis of the training-data CV RMSEP results, a small number of components already gives a low error; below we use \(M=7\) principal components.
pcr.pred <- predict(pcr.train, Hitters[test, ], ncomp = 7)
head(pcr.pred) # a few first predictions## , , 7 comps
##
## Salary
## -Alvin Davis 595.5249
## -Andre Dawson 1104.4628
## -Andres Galarraga 454.3308
## -Alfredo Griffin 564.6986
## -Al Newman 160.0433
## -Argenis Salazar 10.8481
mean((pcr.pred - Hitters$Salary[test])^2) # test MSE## [1] 140751.3
This test set MSE is slightly smaller than that of the lasso.
PCR is useful for prediction purposes.
If interpretation of the model is needed, however, PCR results may be difficult to interpret, since each component is a linear combination of all the original predictors.
Finally, estimating the PCR model with \(M=7\) components from the full data set yields the following results.
summary(pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, ncomp = 7)) # estimate and summarize## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69
7.3.3 Partial Least Squares
PCR identifies linear combinations, or directions, that best represent the predictors \(X_1, \ldots, X_p\).
These directions are identified in an unsupervised way, since the response \(Y\) is not used to help determine the principal component directions.
That is, the response does not supervise the identification of the principal components.
Consequently, PCR suffers from a potentially serious drawback.
- There is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
Like PCR, PLS is a dimension reduction method, which first identifies a new set of features \(Z_1, \ldots, Z_M\) that are linear combinations of the original features, and then fits a linear model via OLS using these M new features.
But unlike PCR, PLS identifies these new features in a supervised way - that is, it makes use of the response \(Y\) in order to identify new features that not only approximate the old features well, but also that are related to the response.
Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
After standardizing the \(p\) predictors, PLS computes the first direction \(Z_1\) by setting each \(\phi_{1j}\) equal to the coefficient from the simple linear regression of \(Y\) onto \(X_j\).
One can show that this coefficient is proportional to the correlation between \(Y\) and \(X_j\).
Hence, in computing \(Z_1=\sum_{j=1}^p \phi_{1j}X_j\), PLS places the highest weight on the variables that are most strongly related to the response.
Subsequent directions are found by taking residuals and then repeating the above prescription.
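A sketch of the first PLS direction computed by hand on the Hitters data: each \(\phi_{1j}\) is the simple regression slope of \(Y\) on the standardized \(X_j\), which is exactly proportional to the correlation between \(Y\) and \(X_j\):
## Sketch: first PLS direction on the Hitters data, built from p simple regressions.
library(ISLR2)
Hitters <- na.omit(Hitters)
X <- scale(model.matrix(Salary ~ ., Hitters)[, -1])   # standardized predictors
y <- Hitters$Salary
phi1 <- apply(X, 2, function(xj) coef(lm(y ~ xj))[2]) # slope of y on each standardized x_j
cor(phi1, drop(cor(y, X)))                            # exactly 1: phi1 is proportional to cor(Y, X_j)
Z1 <- drop(X %*% phi1)                                # first PLS direction z_{i1}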
7.3.4 Example
- Continuing the previous examples, we use the plsr() function of the pls package.
library(ISLR2)
library(pls) # includes principal component regression and more
Hitters <- na.omit(Hitters) # remove missing values
set.seed(1) # initialize again the seed
pls.train <- plsr(Salary ~ ., data = Hitters, subset = train, scale = TRUE,
validation = "CV")
summary(pls.train)## Data: X dimension: 131 19
## Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 428.3 325.5 329.9 328.8 339.0 338.9 340.1
## adjCV 428.3 325.0 328.2 327.2 336.6 336.1 336.6
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 339.0 347.1 346.4 343.4 341.5 345.4 356.4
## adjCV 336.2 343.4 342.8 340.2 338.3 341.8 351.1
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 348.4 349.1 350.0 344.2 344.5 345.0
## adjCV 344.2 345.0 345.9 340.4 340.6 341.1
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 39.13 48.80 60.09 75.07 78.58 81.12 88.21 90.71
## Salary 46.36 50.72 52.23 53.03 54.07 54.77 55.05 55.66
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 93.17 96.05 97.08 97.61 97.97 98.70 99.12
## Salary 55.95 56.12 56.47 56.68 57.37 57.76 58.08
## 16 comps 17 comps 18 comps 19 comps
## X 99.61 99.70 99.95 100.00
## Salary 58.17 58.49 58.56 58.62
- A very small number of components already produces the lowest CV RMSEP values; below we use \(M=2\).
validationplot(pls.train, val.type = "RMSEP") # plot of CV RMSEP
##dev.print(pdf, file = "ex54.pdf") # pdf copy
pls.pred <- predict(pls.train, newdata = Hitters[test, ], ncomp = 2)
mean((Hitters$Salary[test] - pls.pred)^2)## [1] 145367.7
- The test MSE is comparable to, but slightly higher than, that of PCR.