Lab: Classification Methods

The Stock Market Data

We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR2 library. This data set consists of percentage returns for the S\&P 500 stock index over $1,250$~days, from the beginning of 2001 until the end of

  1. For each date, we have recorded the percentage returns for each of the five previous trading days, lagone through lagfive. We have also recorded volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and direction (whether the market was Up or Down on this date). Our goal is to predict direction (a qualitative response) using the other features.
In [1]:
library(ISLR2)
names(Smarket)
dim(Smarket)
summary(Smarket)
pairs(Smarket)
  1. 'Year'
  2. 'Lag1'
  3. 'Lag2'
  4. 'Lag3'
  5. 'Lag4'
  6. 'Lag5'
  7. 'Volume'
  8. 'Today'
  9. 'Direction'
  1. 1250
  2. 9
      Year           Lag1             Lag2             Lag3       
 Min.   :2001   Min.   :-4.922   Min.   :-4.922   Min.   :-4.922  
 1st Qu.:2002   1st Qu.:-0.639   1st Qu.:-0.639   1st Qu.:-0.640  
 Median :2003   Median : 0.039   Median : 0.039   Median : 0.038  
 Mean   :2003   Mean   : 0.004   Mean   : 0.004   Mean   : 0.002  
 3rd Qu.:2004   3rd Qu.: 0.597   3rd Qu.: 0.597   3rd Qu.: 0.597  
 Max.   :2005   Max.   : 5.733   Max.   : 5.733   Max.   : 5.733  
      Lag4             Lag5            Volume          Today        Direction 
 Min.   :-4.922   Min.   :-4.922   Min.   :0.356   Min.   :-4.922   Down:602  
 1st Qu.:-0.640   1st Qu.:-0.640   1st Qu.:1.257   1st Qu.:-0.639   Up  :648  
 Median : 0.038   Median : 0.038   Median :1.423   Median : 0.038             
 Mean   : 0.002   Mean   : 0.006   Mean   :1.478   Mean   : 0.003             
 3rd Qu.: 0.597   3rd Qu.: 0.597   3rd Qu.:1.642   3rd Qu.: 0.597             
 Max.   : 5.733   Max.   : 5.733   Max.   :3.152   Max.   : 5.733             

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the direction variable is qualitative.

In [2]:
cor(Smarket)
cor(Smarket[, -9])
Error in cor(Smarket): 'x' must be numeric
Traceback:

1. cor(Smarket)
2. stop("'x' must be numeric")

As one would expect, the correlations between the lag variables and today's returns are close to zero. In other words, there appears to be little correlation between today's returns and previous days' returns. The only substantial correlation is between Year and volume. By plotting the data, which is ordered chronologically, we see that volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.

In [3]:
attach(Smarket)
plot(Volume)

Logistic Regression

Next, we will fit a logistic regression model in order to predict direction using lagone through lagfive and volume. The glm() function can be used to fit many types of generalized linear models , including logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

In [4]:
glm.fits <- glm(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket, family = binomial
  )
summary(glm.fits)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
 -1.45   -1.20    1.07    1.15    1.33  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.12600    0.24074   -0.52     0.60
Lag1        -0.07307    0.05017   -1.46     0.15
Lag2        -0.04230    0.05009   -0.84     0.40
Lag3         0.01109    0.04994    0.22     0.82
Lag4         0.00936    0.04997    0.19     0.85
Lag5         0.01031    0.04951    0.21     0.83
Volume       0.13544    0.15836    0.86     0.39

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1742

Number of Fisher Scoring iterations: 3

The smallest $p$-value here is associated with lagone. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of $0.15$, the $p$-value is still relatively large, and so there is no clear evidence of a real association between lagone and direction.

We use the coef() function in order to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the $p$-values for the coefficients.

In [5]:
coef(glm.fits)
summary(glm.fits)$coef
summary(glm.fits)$coef[, 4]
(Intercept)
-0.126000256559266
Lag1
-0.0730737458900261
Lag2
-0.0423013440073082
Lag3
0.0110851083796762
Lag4
0.00935893837027874
Lag5
0.0103130684758179
Volume
0.13544065885916
A matrix: 7 × 4 of type dbl
EstimateStd. Errorz valuePr(>|z|)
(Intercept)-0.1260000.24074-0.52340.6007
Lag1-0.0730740.05017-1.45660.1452
Lag2-0.0423010.05009-0.84460.3983
Lag3 0.0110850.04994 0.22200.8243
Lag4 0.0093590.04997 0.18730.8514
Lag5 0.0103130.04951 0.20830.8350
Volume 0.1354410.15836 0.85530.3924
(Intercept)
0.600698319413356
Lag1
0.145227211568647
Lag2
0.398349095427021
Lag3
0.824333346101536
Lag4
0.851444506926454
Lag5
0.83499739049983
Volume
0.39240043320243

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type = "response" option tells R to output probabilities of the form $P(Y=1|X)$, as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model. Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.

In [6]:
glm.probs <- predict(glm.fits, type = "response")
glm.probs[1:10]
contrasts(Direction)
1
0.507084133395402
2
0.481467878454591
3
0.481138835214201
4
0.515222355813022
5
0.510781162691538
6
0.506956460534911
7
0.492650874187038
8
0.509229158207377
9
0.517613526170958
10
0.488837779771376
A matrix: 2 × 1 of type dbl
Up
Down0
Up1

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than $0.5$.

In [7]:
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"

The first command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds $0.5$. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified. %By inputting two qualitative vectors R will create a two by two table with counts of the number of times each combination occurred e.g. predicted {\it Up} and market increased, predicted {\it Up} and the market decreased etc.

In [8]:
table(glm.pred, Direction)
(507 + 145) / 1250
mean(glm.pred == Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
0.5216
0.5216

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on $507$~days and that it would go down on $145$~days, for a total of $507+145 = 652$ correct predictions. The mean() function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market $52.2$\,\% of the time.

At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of $1,250$ observations. In other words, $100\%-52.2\%=47.8\%$, is the training error rate. As we have seen previously, the training error rate is often overly optimistic---it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model's performance not on the data that we used to fit the model, but rather on days in the future for which the market's movements are unknown.

To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

In [9]:
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
Direction.2005 <- Direction[!train]
  1. 252
  2. 9

The object train is a vector of $1{,}250$ elements, corresponding to the observations in our data set. The elements of the vector that correspond to observations that occurred before 2005 are set to TRUE, whereas those that correspond to observations in 2005 are set to FALSE. The object train is a Boolean vector, since its elements are TRUE and FALSE. Boolean vectors can be used to obtain a subset of the rows or columns of a matrix. For instance, the command Smarket[train, ] would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of train are TRUE. The ! symbol can be used to reverse all of the elements of a Boolean vector. That is, !train is a vector similar to train, except that the elements that are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train get swapped to TRUE in !train. Therefore, Smarket[!train, ] yields a submatrix of the stock market data containing only the observations for which train is FALSE---that is, the observations with dates in 2005. The output above indicates that there are 252 such observations.

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set---that is, for the days in 2005.

In [10]:
glm.fits <- glm(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket, family = binomial, subset = train
  )
glm.probs <- predict(glm.fits, Smarket.2005,
    type = "response")

Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.

In [11]:
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005)
mean(glm.pred != Direction.2005)
        Direction.2005
glm.pred Down Up
    Down   77 97
    Up     34 44
0.48015873015873
0.51984126984127

The != notation means not equal to, and so the last command computes the test set error rate. The results are rather disappointing: the test error rate is $52$\,\%, which is worse than random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous days' returns to predict future market performance. (After all, if it were possible to do so, then the authors of this book would be out striking it rich rather than writing a statistics textbook.)

We recall that the logistic regression model had very underwhelming $p$-values associated with all of the predictors, and that the smallest $p$-value, though not very small, corresponded to lagone. Perhaps by removing the variables that appear not to be helpful in predicting direction, we can obtain a more effective model. After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement. Below we have refit the logistic regression using just lagone and lagtwo, which seemed to have the highest predictive power in the original logistic regression model.

In [12]:
glm.fits <- glm(Direction ~ Lag1 + Lag2, data = Smarket,
    family = binomial, subset = train)
glm.probs <- predict(glm.fits, Smarket.2005,
    type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005)
106 / (106 + 76)
        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106
0.55952380952381
0.582417582417582

Now the results appear to be a little better: $56\%$ of the daily movements have been correctly predicted. It is worth noting that in this case, a much simpler strategy of predicting that the market will increase every day will also be correct $56\%$ of the time! Hence, in terms of overall error rate, the logistic regression method is no better than the naive approach. However, the confusion matrix shows that on days when logistic regression predicts an increase in the market, it has a $58\%$ accuracy rate. This suggests a possible trading strategy of buying on days when the model predicts an increasing market, and avoiding trades on days when a decrease is predicted. Of course one would need to investigate more carefully whether this small improvement was real or just due to random chance.

Suppose that we want to predict the returns associated with particular values of lagone and lagtwo. In particular, we want to predict direction on a day when lagone and lagtwo equal 1.2 and~1.1, respectively, and on a day when they equal 1.5 and $-$0.8. We do this using the predict() function.

In [13]:
predict(glm.fits,
    newdata =
      data.frame(Lag1 = c(1.2, 1.5),  Lag2 = c(1.1, -0.8)),
    type = "response"
  )
1
0.479146239171912
2
0.496093872956532

Linear Discriminant Analysis

Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library. Notice that the syntax for the lda() function is identical to that of lm(), and to that of glm() except for the absence of the family option. We fit the model using only the observations before 2005.

In [14]:
library(MASS)
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket,
    subset = train)
lda.fit
plot(lda.fit)
Attaching package: ‘MASS’


The following object is masked from ‘package:ISLR2’:

    Boston


Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
 Down    Up 
0.492 0.508 

Group means:
         Lag1     Lag2
Down  0.04279  0.03389
Up   -0.03955 -0.03133

Coefficients of linear discriminants:
         LD1
Lag1 -0.6420
Lag2 -0.5135

The LDA output indicates that $\hat\pi_1=0.492$ and $\hat\pi_2=0.508$; in other words, $49.2$\,\% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of $\mu_k$. These suggest that there is a tendency for the previous 2~days' returns to be negative on days when the market increases, and a tendency for the previous days' returns to be positive on days when the market declines. The coefficients of linear discriminants output provides the linear combination of lagone and lagtwo that are used to form the LDA decision rule. In other words, these are the multipliers of the elements of $X=x$ in ( 4.24). If $-0.642\times `lagone` - 0.514 \times `lagtwo`$ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.

The plot() function produces plots of the linear discriminants, obtained by computing $-0.642\times `lagone` - 0.514 \times `lagtwo`$ for each of the training observations. The Up and Down observations are displayed separately.

The predict() function returns a list with three elements. The first element, class, contains LDA's predictions about the movement of the market. The second element, posterior, is a matrix whose $k$th column contains the posterior probability that the corresponding observation belongs to the $k$th class, computed from ( 4.15). Finally, x contains the linear discriminants, described earlier.

In [15]:
lda.pred <- predict(lda.fit, Smarket.2005)
names(lda.pred)
  1. 'class'
  2. 'posterior'
  3. 'x'

As we observed in Section 4.5, the LDA and logistic regression predictions are almost identical.

In [16]:
lda.class <- lda.pred$class
table(lda.class, Direction.2005)
mean(lda.class == Direction.2005)
         Direction.2005
lda.class Down  Up
     Down   35  35
     Up     76 106
0.55952380952381

Applying a $50$\,\% threshold to the posterior probabilities allows us to recreate the predictions contained in lda.pred$class.

In [17]:
sum(lda.pred$posterior[, 1] >= .5)
sum(lda.pred$posterior[, 1] < .5)
70
182

Notice that the posterior probability output by the model corresponds to the probability that the market will decrease:

In [18]:
lda.pred$posterior[1:20, 1]
lda.class[1:20]
999
0.490179249818258
1000
0.479218499099683
1001
0.466818479852065
1002
0.474001069455248
1003
0.492787663967445
1004
0.493856154997504
1005
0.495101564646223
1006
0.487286099421815
1007
0.490701348960405
1008
0.484402624071869
1009
0.490696276120968
1010
0.511998846261919
1011
0.489515226936648
1012
0.470676122211879
1013
0.474459285611829
1014
0.479958339148108
1015
0.493577529465861
1016
0.503089377118306
1017
0.497880612141404
1018
0.488633086516518
  1. Up
  2. Up
  3. Up
  4. Up
  5. Up
  6. Up
  7. Up
  8. Up
  9. Up
  10. Up
  11. Up
  12. Down
  13. Up
  14. Up
  15. Up
  16. Up
  17. Up
  18. Down
  19. Up
  20. Up
Levels:
  1. 'Down'
  2. 'Up'

If we wanted to use a posterior probability threshold other than $50$\,\% in order to make predictions, then we could easily do so. For instance, suppose that we wish to predict a market decrease only if we are very certain that the market will indeed decrease on that day---say, if the posterior probability is at least $90$\,\%.

In [19]:
sum(lda.pred$posterior[, 1] > .9)
0

No days in 2005 meet that threshold! In fact, the greatest posterior probability of decrease in all of 2005 was $52.02$\,\%.

Quadratic Discriminant Analysis

We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library. The syntax is identical to that of lda().

In [20]:
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket,
    subset = train)
qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
 Down    Up 
0.492 0.508 

Group means:
         Lag1     Lag2
Down  0.04279  0.03389
Up   -0.03955 -0.03133

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.

In [21]:
qda.class <- predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)
mean(qda.class == Direction.2005)
         Direction.2005
qda.class Down  Up
     Down   30  20
     Up     81 121
0.599206349206349

Interestingly, the QDA predictions are accurate almost $60$\,\% of the time, even though the 2005 data was not used to fit the model. This level of accuracy is quite impressive for stock market data, which is known to be quite hard to model accurately. This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA and logistic regression. However, we recommend evaluating this method's performance on a larger test set before betting that this approach will consistently beat the market!

Naive Bayes

Next, we fit a naive Bayes model to the Smarket data. Naive Bayes is implemented in R using the naiveBayes() function, which is part of the e1071 library. The syntax is identical to that of lda() and qda(). By default, this implementation of the naive Bayes classifier models each quantitative feature using a Gaussian distribution. However, a kernel density method can also be used to estimate the distributions.

In [22]:
library(e1071)
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = Smarket,
    subset = train)
nb.fit
Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
 Down    Up 
0.492 0.508 

Conditional probabilities:
      Lag1
Y          [,1]  [,2]
  Down  0.04279 1.227
  Up   -0.03955 1.232

      Lag2
Y          [,1]  [,2]
  Down  0.03389 1.239
  Up   -0.03133 1.221

The output contains the estimated mean and standard deviation for each variable in each class. For example, the mean for lagone is $0.0428$ for

Direction=Down, and the standard deviation is $1.23$. We can easily verify this:

In [23]:
mean(Lag1[train][Direction[train] == "Down"])
sd(Lag1[train][Direction[train] == "Down"])
0.0427902240325866
1.22744562820108

The predict() function is straightforward.

In [24]:
nb.class <- predict(nb.fit, Smarket.2005)
table(nb.class, Direction.2005)
mean(nb.class == Direction.2005)
        Direction.2005
nb.class Down  Up
    Down   28  20
    Up     83 121
0.591269841269841

Naive Bayes performs very well on this data, with accurate predictions over $59\%$ of the time. This is slightly worse than QDA, but much better than LDA.

The predict() function can also generate estimates of the probability that each observation belongs to a particular class. %

In [25]:
nb.preds <- predict(nb.fit, Smarket.2005, type = "raw")
nb.preds[1:5, ]
A matrix: 5 × 2 of type dbl
DownUp
0.48730.5127
0.47620.5238
0.46530.5347
0.47490.5251
0.49020.5098

$K$-Nearest Neighbors

We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs.

  • A matrix containing the predictors associated with the training data, labeled train.X below.
  • A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
  • A vector containing the class labels for the training observations, labeled train.Direction below.
  • A value for $K$, the number of nearest neighbors to be used by the classifier.

We use the cbind() function, short for column bind, to bind the lagone and lagtwo variables together into two matrices, one for the training set and the other for the test set.

In [26]:
library(class)
train.X <- cbind(Lag1, Lag2)[train, ]
test.X <- cbind(Lag1, Lag2)[!train, ]
train.Direction <- Direction[train]

Now the knn() function can be used to predict the market's movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.

In [27]:
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2005)
(83 + 43) / 252
        Direction.2005
knn.pred Down Up
    Down   43 58
    Up     68 83
0.5

The results using $K=1$ are not very good, since only $50$\,\% of the observations are correctly predicted. Of course, it may be that $K=1$ results in an overly flexible fit to the data. Below, we repeat the analysis using $K=3$.

In [28]:
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
mean(knn.pred == Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 54
    Up     63 87
0.535714285714286

The results have improved slightly. But increasing $K$ further turns out to provide no further improvements. It appears that for this data, QDA provides the best results of the methods that we have examined so far.

KNN does not perform well on the Smarket data but it does often provide impressive results. As an example we will apply the KNN approach to the Insurance data set, which is part of the ISLR2 library. This data set includes $85$ predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only $6$\,\% of people purchased caravan insurance.

In [29]:
dim(Caravan)
attach(Caravan)
summary(Purchase)
348 / 5822
  1. 5822
  2. 86
No
5474
Yes
348
0.0597732737890759

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale. For instance, imagine a data set that contains two variables, salary and age (measured in dollars and years, respectively). As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of $50$~years in age. Consequently, `salary` will drive the KNN classification results, and `age` will have almost no effect. This is contrary to our intuition that a salary difference of $$1{,}000$ is quite small compared to an age difference of $50$~years. Furthermore, the importance of scale to the KNN classifier leads to another issue: if we measured salary in Japanese yen, or if we measured age in minutes, then we'd get quite different classification results from what we get if these two variables are measured in dollars and years.

A good way to handle this problem is to the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The scale() function does just this. In standardizing the data, we exclude column $86$, because that is the qualitative Purchase variable.

In [30]:
standardized.X <- scale(Caravan[, -86])
var(Caravan[, 1])
var(Caravan[, 2])
var(standardized.X[, 1])
var(standardized.X[, 2])
165.037847395189
0.164707781931954
1
1

Now every column of standardized.X has a standard deviation of one and a mean of zero.

We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using $K=1$, and evaluate its performance on the test data.%

In [31]:
test <- 1:1000
train.X <- standardized.X[-test, ]
test.X <- standardized.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != knn.pred)
mean(test.Y != "No")
0.118
0.059

The vector test is numeric, with values from $1$ through $1,000$. Typing standardized.X[test, ] yields the submatrix of the data containing the observations whose indices range from $1$ to $1,000$, whereas typing

standardized.X[-test, ] yields the submatrix containing the observations whose indices do not range from $1$ to $1,000$. The KNN error rate on the 1,000 test observations is just under $12$\,\%. At first glance, this may appear to be fairly good. However, since only $6$\,\% of customers purchased insurance, we could get the error rate down to $6$\,\% by always predicting No regardless of the values of the predictors!

Suppose that there is some non-trivial cost to trying to sell insurance to a given individual. For instance, perhaps a salesperson must visit each potential customer. If the company tries to sell insurance to a random selection of customers, then the success rate will be only $6$\,\%, which may be far too low given the costs involved. Instead, the company would like to try to sell insurance only to customers who are likely to buy it. So the overall error rate is not of interest. Instead, the fraction of individuals that are correctly predicted to buy insurance is of interest.

It turns out that KNN with $K=1$ does far better than random guessing among the customers that are predicted to buy insurance. Among $77$ such customers, $9$, or $11.7$\,\%, actually do purchase insurance. This is double the rate that one would obtain from random guessing.

In [32]:
table(knn.pred, test.Y)
9 / (68 + 9)
        test.Y
knn.pred  No Yes
     No  873  50
     Yes  68   9
0.116883116883117

Using $K=3$, the success rate increases to $19$\,\%, and with $K=5$ the rate is $26.7$\,\%. This is over four times the rate that results from random guessing. It appears that KNN is finding some real patterns in a difficult data set!

In [33]:
knn.pred <- knn(train.X, test.X, train.Y, k = 3)
table(knn.pred, test.Y)
5 / 26
knn.pred <- knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, test.Y)
4 / 15
        test.Y
knn.pred  No Yes
     No  920  54
     Yes  21   5
0.192307692307692
        test.Y
knn.pred  No Yes
     No  930  55
     Yes  11   4
0.266666666666667

However, while this strategy is cost-effective, it is worth noting that only 15 customers are predicted to purchase insurance using KNN with $K=5$. In practice, the insurance company may wish to expend resources on convincing more than just 15 potential customers to buy insurance.

As a comparison, we can also fit a logistic regression model to the data. If we use $0.5$ as the predicted probability cut-off for the classifier, then we have a problem: only seven of the test observations are predicted to purchase insurance. Even worse, we are wrong about all of these! However, we are not required to use a cut-off of $0.5$. If we instead predict a purchase any time the predicted probability of purchase exceeds $0.25$, we get much better results: we predict that 33 people will purchase insurance, and we are correct for about $33$\,\% of these people. This is over five times better than random guessing!

In [34]:
glm.fits <- glm(Purchase ~ ., data = Caravan,
    family = binomial, subset = -test)
glm.probs <- predict(glm.fits, Caravan[test, ],
    type = "response")
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .5] <- "Yes"
table(glm.pred, test.Y)
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .25] <- "Yes"
table(glm.pred, test.Y)
11 / (22 + 11)
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
        test.Y
glm.pred  No Yes
     No  934  59
     Yes   7   0
        test.Y
glm.pred  No Yes
     No  919  48
     Yes  22  11
0.333333333333333

Poisson Regression

Finally, we fit a Poisson regression model to the Bikeshare data set, which measures the number of bike rentals (bikers) per hour in Washington, DC. The data can be found in the ISLR2 library.

In [35]:
attach(Bikeshare)
dim(Bikeshare)
names(Bikeshare)
  1. 8645
  2. 15
  1. 'season'
  2. 'mnth'
  3. 'day'
  4. 'hr'
  5. 'holiday'
  6. 'weekday'
  7. 'workingday'
  8. 'weathersit'
  9. 'temp'
  10. 'atemp'
  11. 'hum'
  12. 'windspeed'
  13. 'casual'
  14. 'registered'
  15. 'bikers'

We begin by fitting a least squares linear regression model to the data.

In [36]:
mod.lm <- lm(
    bikers ~ mnth + hr + workingday + temp + weathersit,
    data = Bikeshare
  )
summary(mod.lm)
Call:
lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
    data = Bikeshare)

Residuals:
   Min     1Q Median     3Q    Max 
-299.0  -45.7   -6.2   41.1  425.3 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -68.63       5.31  -12.93  < 2e-16 ***
mnthFeb                       6.85       4.29    1.60  0.11040    
mnthMarch                    16.55       4.30    3.85  0.00012 ***
mnthApril                    41.42       4.97    8.33  < 2e-16 ***
mnthMay                      72.56       5.64   12.86  < 2e-16 ***
mnthJune                     67.82       6.54   10.36  < 2e-16 ***
mnthJuly                     45.32       7.08    6.40  1.6e-10 ***
mnthAug                      53.24       6.64    8.02  1.2e-15 ***
mnthSept                     66.68       5.92   11.25  < 2e-16 ***
mnthOct                      75.83       4.95   15.32  < 2e-16 ***
mnthNov                      60.31       4.61   13.08  < 2e-16 ***
mnthDec                      46.46       4.27   10.88  < 2e-16 ***
hr1                         -14.58       5.70   -2.56  0.01054 *  
hr2                         -21.58       5.73   -3.76  0.00017 ***
hr3                         -31.14       5.78   -5.39  7.3e-08 ***
hr4                         -36.91       5.80   -6.36  2.1e-10 ***
hr5                         -24.14       5.74   -4.21  2.6e-05 ***
hr6                          20.60       5.70    3.61  0.00031 ***
hr7                         120.09       5.69   21.10  < 2e-16 ***
hr8                         223.66       5.69   39.31  < 2e-16 ***
hr9                         120.58       5.69   21.18  < 2e-16 ***
hr10                         83.80       5.71   14.69  < 2e-16 ***
hr11                        105.42       5.72   18.42  < 2e-16 ***
hr12                        137.28       5.74   23.92  < 2e-16 ***
hr13                        136.04       5.76   23.62  < 2e-16 ***
hr14                        126.64       5.78   21.92  < 2e-16 ***
hr15                        132.09       5.78   22.85  < 2e-16 ***
hr16                        178.52       5.77   30.93  < 2e-16 ***
hr17                        296.27       5.75   51.54  < 2e-16 ***
hr18                        269.44       5.74   46.98  < 2e-16 ***
hr19                        186.26       5.71   32.60  < 2e-16 ***
hr20                        125.55       5.70   22.01  < 2e-16 ***
hr21                         87.55       5.69   15.38  < 2e-16 ***
hr22                         59.12       5.69   10.39  < 2e-16 ***
hr23                         26.84       5.69    4.72  2.4e-06 ***
workingday                    1.27       1.78    0.71  0.47681    
temp                        157.21      10.26   15.32  < 2e-16 ***
weathersitcloudy/misty      -12.89       1.96   -6.56  5.6e-11 ***
weathersitlight rain/snow   -66.49       2.97  -22.42  < 2e-16 ***
weathersitheavy rain/snow  -109.74      76.67   -1.43  0.15234    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 76.5 on 8605 degrees of freedom
Multiple R-squared:  0.675,	Adjusted R-squared:  0.673 
F-statistic:  457 on 39 and 8605 DF,  p-value: <2e-16

Due to space constraints, we truncate the output of summary(mod.lm). In mod.lm, the first level of hr (0) and mnth (Jan) are treated as the baseline values, and so no coefficient estimates are provided for them: implicitly, their coefficient estimates are zero, and all other levels are measured relative to these baselines. For example, the Feb coefficient of $6.845$ signifies that, holding all other variables constant, there are on average about 7 more riders in February than in January. Similarly there are about 16.5 more riders in March than in January.

The results seen in Section 4.6.1 used a slightly different coding of the variables hr and mnth, as follows:

In [37]:
contrasts(Bikeshare$hr) = contr.sum(24)
contrasts(Bikeshare$mnth) = contr.sum(12)
mod.lm2 <- lm(
    bikers ~ mnth + hr + workingday + temp + weathersit,
    data = Bikeshare
  )
summary(mod.lm2)
Call:
lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
    data = Bikeshare)

Residuals:
   Min     1Q Median     3Q    Max 
-299.0  -45.7   -6.2   41.1  425.3 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 73.597      5.132   14.34  < 2e-16 ***
mnth1                      -46.087      4.085  -11.28  < 2e-16 ***
mnth2                      -39.242      3.539  -11.09  < 2e-16 ***
mnth3                      -29.536      3.155   -9.36  < 2e-16 ***
mnth4                       -4.662      2.741   -1.70   0.0889 .  
mnth5                       26.470      2.851    9.28  < 2e-16 ***
mnth6                       21.732      3.465    6.27  3.7e-10 ***
mnth7                       -0.763      3.908   -0.20   0.8453    
mnth8                        7.156      3.535    2.02   0.0430 *  
mnth9                       20.591      3.046    6.76  1.5e-11 ***
mnth10                      29.747      2.700   11.02  < 2e-16 ***
mnth11                      14.223      2.860    4.97  6.7e-07 ***
hr1                        -96.142      3.955  -24.31  < 2e-16 ***
hr2                       -110.721      3.966  -27.92  < 2e-16 ***
hr3                       -117.721      4.016  -29.31  < 2e-16 ***
hr4                       -127.283      4.081  -31.19  < 2e-16 ***
hr5                       -133.050      4.117  -32.32  < 2e-16 ***
hr6                       -120.278      4.037  -29.79  < 2e-16 ***
hr7                        -75.542      3.992  -18.93  < 2e-16 ***
hr8                         23.951      3.969    6.04  1.7e-09 ***
hr9                        127.520      3.950   32.28  < 2e-16 ***
hr10                        24.440      3.936    6.21  5.6e-10 ***
hr11                       -12.341      3.936   -3.14   0.0017 ** 
hr12                         9.281      3.945    2.35   0.0187 *  
hr13                        41.142      3.957   10.40  < 2e-16 ***
hr14                        39.894      3.975   10.04  < 2e-16 ***
hr15                        30.494      3.991    7.64  2.4e-14 ***
hr16                        35.944      3.995    9.00  < 2e-16 ***
hr17                        82.379      3.988   20.66  < 2e-16 ***
hr18                       200.125      3.964   50.49  < 2e-16 ***
hr19                       173.299      3.956   43.81  < 2e-16 ***
hr20                        90.114      3.940   22.87  < 2e-16 ***
hr21                        29.407      3.936    7.47  8.7e-14 ***
hr22                        -8.588      3.933   -2.18   0.0290 *  
hr23                       -37.019      3.934   -9.41  < 2e-16 ***
workingday                   1.270      1.784    0.71   0.4768    
temp                       157.209     10.261   15.32  < 2e-16 ***
weathersitcloudy/misty     -12.890      1.964   -6.56  5.6e-11 ***
weathersitlight rain/snow  -66.494      2.965  -22.42  < 2e-16 ***
weathersitheavy rain/snow -109.745     76.667   -1.43   0.1523    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 76.5 on 8605 degrees of freedom
Multiple R-squared:  0.675,	Adjusted R-squared:  0.673 
F-statistic:  457 on 39 and 8605 DF,  p-value: <2e-16

What is the difference between the two codings? In mod.lm2, a coefficient estimate is reported for all but the last level of hr and mnth. Importantly, in mod.lm2, the coefficient estimate for the last level of mnth is not zero: instead, it equals the negative of the sum of the coefficient estimates for all of the other levels. Similarly, in mod.lm2, the coefficient estimate for the last level of hr is the negative of the sum of the coefficient estimates for all of the other levels. This means that the coefficients of hr and mnth in mod.lm2 will always sum to zero, and can be interpreted as the difference from the mean level. For example, the coefficient for January of $-46.087$ indicates that, holding all other variables constant, there are typically 46 fewer riders in January relative to the yearly average.

It is important to realize that the choice of coding really does not matter, provided that we interpret the model output correctly in light of the coding used. For example, we see that the predictions from the linear model are the same regardless of coding:

In [38]:
sum((predict(mod.lm) - predict(mod.lm2))^2)
7.35067529480971e-23

The sum of squared differences is zero. We can also see this using the all.equal() function:

In [39]:
all.equal(predict(mod.lm), predict(mod.lm2))
TRUE

To reproduce the left-hand side of Figure 4.13, we must first obtain the coefficient estimates associated with mnth. The coefficients for January through November can be obtained directly from the mod.lm2 object. The coefficient for December must be explicitly computed as the negative sum of all the other months.

In [40]:
coef.months <- c(coef(mod.lm2)[2:12],
    -sum(coef(mod.lm2)[2:12]))

To make the plot, we manually label the $x$-axis with the names of the months.

In [41]:
plot(coef.months, xlab = "Month", ylab = "Coefficient",
    xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A",
    "M", "J", "J", "A", "S", "O", "N", "D"))

Reproducing the right-hand side of Figure 4.13 follows a similar process.

In [42]:
coef.hours <- c(coef(mod.lm2)[13:35],
    -sum(coef(mod.lm2)[13:35]))
plot(coef.hours, xlab = "Hour", ylab = "Coefficient",
    col = "blue", pch = 19, type = "o")

Now, we consider instead fitting a Poisson regression model to the Bikeshare data. Very little changes, except that we now use the function glm() with the argument family = poisson to specify that we wish to fit a Poisson regression model:

In [43]:
mod.pois <- glm(
    bikers ~ mnth + hr + workingday + temp + weathersit,
    data = Bikeshare, family = poisson
  )
summary(mod.pois)
Call:
glm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
    family = poisson, data = Bikeshare)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-20.757   -3.344   -0.655    2.700   21.963  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                4.11824    0.00602  683.96  < 2e-16 ***
mnth1                     -0.67017    0.00591 -113.44  < 2e-16 ***
mnth2                     -0.44412    0.00486  -91.38  < 2e-16 ***
mnth3                     -0.29373    0.00414  -70.89  < 2e-16 ***
mnth4                      0.02152    0.00312    6.89  5.7e-12 ***
mnth5                      0.24047    0.00292   82.46  < 2e-16 ***
mnth6                      0.22324    0.00355   62.82  < 2e-16 ***
mnth7                      0.10362    0.00412   25.12  < 2e-16 ***
mnth8                      0.15117    0.00366   41.28  < 2e-16 ***
mnth9                      0.23349    0.00310   75.28  < 2e-16 ***
mnth10                     0.26757    0.00278   96.09  < 2e-16 ***
mnth11                     0.15026    0.00318   47.25  < 2e-16 ***
hr1                       -0.75439    0.00788  -95.74  < 2e-16 ***
hr2                       -1.22598    0.00995 -123.17  < 2e-16 ***
hr3                       -1.56315    0.01187 -131.70  < 2e-16 ***
hr4                       -2.19830    0.01642 -133.85  < 2e-16 ***
hr5                       -2.83048    0.02254 -125.59  < 2e-16 ***
hr6                       -1.81466    0.01346 -134.78  < 2e-16 ***
hr7                       -0.42989    0.00690  -62.34  < 2e-16 ***
hr8                        0.57518    0.00441  130.54  < 2e-16 ***
hr9                        1.07693    0.00356  302.22  < 2e-16 ***
hr10                       0.58177    0.00429  135.73  < 2e-16 ***
hr11                       0.33685    0.00472   71.37  < 2e-16 ***
hr12                       0.49412    0.00439  112.49  < 2e-16 ***
hr13                       0.67964    0.00407  167.04  < 2e-16 ***
hr14                       0.67356    0.00409  164.72  < 2e-16 ***
hr15                       0.62491    0.00418  149.57  < 2e-16 ***
hr16                       0.65376    0.00413  158.21  < 2e-16 ***
hr17                       0.87430    0.00378  231.04  < 2e-16 ***
hr18                       1.29463    0.00325  397.85  < 2e-16 ***
hr19                       1.21228    0.00332  365.08  < 2e-16 ***
hr20                       0.91402    0.00370  247.06  < 2e-16 ***
hr21                       0.61620    0.00419  147.05  < 2e-16 ***
hr22                       0.36418    0.00466   78.17  < 2e-16 ***
hr23                       0.11749    0.00522   22.49  < 2e-16 ***
workingday                 0.01467    0.00195    7.50  6.3e-14 ***
temp                       0.78529    0.01148   68.43  < 2e-16 ***
weathersitcloudy/misty    -0.07523    0.00218  -34.53  < 2e-16 ***
weathersitlight rain/snow -0.57580    0.00406 -141.90  < 2e-16 ***
weathersitheavy rain/snow -0.92629    0.16678   -5.55  2.8e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1052921  on 8644  degrees of freedom
Residual deviance:  228041  on 8605  degrees of freedom
AIC: 281159

Number of Fisher Scoring iterations: 5

We can plot the coefficients associated with mnth and hr, in order to reproduce Figure 4.15:

In [44]:
coef.mnth <- c(coef(mod.pois)[2:12],
    -sum(coef(mod.pois)[2:12]))
plot(coef.mnth, xlab = "Month", ylab = "Coefficient",
     xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D"))
coef.hours <- c(coef(mod.pois)[13:35],
     -sum(coef(mod.pois)[13:35]))
plot(coef.hours, xlab = "Hour", ylab = "Coefficient",
    col = "blue", pch = 19, type = "o")

We can once again use the predict() function to obtain the fitted values (predictions) from this Poisson regression model. However, we must use the argument type = "response" to specify that we want R to output $\exp(\hat\beta_0 + \hat\beta_1 X_1 + \ldots +\hat\beta_p X_p)$ rather than $\hat\beta_0 + \hat\beta_1 X_1 + \ldots + \hat\beta_p X_p$, which it will output by default.

In [45]:
plot(predict(mod.lm2), predict(mod.pois, type = "response"))
abline(0, 1, col = 2, lwd = 3)

The predictions from the Poisson regression model are correlated with those from the linear model; however, the former are non-negative. As a result the Poisson regression predictions tend to be larger than those from the linear model for either very low or very high levels of ridership.

In this section, we used the glm() function with the argument family = poisson in order to perform Poisson regression. Earlier in this lab we used the glm() function with family = binomial to perform logistic regression. Other choices for the family argument can be used to fit other types of GLMs. For instance, family = Gamma fits a gamma regression model.