Chapter 1 Introduction

Statistics in the news

  • “Its machine learning allows the computer to become smarter as it tries to answer questions - and to learn as it gets them right or wrong.”

Statistical learning problems

  • Identify the risk factors for prostate cancer.

  • Classify a recorded phoneme based on a log-periodogram

  • Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements.

  • Customize an email spam detection system.
    • data from 4601 emails sent to an individual (named George, at HP labs, before 2000). Each is labeled as spam or email.
    • goal: build a customized spam filter.
    • input features: relative frequencies of 57 of the most commonly occurring words and punctuation marks in these email messages.

  • The features measure, for each message, the average percentage of words or characters equal to a given word or character; the words and characters chosen are those showing the largest difference in frequency between spam and genuine email.

  • Identify the numbers in a handwritten zip code.

  • Classify a tissue sample into one of several cancer classes, based on a gene expression profile.

  • Establish the relationship between salary and demographic variables in population survey data.

  • Classify the pixels in a LANDSAT image, by usage.

  • In this course we take a statistical learning perspective on statistical techniques, of which various multivariate analysis methods form a part. According to James et al. (p. 1):

Statistical learning refers to a vast set of tools for understanding data.

1.1 Understanding Data

Tools for understanding data can be broadly classified as

Supervised learning

  • Supervised learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.

    • The key is to learn from a training data set and to use the training results for prediction as new input data become available.

      • The learning problem consists of inferring from the training data the function that maps future inputs to predictions, i.e., given input variables \(x\) and output variable(s) \(y\), find a function \(f\) such that \(f(x) \approx y\) in a predictive sense (a small R sketch after this list illustrates the idea).
  • Starting point:

    • Outcome measurement \(Y\) (also called dependent variable, response, target)
    • Vector of \(p\) predictor measurements \(X\) (also called inputs, regressors, covariates, features, independent variables).
    • In the regression problem, \(Y\) is quantitative (e.g price, blood pressure).
    • In the classification problem, \(Y\) takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample).
    • We have training data \((x_1,y_1), \ldots, (x_N,y_N)\). These are observations (examples, instances) of these measurements.
  • Tools

    • Regression methods; find the functional mapping of input variables to quantitative output variable(s) (e.g. how wage is related to some background variables, like age, education, gender, etc.).
    • Classification methods; find a functional mapping of input variables to a discrete set of classes (e.g. how different financial ratios predict firm solvency {solvent, insolvent}).
  • On the basis of the training data we would like to:

    • Accurately predict unseen test cases.
    • Understand which inputs affect the outcome, and how.
    • Assess the quality of our predictions and inferences.
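  • A minimal sketch of this workflow in R, using simulated data (the true \(f\) is of course unknown in practice; it is simulated here only so the idea is visible):

set.seed(1)
n <- 200
x <- runif(n, 0, 10)                                        # one input variable
y <- 2 + 0.5 * x + rnorm(n, sd = 1)                         # output = f(x) + noise
train <- sample(n, 150)                                     # indices of the training set
fit <- lm(y ~ x, subset = train)                            # learn f from the training data
y_hat <- predict(fit, newdata = data.frame(x = x[-train]))  # predict unseen test cases
mean((y[-train] - y_hat)^2)                                 # test mean squared error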

Unsupervised learning

  • There are inputs but no supervising output; from such data we can learn relationships and structures.
    • No outcome variable, just a set of predictors (features) measured on a set of samples.
    • The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
    • It is difficult to know how well you are doing.
    • Different from supervised learning, but can be useful as a pre-processing step for supervised learning.
  • Tools: various clustering methods (e.g. K-means and hierarchical clustering).

The Netflix prize

  • The competition started in October 2006. The training data consist of ratings of 18,000 movies by 400,000 Netflix customers, each rating between 1 and 5.
  • The training data are very sparse: about 98% of the ratings are missing.
  • The objective is to predict the ratings for a set of 1 million customer-movie pairs that are missing from the training data.
  • Netflix’s original algorithm achieved a root-mean-squared error (RMSE) of 0.953 (a toy RMSE calculation is sketched after this list).
  • The first team to achieve a 10% improvement wins one million dollars.
  • Is this a supervised or unsupervised problem?
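  • A toy RMSE calculation in R (the ratings below are made up, not from the actual Netflix data):

actual    <- c(5, 3, 4, 1, 2)            # true ratings of five customer-movie pairs
predicted <- c(4.5, 3.2, 3.8, 1.9, 2.4)  # hypothetical predictions
sqrt(mean((actual - predicted)^2))       # root-mean-squared error (RMSE)
0.953 * 0.90                             # the winning target: 10% below 0.953, about 0.858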

Statistical Learning versus Machine Learning

  • Machine learning arose as a subfield of Artificial Intelligence.
  • Statistical learning arose as a subfield of Statistics.
  • There is much overlap - both fields focus on supervised and unsupervised problems:
    • Machine learning has a greater emphasis on large scale applications and prediction accuracy.
    • Statistical learning emphasizes models and their interpretability, and precision and uncertainty.
  • But the distinction has become more and more blurred, and there is a great deal of “cross-fertilization”.
  • Machine learning has the upper hand in Marketing!

1.1.1 Example 1 (Supervised learning: Continuous output)

  • Wages of a group of men from the mid-Atlantic region of the US.

  • The interest is in the relation/effects of various background factors (like age, education, and calendar year) on wage.

#install.packages("ISLR2") # install ISLR2 if not done yet
library("ISLR2") # load ISLR2
help(package = "ISLR2") # short info about the package

summary(Wage) # summarizing Wage data
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 
str(Wage) # structure of Wage data
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...
par(mfcol = c(1,3)) # divide the plotting region into one row of three panels
plot(x = Wage$age, y = Wage$wage, xlab = "Age", ylab = "Annual Wage (1,000 USD)", col = "gray", pch = 20) # Wage and age
lines(lowess(x = Wage$age, y = Wage$wage, f = 1/5), col = "blue", lwd = 2) # impose trend line using lowess function
plot(x = Wage$year, y = Wage$wage, xlab = "Year", ylab = "Annual Wage (1,000 USD)", col = "gray", pch = 20) # year and wage
lines(lowess(x = Wage$year, y = Wage$wage), col = "blue", lwd = 2) # again use lowess to impose a trend line
plot(x = as.factor(as.numeric(Wage$education)), y = Wage$wage, xlab = "Education Level",
     ylab = "Annual Wage (1,000 USD)",
     col = c("steel blue", "green", "yellow", "light blue", "red")) ## box plots (when x is a factor (categorical) variable)

  • There is considerable variability in wages. The trend line in the left-hand panel shows that wages tend to rise until about age 45 and then decline somewhat at older ages.

  • The middle panel shows some increase over the years, and the right-hand panel shows a clear incremental effect of education (1 = no high school diploma, 5 = advanced graduate degree).
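  • A minimal sketch (one possible model, not the book's analysis): a linear regression of wage on these background factors, using the Wage data already loaded from ISLR2.

fit_wage <- lm(wage ~ age + year + education, data = Wage) # linear regression of wage on age, year, and education
summary(fit_wage) # coefficients quantify the effects seen in the plots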


1.1.2 Example 2 (Supervised learning: Categorical output)

  • Predict the direction (Up or Down) of the German stock market on the next trading day on the basis of the direction over the past few days (daily returns from the beginning of 2012 until Oct 17, 2018).
summary(Smarket) # summarizing Smarket data
##       Year           Lag1                Lag2                Lag3          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
##  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
##       Lag4                Lag5              Volume           Today          
##  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
##  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
##  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
##  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
##  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
##  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
##  Direction 
##  Down:602  
##  Up  :648  
##            
##            
##            
## 
str(Smarket) # structure of Smarket data
## 'data.frame':    1250 obs. of  9 variables:
##  $ Year     : num  2001 2001 2001 2001 2001 ...
##  $ Lag1     : num  0.381 0.959 1.032 -0.623 0.614 ...
##  $ Lag2     : num  -0.192 0.381 0.959 1.032 -0.623 ...
##  $ Lag3     : num  -2.624 -0.192 0.381 0.959 1.032 ...
##  $ Lag4     : num  -1.055 -2.624 -0.192 0.381 0.959 ...
##  $ Lag5     : num  5.01 -1.055 -2.624 -0.192 0.381 ...
##  $ Volume   : num  1.19 1.3 1.41 1.28 1.21 ...
##  $ Today    : num  0.959 1.032 -0.623 0.614 0.213 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
par(mfrow = c(1, 3))
plot(x = Smarket$Direction, y = Smarket$Lag1, ylab = "Today's Direction", xlab = "Dax Return (%) Yesterday",
     horizontal = TRUE, col = c("orange", "skyblue"))
plot(x = Smarket$Direction, y = Smarket$Lag2, ylab = "Today's Direction", xlab = "Dax Return (%) Two Days Ago",
     horizontal = TRUE, col = c("orange", "skyblue"))
plot(x = Smarket$Direction, y = Smarket$Lag3, ylab = "Today's Direction", xlab = "Dax Return (%) Three Days Ago",
     horizontal = TRUE, col = c("orange", "skyblue"))

  • Valuable as such predictions would be, the box plots suggest that past returns are of little help in predicting the direction of the market on the next day.
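  • A minimal sketch (one possible classifier, not a full analysis): a logistic regression of today's direction on the previous days' returns, using the Smarket data.

fit_dir <- glm(Direction ~ Lag1 + Lag2 + Lag3, data = Smarket, family = binomial) # logistic regression for Up/Down
summary(fit_dir) # coefficients of the lagged returns; consistent with the plots, the signal is weak
mean(ifelse(predict(fit_dir, type = "response") > 0.5, "Up", "Down") == Smarket$Direction) # in-sample accuracy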


1.1.3 Example 3 (Unsupervised learning: Clustering observations)

  • Clustering consumers according to their consumption habits is an example of unsupervised learning.

  • The interest is to find clusters of similar consumers with similar consumption habits on the basis of the collected purchasing data.

  • K-means clustering and various hierarchical clustering methods are popular tools.
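  • A minimal sketch of K-means clustering in R on simulated purchasing data (two made-up spending features; a real application would use many more):

set.seed(2)
groceries <- c(rnorm(50, mean = 200, sd = 30), rnorm(50, mean = 400, sd = 40)) # monthly grocery spending
dining    <- c(rnorm(50, mean = 150, sd = 25), rnorm(50, mean =  50, sd = 15)) # monthly dining spending
purchases <- scale(cbind(groceries, dining))      # standardize the two features
km <- kmeans(purchases, centers = 2, nstart = 20) # K-means with K = 2 clusters
table(km$cluster)                                 # cluster sizes
plot(purchases, col = km$cluster, pch = 20,
     xlab = "Groceries (std.)", ylab = "Dining (std.)") # clusters in feature space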


1.2 Brief History of Statistical Learning

Brief history:

  • Method of least squares (Legendre and Gauss, beginning of 19th century).

  • Linear discriminant analysis (Fisher, 1936).

  • Logistic regression (1940s by various authors).

  • Generalized linear models (Nelder and Wedderburn, 1970s).

  • Classification and regression trees (Breiman, Friedman, Olshen, and Stone, 1980s).

  • Generalized additive models (Hastie and Tibshirani, 1986).

  • Neural networks (Rumelhart and McClelland, 1986).

  • Support vector machines (Vapnik, 1992).

  • Nowadays, with the advent of machine learning and related disciplines (analytics, big data), statistical learning has emerged as a new subfield of statistics.


1.3 Notation and Simple Matrix Algebra

Choosing notation is always a difficult task.

  • We will use \(n\) to represent the number of distinct data points, or observations, in a sample.

  • Let \(p\) denote the number of variables that are available for use in making predictions.

  • For example, the Wage data set consists of 11 variables measured for 3,000 people, so we have \(n=3000\) observations and \(p=11\) variables (such as year, age, race, and more).

  • In general, we will let \(x_{ij}\) represent the value of the \(j\)th variable for the \(i\)th observation, where \(i=1,2,\ldots,n\) and \(j=1,2,\ldots,p\).

  • Let \(\mathbf{X}\) denote an \(n\times p\) matrix whose \((i,j)\)th element is \(x_{ij}\).

\[ \mathbf{X}=\begin{pmatrix} x_{11}&x_{12}&\cdots&x_{1p}\\ x_{21}&x_{22}&\cdots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ x_{n1}&x_{n2}&\cdots&x_{np}\\ \end{pmatrix} \]

  • At times we will be interested in the rows of \(\mathbf{X}\), which we write as \(x_1, x_2, \ldots, x_n\).
    • Here \(x_i\) is a vector of length \(p\), containing the \(p\) variable measurements for the \(i\)th observation.

\[ x_i=\begin{pmatrix} x_{i1}\\ x_{i2}\\ \vdots\\ x_{ip}\\ \end{pmatrix} \]       Vectors are by default represented as columns.

  • The columns of \(\mathbf{X}\) are written as \(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_p\).
    • Each is a vector of length \(n\).

\[ \mathbf{x}_j=\begin{pmatrix} x_{1j}\\ x_{2j}\\ \vdots\\ x_{nj}\\ \end{pmatrix} \]

  • The matrix \(\mathbf{X}\) can be written as

\[ \mathbf{X}=(\mathbf{x}_1 \,\,\, \mathbf{x}_2 \,\,\, \cdots \,\,\, \mathbf{x}_p) \]

      or

\[ \mathbf{X}=\begin{pmatrix} x_{1}^T\\ x_{2}^T\\ \vdots\\ x_{n}^T\\ \end{pmatrix} \]

  • The \(^T\) notation denotes the transpose of a matrix or vector.

\[ \mathbf{X}^T=\begin{pmatrix} x_{11}&x_{21}&\cdots&x_{n1}\\ x_{12}&x_{22}&\cdots&x_{n2}\\ \vdots&\vdots&\ddots&\vdots\\ x_{1p}&x_{2p}&\cdots&x_{np}\\ \end{pmatrix} \]

      while

\[ x_i^T=(x_{i1} \,\,\, x_{i2} \,\,\, \cdots \,\,\, x_{ip}) \]

  • We use \(y_i\) to denote the \(i\)th observation of the variable on which we wish to make predictions.

  • The set of all \(n\) observations is written in vector form as

\[ \mathbf{y}=\begin{pmatrix} y_{1}\\ y_{2}\\ \vdots\\ y_{n}\\ \end{pmatrix} \]

  • The observed data then consist of \(\{(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\}\), where each \(x_i\) is a vector of length \(p\).
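  • A small illustration of this notation in R (the numbers are made up):

n <- 4; p <- 3
X <- matrix(1:(n * p), nrow = n, ncol = p) # an n x p matrix X
X[2, ]                 # x_2: the second observation (row), a vector of length p
X[, 3]                 # bold x_3: the third variable (column), a vector of length n
t(X)                   # the transpose X^T, a p x n matrix
y <- c(10, 20, 30, 40) # response vector y of length n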

  • A vector of length \(n\) will always be denoted in lower-case bold; e.g.

\[ \mathbf{a}=\begin{pmatrix} a_{1}\\ a_{2}\\ \vdots\\ a_{n}\\ \end{pmatrix} \]

  • Vectors that are not of length \(n\) (such as feature vectors of length \(p\)) will be denoted in lower-case normal font, e.g. \(a\).

  • Matrices will be denoted using bold capitals, such as \(\mathbf{A}\).

  • Random variables will be denoted using capital normal font, e.g. \(A\), regardless of their dimensions.

  • Occasionally, to indicate the dimension of a particular object, for example a scalar, we will use the notation \(a \in \Re\).

  • To indicate that it is a vector of length \(k\), we will use \(a \in \Re^k\) (or \(\mathbf{a} \in \Re^n\) if it is of length \(n\))

  • We will indicate that an object is an \(r\times s\) matrix using \(\mathbf{A}\in \Re^{r\times s}\).

  • For multiplying two matrices, suppose that \(\mathbf{A}\in \Re^{r\times d}\) and \(\mathbf{B}\in \Re^{d\times s}\).

    • Then the product of \(\mathbf{A}\) and \(\mathbf{B}\) is denoted \(\mathbf{AB}\).
    • The \((i,j)\)th element of \(\mathbf{AB}\) is computed by multiplying each element of the \(i\)th row of \(\mathbf{A}\) by the corresponding element of the \(j\)th column of \(\mathbf{B}\).
    • That is, \((\mathbf{AB})_{ij}=\sum_{k=1}^d a_{ik}b_{kj}\).
  • Example

\[ \mathbf{A}=\begin{pmatrix} 1&2\\ 3&4\\ \end{pmatrix} \quad \text{and} \quad \mathbf{B}=\begin{pmatrix} 5&6\\ 7&8\\ \end{pmatrix} \]

      Then

\[ \mathbf{AB}=\begin{pmatrix} 1&2\\ 3&4\\ \end{pmatrix} \begin{pmatrix} 5&6\\ 7&8\\ \end{pmatrix}=\begin{pmatrix} 1\times 5+2\times 7&1\times 6 + 2\times 8\\ 3\times 5 + 4\times 7&3\times 6 + 4 \times 8\\ \end{pmatrix}= \begin{pmatrix} 19&22\\ 43&50\\ \end{pmatrix} \]

  • Note that this operation produces an \(r\times s\) matrix.
    • It is only possible to compute \(\mathbf{AB}\) if the number of columns of \(\mathbf{A}\) equals the number of rows of \(\mathbf{B}\).
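  • The product can be checked in R, where %*% denotes matrix multiplication (the elementwise product A * B is something different):

A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE) # the matrix A
B <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = TRUE) # the matrix B
A %*% B # matrix product, giving rows (19, 22) and (43, 50)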