Chapter 10 Support Vector Machines

  • Here we approach the two-class classification problem in a direct way:

We try to find a plane that separates the classes in feature space.

  • If we cannot, we get creative in two ways:
    • We soften what we mean by “separates”, and
    • We enrich and enlarge the feature space so that separation is possible.

What is a Hyperplane?

  • A hyperplane in \(p\) dimensions is a flat affine subspace of dimension \(p-1\).

  • In general the equation for a hyperplane has the form

\[ \beta_0+\beta_1X_1 +\beta_2X_2 +\cdots +\beta_pX_p=0 \]

  • In \(p=2\) dimensions a hyperplane is a line.

  • If \(\beta_0=0\), the hyperplane goes through the origin, otherwise not.

  • The vector \(\beta=(\beta_1,\beta_2, \cdots, \beta_p)\) is called the normal vector - it points in a direction orthogonal to the surface of the hyperplane.
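
  • As a quick numeric check (an illustrative sketch, not from the text): take two points lying on the hyperplane \(1+2x_1+3x_2=0\) and verify that their difference is orthogonal to the normal vector \(\beta=(2,3)\).

beta <- c(2, 3)          # normal vector of 1 + 2*x1 + 3*x2 = 0
p1 <- c(1, -1)           # 1 + 2*1 + 3*(-1) = 0, so p1 lies on the hyperplane
p2 <- c(-2, 1)           # 1 + 2*(-2) + 3*1 = 0, so p2 lies on the hyperplane
sum(beta * (p1 - p2))    # inner product is 0: beta is orthogonal to the hyperplane
## [1] 0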

  • Hyperplane in 2 dimensions

  • The hyperplane \(1+2x_1+3x_2=0\) divides the \((x_1, x_2)\)-plane into two half-planes: points with \(1+2x_1+3x_2>0\) are shown in green and points with \(1+2x_1+3x_2<0\) in red.
plot(NULL, type = "n", xlim = c(-2, 2), ylim = c(-2, 2),
     xlab = bquote(x[1]), ylab = bquote(x[2]),
     main = expression(paste("Affine hyperplane: 1 + 2", x[1], " + 3", x[2], " = 0", sep = "")))
## add line
abline(a = -1/3, b = -2/3)

## points
x1 <- x2 <- seq(from = -2, to = 2, length = 40)
for (i in 1:length(x1)) {
    x <- rep(x1[i], times = length(x1))
    points(x = x, y = x2, pch = 20, cex = .5,
           col = ifelse(1 + 2*x + 3*x2 > 0, "green", "red"))
}
points(x = 1, y = 1, pch = 20, col = "blue")
text(x = 1, y = 1, labels = "(1, 1)", cex = .9, col = "blue", pos = 4)

Separating hyperplanes

  • If \(f(X)=\beta_0+\beta_1X_1 +\beta_2X_2 +\cdots +\beta_pX_p\), then \(f(X)>0\) for points on one side of the hyperplane, and \(f(X)<0\) for points on the other.

  • If we code the colored points as \(Y_i=+1\) for blue, say, and \(Y_i=-1\) for mauve, then if \(Y_if(X_i)>0\) for all \(i\), \(f(X)=0\) defines a separating hyperplane.
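
  • As an illustration (a small sketch with made-up points, not from the text): code the classes as \(\pm 1\) and check the condition \(Y_if(X_i)>0\) for the hyperplane above.

f <- function(x1, x2) 1 + 2*x1 + 3*x2              # the hyperplane from above
xp <- rbind(c(1, 1), c(0, 1), c(-1, -1), c(0, -1)) # two points on each side
y  <- c(1, 1, -1, -1)                              # +1 for the f > 0 side, -1 for the f < 0 side
all(y * f(xp[, 1], xp[, 2]) > 0)                   # TRUE: the hyperplane separates these points
## [1] TRUE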

Maximal margin classifier

  • Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes.

  • This can be rephrased as a convex quadratic program, and solved efficiently.
    • The function svm() in package e1071 solves this problem efficiently.
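
  • With a very large cost, svm() essentially reproduces the maximal margin classifier on separable data. A minimal sketch on simulated, well-separated data (the object names here are illustrative):

library(e1071)
set.seed(123)
xs <- matrix(rnorm(20 * 2), ncol = 2)
xs[1:10, ] <- xs[1:10, ] + 4                 # shift one class far away so the classes separate
dsep <- data.frame(x1 = xs[, 1], x2 = xs[, 2], y = factor(c(rep(1, 10), rep(-1, 10))))
mmfit <- svm(y ~ ., data = dsep, kernel = "linear",
             cost = 1e5,                     # very large cost approximates the hard (maximal) margin
             scale = FALSE)
length(mmfit$index)                          # typically only the few observations on the margin remain support vectors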

Non-separable data

Noisy data

  • Sometimes the data are separable, but noisy.
    • This can lead to a poor solution for the maximal-margin classifier.
  • The support vector classifier maximizes a soft margin.

Support vector classifier

  • \(C\) is a regularization parameter: it acts as a budget for how much the margin can be violated, and thus controls the bias-variance trade-off.
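
  • In svm() this regularization role is played by the cost argument: a small cost gives a wide, soft margin with many support vectors, a large cost a narrower margin with fewer. A small illustrative comparison on simulated, overlapping data (not the example used below):

library(e1071)
set.seed(2)
xc <- matrix(rnorm(20 * 2), ncol = 2)
xc[1:10, ] <- xc[1:10, ] + 1.5               # the classes overlap somewhat
dc <- data.frame(x1 = xc[, 1], x2 = xc[, 2], y = factor(c(rep(1, 10), rep(-1, 10))))
fit_soft <- svm(y ~ ., data = dc, kernel = "linear", cost = 0.01, scale = FALSE)
fit_hard <- svm(y ~ ., data = dc, kernel = "linear", cost = 100, scale = FALSE)
c(soft = length(fit_soft$index), hard = length(fit_hard$index))  # the softer margin typically uses more support vectors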

Example

  • The R library e1071 contains the svm() function, which fits support vector machines, along with a number of other methods.

  • To illustrate the support vector classifier, consider simulated data generated from \(N(\mu_i, \Sigma)\), \(i=1,2\), with \(\mu_1=(0,0)'\), \(\mu_2=(1,1)'\), and \(\Sigma=I_2\), a \(2\times 2\) identity matrix.

set.seed(1) # for exact replication
x <- matrix(rnorm(n = 20 * 2), ncol = 2) # two bivariate standard normal variates
y <- c(rep(-1, times = 10), rep(1, times = 10)) # class identifiers -1 and +1
x[y == 1, ] <- x[y == 1, ] + 1 # set the mean of the second to mu = (1, 1)
data <- data.frame(x = x, y = as.factor(y)) # data frame for svm
colnames(data) <- c("x1", "x2", "y") # ccolumn names 
head(data) 
##           x1          x2  y
## 1 -0.6264538  0.91897737 -1
## 2  0.1836433  0.78213630 -1
## 3 -0.8356286  0.07456498 -1
## 4  1.5952808 -1.98935170 -1
## 5  0.3295078  0.61982575 -1
## 6 -0.8204684 -0.05612874 -1
plot(x = data$x2, y = data$x1, pch = 20, xlab = expression(x[2]), ylab = expression(x[1]),
     col = ifelse(y == -1, "blue", "red"), # blue for group -1 and red for +1
     main = "Two normal populations")

  • The figure shows that the classes are not linearly separable.
## install.packages("e1071", repos = "https://cran.uib.no")
library(e1071)
#library(help = e1071) # a brief description of the methods
svmfit <- svm(y ~ ., data = data,
              kernel = "linear", # svm with kernel "linear" produces sv classifier
              cost = 10, # this is related to C in equation (13)
              scale = FALSE # here no need to scale predictors
              ) # svm
plot(svmfit, data = data, formula = x2 ~ x1) # (see help(plot.svm)),

##          by default x1 is on the vertical axis, using formula = x2 ~ x1 changes the axes
## help(plot.svm)
## dev.print(pdf, file = "../lectures/figures/ex82b.pdf")
svmfit$index # print support vector indexes
## [1]  1  2  5  7 14 16 17
  • The decision boundary is linear (even though it appears jagged in the plot).

  • The seven support vectors here are indicated by crosses.

  • Because the boundary is linear, one of the training observations is misclassified.

summary(svmfit) # some summary
## 
## Call:
## svm(formula = y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
round(coef(svmfit), 3) # coefficients of the linear decision boundary
## (Intercept)          x1          x2 
##       1.507      -1.556      -1.612
  • The tune() function can be used for (10-fold) cross-validation.

  • The function needs a range for the cost parameter.

set.seed(1) # for exact replication

## help(tune)
tune.out <- tune(svm, # function to be tuned
                 y~., data = data, # formula expression alternative
                 kernel = "linear", # SV classifier 
                 ranges = list(cost = c(.001, .01, .1, 1, 5, 10, 100))
                 )
summary(tune.out) # Cross-validation results
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.1
## 
## - best performance: 0.05 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03  0.55  0.4377975
## 2 1e-02  0.55  0.4377975
## 3 1e-01  0.05  0.1581139
## 4 1e+00  0.15  0.2415229
## 5 5e+00  0.15  0.2415229
## 6 1e+01  0.15  0.2415229
## 7 1e+02  0.15  0.2415229
  • cost = 0.1 gives the lowest cross-validation error.

  • The tune() function stores the best model.

bestmod <- tune.out$best.model # extract the best model.
summary(bestmod)
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = data, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
round(coef(bestmod), 3) # coefficients of the best linear decision boundary
## (Intercept)          x1          x2 
##       0.061      -0.516      -0.532
plot(bestmod, data = data)

  • The predict() function can be used to predict the class labels of test observations, at any given value of the cost parameter.

  • Generate test observations and produce predictions.

xtest <- matrix(rnorm(20 * 2), ncol = 2) # test observations
ytest <- sample(c(-1, 1), size = 20, replace = TRUE) # randomly assigned groups
xtest[ytest == 1, ] <- xtest[ytest == 1, ] + 1 # shift the mean of class +1 test observations
testdata <- data.frame(x = xtest, y = as.factor(ytest)) # test data frame
colnames(testdata) <- c("x1", "x2", "y") # test data need to have correct var names
ypred <- predict(bestmod, testdata) # predictions
table(predicted = ypred, realized = testdata$y) # predictions vs true observed
##          realized
## predicted -1 1
##        -1  9 1
##        1   2 8
## 3 out of 20 are misclassified
  • With cost = 0.1, 17 out of the 20 test observations are correctly classified (3 are misclassified).

Linear boundary can fail

Feature expansion

  • Enlarge the space of features by including transformations; e.g. \(X_1^2, X_1^3, X_1X_2, X_1X_2^2, \ldots\)

    • Hence we go from a \(p\)-dimensional space to an \(M>p\) dimensional space.
  • Fit a support-vector classifier in the enlarged space.

  • This results in non-linear decision boundaries in the original space.

  • Example

    • Suppose we use (\(X_1, X_2, X_1^2, X_2^2, X_1X_2\)) instead of just (\(X_1, X_2\)).
    • Then the decision boundary would be of the form

\[ \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_1^2+\beta_4X_2^2+\beta_5X_1X_2=0 \]

  • This leads to nonlinear decision boundaries in the original space (quadratic conic sections).
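
  • For illustration, the expansion can be done by hand: add the quadratic terms as extra columns and fit a linear support vector classifier in the enlarged space. A sketch on simulated data with a circular true boundary (object and variable names are made up):

library(e1071)
set.seed(3)
xq <- matrix(rnorm(200 * 2), ncol = 2)
yq <- factor(ifelse(xq[, 1]^2 + xq[, 2]^2 > 1.5, 1, -1))   # circular (non-linear) class boundary
dq <- data.frame(x1 = xq[, 1], x2 = xq[, 2],
                 x1sq = xq[, 1]^2, x2sq = xq[, 2]^2, x1x2 = xq[, 1] * xq[, 2],
                 y = yq)
fitq <- svm(y ~ ., data = dq, kernel = "linear", cost = 10, scale = FALSE)
mean(predict(fitq, newdata = dq) == dq$y)  # high training accuracy: the classes are linearly separable in the enlarged space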

Cubic polynomials

  • Here we use a basis expansion of cubic polynomials.

  • From 2 variables to 9.

  • Fitting the support-vector classifier in the enlarged space gives a decision boundary that, in the original lower-dimensional space, has the form

\[ \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_1^2+\beta_4X_2^2+\beta_5X_1X_2+\beta_6X_1^3+\beta_7X_2^3+\beta_8X_1X_2^2+\beta_9X_1^2X_2=0 \]

Nonlinearities and kernels

  • Polynomials (especially high-dimensional ones) get wild rather fast.

  • There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers - through the use of kernels.

  • Before we discuss these, we must understand the role of inner products in support-vector classifiers.

Inner products and support vectors

  • Inner product between vectors

\[ \langle x_i,x_{i'}\rangle=\sum_{j=1}^p x_{ij}x_{i'j} \]

  • The linear support vector classifier can be represented as (with \(n\) parameters \(\alpha_1,\ldots,\alpha_n\))

\[ f(x)=\beta_0+\sum_{i=1}^n \alpha_i \langle x,x_i\rangle \]

  • To estimate the parameters \(\alpha_1,\ldots, \alpha_n\) and \(\beta_0\), all we need are the \(\binom{n}{2}\) inner products \(\langle x_i,x_{i'}\rangle\) between all pairs of training observations.

  • It turns out that most of the \(\hat{\alpha}_i\) can be zero:

\[ f(x)=\beta_0+\sum_{i\in S}\hat{\alpha}_i \langle x,x_i\rangle \]

  • \(S\) is the support set of indices \(i\) such that \(\hat{\alpha}_i>0\)
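
  • In e1071 the pieces of this representation are exposed in the fitted object: $coefs holds \(\hat{\alpha}_i y_i\) for the support vectors, $SV the support vectors, and $rho the negative intercept, so for the linear kernel \(f(x)\) can be rebuilt from the support set alone. A sketch using the svmfit object from the earlier example (assuming these standard e1071 conventions):

w  <- t(svmfit$SV) %*% svmfit$coefs                   # implied (beta_1, beta_2), built from the support set only
b0 <- -svmfit$rho                                     # implied beta_0
round(c(b0, w), 3)                                    # should agree with coef(svmfit) shown earlier
f_hat <- b0 + as.matrix(data[, c("x1", "x2")]) %*% w  # f(x) evaluated at the training points
table(sign(f_hat))                                    # the classification depends only on the sign of f(x)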

Kernels and support vector machines

  • If we can compute inner-products between observations, we can fit a SV classifier.

  • Some special kernel functions can do this for us. E.g.

\[ K(x_i,x_{i'})=\left(1+\sum_{j=1}^p x_{ij}x_{i'j}\right)^d \]

      computes the inner products needed for degree-\(d\) polynomial expansions: \(\binom{p+d}{d}\) basis functions!

  • The solution has the form

\[ f(x)=\beta_0+\sum_{i\in S} \hat{\alpha}_i K(x,x_i) \]
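
  • As a small numerical check (illustrative, not from the text): for \(p=2\) and \(d=2\), the polynomial kernel equals the ordinary inner product of the expanded feature vectors \((1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2)\).

phi <- function(x) c(1, sqrt(2)*x[1], sqrt(2)*x[2], x[1]^2, x[2]^2, sqrt(2)*x[1]*x[2])
u <- c(0.3, -1.2); v <- c(2, 1)
(1 + sum(u * v))^2       # kernel evaluation: 0.16
sum(phi(u) * phi(v))     # inner product in the expanded feature space: also 0.16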

Radial Kernel

\[ K(x_i,x_{i'})=\exp\left(-\gamma \sum_{j=1}^p (x_{ij}-x_{i'j})^2\right) \]
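
  • The radial kernel is easy to evaluate directly: it is close to 1 for nearby points and decays towards 0 for distant points, so the resulting fit is local (a small sketch with an arbitrary \(\gamma\)):

rbf <- function(x, xp, gamma) exp(-gamma * sum((x - xp)^2))
rbf(c(0, 0), c(0.1, 0.1), gamma = 1)   # nearby points: kernel value close to 1
rbf(c(0, 0), c(3, 3), gamma = 1)       # distant points: kernel value close to 0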

Example: Heart data

  • The ROC curve is obtained by replacing the threshold 0 with a threshold \(t\), classifying to the positive class when \(\hat{f}(X)>t\), and recording the false positive and true positive rates as \(t\) varies (a small manual sketch follows below).

  • Here we see ROC curves on training data.

  • Example: Heart test data
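
  • A manual sketch of the construction, using made-up fitted values and labels (not the Heart data):

set.seed(4)
f_hat  <- c(rnorm(50, mean = 1), rnorm(50, mean = -1))        # made-up fitted values; positives tend to score higher
labels <- c(rep(1, 50), rep(-1, 50))
ts  <- sort(unique(f_hat), decreasing = TRUE)                 # candidate thresholds t
tpr <- sapply(ts, function(t) mean(f_hat[labels ==  1] > t))  # true positive rate at each t
fpr <- sapply(ts, function(t) mean(f_hat[labels == -1] > t))  # false positive rate at each t
plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curve obtained by varying the threshold t")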

Example

  • Generate bivariate normal data in which class 1 observations have mean \(\mu=(2,2)'\) or \(\mu=(-2,-2)'\), and class 2 observations have mean \(\mu=(0,0)'\).
library(e1071)
set.seed(1) # for exact replication
x <- matrix(rnorm(n = 200 * 2), ncol = 2) # bivariate normal
x[1:100, ] <- x[1:100, ] + 2 # shift the mean of the first 100 obs to (2, 2)
x[101:150, ] <- x[101:150, ] - 2 # the next 50 obs get mean (-2, -2)
## the remaining 50 observations keep mean (0, 0) in both coordinates
y <- c(rep(1, times = 150), rep(2, times = 50))
data <- data.frame(x = x, y = as.factor(y))
colnames(data) <- c("x1", "x2", "y")
head(data)
##         x1         x2 y
## 1 1.373546  2.4094018 1
## 2 2.183643  3.6888733 1
## 3 1.164371  3.5865884 1
## 4 3.595281  1.6690922 1
## 5 2.329508 -0.2852355 1
## 6 1.179532  4.4976616 1
plot(x = x[, 2], y = x[, 1],
     col = ifelse(y == 1, "red", "blue"), pch = 20,
     xlab = expression(x[2]), ylab = expression(x[1]),
     main = "Non-linearly Separating Classes")

  • Split the data randomly into training and test sets.

  • Fitting an SVM with a radial kernel, gamma = 1, and cost = 1 to the training set gives the following results.

train <- sample(x = 200, size = 100) # training observations
svmfit <- svm(y ~ ., data = data[train, ], kernel = "radial", gamma = 1, cost = 1)
plot(svmfit, data[train, ])

  • The decision boundary is highly non-linear.
## summary of results
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = data[train, ], kernel = "radial", gamma = 1, 
##     cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  31
## 
##  ( 16 15 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2
  • We can use cross-validation to find the best-fitting gamma (\(\gamma\)) and cost.
set.seed(1) # for exact replication
tune.out <- tune(svm, y ~ ., data = data[train, ], kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100, 1000),
                               gamma = c(0.5, 1, 2, 3, 4)))
summary(tune.out) # cross-validation results
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.5
## 
## - best performance: 0.07 
## 
## - Detailed performance results:
##     cost gamma error dispersion
## 1  1e-01   0.5  0.26 0.15776213
## 2  1e+00   0.5  0.07 0.08232726
## 3  1e+01   0.5  0.07 0.08232726
## 4  1e+02   0.5  0.14 0.15055453
## 5  1e+03   0.5  0.11 0.07378648
## 6  1e-01   1.0  0.22 0.16193277
## 7  1e+00   1.0  0.07 0.08232726
## 8  1e+01   1.0  0.09 0.07378648
## 9  1e+02   1.0  0.12 0.12292726
## 10 1e+03   1.0  0.11 0.11005049
## 11 1e-01   2.0  0.27 0.15670212
## 12 1e+00   2.0  0.07 0.08232726
## 13 1e+01   2.0  0.11 0.07378648
## 14 1e+02   2.0  0.12 0.13165612
## 15 1e+03   2.0  0.16 0.13498971
## 16 1e-01   3.0  0.27 0.15670212
## 17 1e+00   3.0  0.07 0.08232726
## 18 1e+01   3.0  0.08 0.07888106
## 19 1e+02   3.0  0.13 0.14181365
## 20 1e+03   3.0  0.15 0.13540064
## 21 1e-01   4.0  0.27 0.15670212
## 22 1e+00   4.0  0.07 0.08232726
## 23 1e+01   4.0  0.09 0.07378648
## 24 1e+02   4.0  0.13 0.14181365
## 25 1e+03   4.0  0.15 0.13540064
  • Several (cost, gamma) combinations tie for the lowest cross-validation error of 0.07; tune() reports cost = 1 and gamma = 0.5 as the best parameters.

  • Using the selected model on the test set we get

## test set results
table(true = data[-train, "y"], pred = predict(tune.out$best.model, newdata = data[-train, ]))
##     pred
## true  1  2
##    1 54 23
##    2 17  6
  • The off-diagonal counts in the table are the misclassified test observations (40 out of 100 in the run shown).

  • Let us next produce ROC curves for the classifier.

  • R has a package, ROCR, for this purpose.

  • SVMs and support vector classifiers output class labels for each observation.

  • However, it is also possible to obtain the fitted values \(\hat{f}(x)\) for each observation.

  • svm() produces fitted values when the parameter decision.values = TRUE is set in the function call; they can then be extracted from predict().

## ROC curves
#install.packages("ROCR", repos = "https://cran.uib.no")
library(ROCR)
## library(help = ROCR)

## fit svm with cost = 1 and gamma = 2 (one of the parameter combinations with the lowest CV error), and produce decision values
svmfit.opt <- svm(y ~ ., data = data[train, ], kernel = "radial",
                  gamma = 2, cost = 1, decision.values = TRUE) 
fitted <- attributes(predict(svmfit.opt, newdata = data[train, ],
                             decision.values = TRUE))$decision.values
head(fitted)
##           1/2
## 148  1.130044
## 192 -1.038648
## 87   1.188065
## 20   1.204742
## 112  1.195899
## 177 -1.096820
# help(performance)
## help(prediction)
par(mfrow = c(1, 2))
## training ROC plot
perf <- performance(prediction(fitted, labels = data[train, "y"]),
                    "tpr", # true positive rate
                    "fpr"  # false positive rate
                    ) # training perform
plot(perf, col = "blue", main = "Training Data")

## add ROC plot for a more flexible fit
svmfit.flex <- svm(y ~ ., data = data[train, ], kernel = "radial",
                  gamma = 50, cost = 1, decision.values = TRUE) 
fitted <- attributes(predict(svmfit.flex, newdata = data[train, ],
                             decision.values = TRUE))$decision.values
perf <- performance(prediction(fitted, labels = data[train, "y"]),
                    "tpr", # true positive rate
                    "fpr"  # false positive rate
                    ) # training perform
plot(perf, col = "red", add = TRUE)
legend("bottomright", legend = c("Optimal fit", "Flexible"),
       lty = 1, col = c("blue", "red"), bty = "n")

## test ROC plot
fitted <- attributes(predict(svmfit.opt, newdata = data[-train, ],
                             decision.values = TRUE))$decision.values
perf <- performance(prediction(fitted, labels = data[-train, "y"]),
                    "tpr", # true positive rate
                    "fpr"  # false positive rate
                    ) # training perform
plot(perf, col = "blue", main = "Test Data")
## flexible
fitted <- attributes(predict(svmfit.flex, newdata = data[-train, ],
                             decision.values = TRUE))$decision.values
perf <- performance(prediction(fitted, labels = data[-train, "y"]),
                    "tpr", # true positive rate
                    "fpr"  # false positive rate
                    ) # training perform
plot(perf, col = "red", add = TRUE)
legend("bottomright", legend = c("Optimal fit", "Flexible"),
       lty = 1, col = c("blue", "red"), bty = "n")

SVMs: more than 2 classes?

  • The SVM as defined works for \(K=2\) classes.

  • What do we do if we have \(K>2\) classes?

  • OVA: One versus All

    • Fit \(K\) different 2-class SVM classifiers \(\hat{f}_k(x)\), \(k=1,\ldots, K\); each class versus the rest.
    • Classify \(x^*\) to the class for which \(\hat{f}_k(x^*)\) is largest.
  • OVO: One versus One

    • Fit all \(\binom{K}{2}\) pairwise classifiers \(\hat{f}_{kl}(x)\).
    • Classify \(x^*\) to the class that wins the most pairwise competitions.
  • Which to choose?

    • If \(K\) is not too large, use OVO.
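
  • For illustration: e1071's svm() handles \(K>2\) classes automatically with the one-versus-one approach, taking a vote over all \(\binom{K}{2}\) pairwise classifiers. A quick sketch on the built-in iris data (\(K=3\)):

library(e1071)
irisfit <- svm(Species ~ ., data = iris, kernel = "linear", cost = 1)
table(predicted = predict(irisfit, newdata = iris), true = iris$Species)  # training confusion matrix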

Support vector versus logistic regression?

  • With \(f(X)=\beta_0+\beta_1X_1+\cdots +\beta_pX_p\) we can rephrase the support-vector classifier optimization as

\[ \mathrm{minimize}_{\beta_0,\beta_1,\ldots, \beta_p}\left\{\sum_{i=1}^n \max[0, 1-y_if(x_i)]+\lambda\sum_{j=1}^p\beta_j^2\right\} \]

  • This has the form loss plus penalty.

  • The loss is known as the hinge loss.

  • Very similar to “loss” in logistic regression (negative log-likelihood).
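
  • Plotting the two losses as functions of \(yf(x)\) makes the similarity concrete (a small sketch):

yf <- seq(from = -3, to = 3, length = 200)
hinge    <- pmax(0, 1 - yf)       # support vector classifier (hinge) loss
logistic <- log(1 + exp(-yf))     # logistic regression loss (negative log-likelihood, up to scaling)
plot(yf, hinge, type = "l", col = "blue", xlab = expression(y * f(x)), ylab = "Loss",
     main = "Hinge vs. logistic loss")
lines(yf, logistic, col = "red")
legend("topright", legend = c("Hinge", "Logistic"), lty = 1, col = c("blue", "red"), bty = "n")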

Which to use: SVM or logistic regression

  • When classes are (nearly) separable, SVM does better than LR. So does LDA.

  • When not, LR (with ridge penalty) and SVM very similar.

  • If you wish to estimate probabilities, LR is the choice.

  • For nonlinear boundaries, kernel SVMs are popular.

    • Can use kernels with LR and LDA as well, but computations are more expensive.