The goal of this chapter is to create an interactive data visualization that explains the Support Vector Machine (SVM), a machine learning model for binary classification.
Chapter summary: we simulate a binary classification data set, visualize it in the original and squared feature spaces, fit linear SVMs in the squared feature space, and create interactive animints that show which data points correspond between the two spaces, and how the decision boundary, margin, and slack change as a function of the cost hyper-parameter and the kernel.
We begin by generating two input features, x1 and x2.
library(data.table)
N <- 50
set.seed(1)
getInput <- function()runif(2*N, -1, 1)
data.dt <- data.table(
x1=getInput(),
x2=getInput())
library(animint2)
ggplot()+
geom_point(aes(
x1, x2),
data=data.dt)
The plot below shows the same data, after computing two additional input features (the squares of the original two inputs).
data.dt[, let(
x1.sq = x1^2,
x2.sq = x2^2)][]
x1 x2 x1.sq x2.sq
1: -0.4689827 0.3094479 0.21994475 0.09575798
2: -0.2557522 -0.2936055 0.06540919 0.08620416
---
99: 0.6217405 -0.3640726 0.38656123 0.13254888
100: 0.2098666 0.5657027 0.04404398 0.32001952
ggplot()+
geom_point(aes(
x1.sq, x2.sq),
data=data.dt)
In our simulation, we assume that the output score f is a linear function of x1.sq, and does not depend on x2.sq. The plot below visualizes the output scores using the point fill aesthetic.
data.dt[, f := x1.sq]
true.decision.boundary <- 0.2
ggplot()+
theme_bw()+
scale_fill_gradient2(midpoint=true.decision.boundary)+
geom_point(aes(
x1.sq, x2.sq, fill=f),
color="grey",
data=data.dt)
In particular, we assume that the label y is negative (-1) if x1.sq + noise < threshold, and positive (1) otherwise.
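The code that simulates the noisy scores and labels is not shown above; below is a minimal sketch, assuming Gaussian noise with standard deviation 0.2 (the sd value is an assumption; the columns f.noise, y.num, and y are the ones used in the rest of the chapter, and the threshold is the true.decision.boundary defined earlier).
data.dt[, f.noise := f + rnorm(.N, sd=0.2)]  ## noisy score (sd=0.2 is an assumed value)
data.dt[, y.num := ifelse(f.noise < true.decision.boundary, -1, 1)]  ## numeric label
data.dt[, y := factor(y.num)]  ## factor label used for color/fill legends
table(data.dt$y)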
-1 1
56 44
The table above shows the label distribution in the data. Below we compute a grid of x1 values, with corresponding score values f.
(scores <- data.table(x1=seq(-1, 1, l=101))[
, x1.sq := x1^2
][
, f := x1.sq ][])
x1 x1.sq f
1: -1.00 1.0000 1.0000
2: -0.98 0.9604 0.9604
---
100: 0.98 0.9604 0.9604
101: 1.00 1.0000 1.0000
The table above will be used to draw the true score function below in black. Below we visualize the scores and labels, as a function of the input feature x1. Of course, we would not be able to make this visualization with real data (only the labels are known in real data, not the scores).
x1.boundaries <- data.table(
boundary=c(1, -1)*sqrt(true.decision.boundary))
ggplot()+
scale_y_continuous(breaks=seq(0, 1, by=0.2))+
geom_vline(aes(
xintercept=boundary),
color="grey50",
data=x1.boundaries)+
geom_line(aes(
x1, f),
data=scores)+
geom_point(aes(
x1, f.noise, color=y),
fill=NA,
data=data.dt)
Above we see a plot of the data in the original x1 feature space: the true score function f is shown as a black curve, the true decision boundaries as grey vertical lines, and the noisy scores as points colored by label y.
The plot below shows the scores and labels, as a function of the squared feature x1.sq.
x1sq.boundary <- data.table(boundary=true.decision.boundary)
ggplot()+
scale_y_continuous(breaks=seq(0, 1, by=0.2))+
scale_x_continuous(breaks=seq(0, 1, by=0.2))+
geom_vline(aes(
xintercept=boundary),
color="grey50",
data=x1sq.boundary)+
geom_line(aes(x1.sq, f), data=scores)+
geom_point(aes(
x1.sq, f.noise, color=y),
fill=NA,
data=data.dt)
The output above shows the transformed feature space, in which there is now only one vertical decision boundary line (instead of two). It is also clear that the score function we want to learn is linear in x1.sq: the previous black curve for f is now a line. The goal of the support vector machine is to learn this score function given the data points.
Next, we visualize the labels in the two-dimensional squared feature space.
ggplot()+
scale_y_continuous(breaks=seq(0, 1, by=0.2))+
scale_x_continuous(breaks=seq(0, 1, by=0.2))+
geom_vline(aes(
xintercept=boundary),
color="grey50",
data=x1sq.boundary)+
geom_point(aes(
x1.sq, x2.sq, color=y),
fill=NA,
data=data.dt)
In the figure above, it is clear that the decision boundary is linear in this space.
The plot below shows the input feature space (x1 and x2).
ggplot()+
scale_y_continuous(breaks=seq(-1, 1, by=0.2))+
scale_x_continuous(breaks=seq(-1, 1, by=0.2))+
geom_vline(aes(
xintercept=boundary),
color="grey50",
data=x1.boundaries)+
geom_point(aes(
x1, x2, color=y),
fill=NA,
data=data.dt)
In the figure above, we see two decision boundary lines, which indicate negative labels for small absolute values of x1, and positive labels otherwise. This means that the function to learn is non-linear in x1.
The goal of this section is to create an animint which uses clickSelects to show which points in the input and squared space correspond. We first need to create a data.i variable which is a unique ID for each data point.
data.dt[, data.i := 1:.N]
Then we define a helper function which adds y.var to a data table. This is an example of the addColumn then facet idiom.
YVAR <- function(dt, y.var)data.table(dt)[, let(
y.var = factor(y.var, c("x2", "x2.sq", "f")))]
We use the function above to construct the ggplots below, which begin with non-clickable geoms that we want to show in the background.
(gg.input <- ggplot()+
ylab("")+
facet_grid(y.var ~ ., scales="free")+
geom_vline(aes(
xintercept=boundary),
color="grey50",
data=x1.boundaries)+
geom_line(aes(
x1, f),
data=YVAR(scores, "f")))
The plot above shows the true score function and decision boundaries in the original x1 feature space.
(gg.square <- ggplot()+
ylab("")+
facet_grid(y.var ~ ., scales="free")+
geom_vline(aes(
xintercept=boundary),
color="grey50",
data=x1sq.boundary)+
geom_line(aes(
x1.sq, f),
data=YVAR(scores, "f")))
The plot above shows the true score function and decision boundary in the squared x1.sq feature space.
Next, we add points to represent the data in the original feature space.
input.points <- rbind(
YVAR(data.dt, "x2")[, y.val := x2],
YVAR(data.dt, "f")[, y.val := f.noise])
point.size <- 4
(gg.input.point <- gg.input+
geom_point(aes(
x1, y.val, fill=y),
clickSelects="data.i",
color_off=NA,
color="black",
size=point.size,
data=input.points))
The figure above shows the noise feature x2 in the top facet, and the score function f in the bottom facet. Below we do the same for the squared feature space.
square.points <- rbind(
YVAR(data.dt, "x2.sq")[, y.val := x2.sq],
YVAR(data.dt, "f")[, y.val := f.noise])
(gg.square.point <- gg.square+
geom_point(aes(
x1.sq, y.val, fill=y),
clickSelects="data.i",
color_off=NA,
color="black",
size=point.size,
data=square.points))
The figure above shows the squared noise feature x2.sq in the top facet, and the score function f in the bottom facet. Below we combine the two ggplots into an animint.
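The combining code is not shown above; below is a minimal sketch, assuming the animint is named viz.correspondence (the name referenced in the exercises at the end of the chapter) and that the fill legend of the first plot is hidden via guides(), as described below.
(viz.correspondence <- animint(
  input=gg.input.point+guides(fill="none"),  ## hide fill legend; one legend is enough
  square=gg.square.point))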
Note how we used two multi-panel plots with the addColumn then facet idiom, rather than creating four separate plots. This emphasizes the fact that some plots/facets have a common x1 or x1.sq axis. Note that we also hid the fill legend in the first plot, since it is sufficient to just have one fill legend.
In this section we explain how to visualize the linear support vector machine. First we divide the data into half subtrain, half validation.
train.i <- 1:N
data.dt[
, set := "validation"
][
train.i, set := "subtrain"
]
table(data.dt$set)
subtrain validation
50 50
The output above shows that there are 50 subtrain data and 50 validation data. The subtrain data will be used as input to the SVM solver for a given cost parameter, and the validation data will be used to select the best cost hyper-parameter. The plot below visualizes each set in its own panel.
ggplot()+
facet_grid(set ~ .)+
geom_vline(aes(
xintercept=boundary),
color="grey50",
data=x1.boundaries)+
geom_point(aes(
x1, x2, color=y),
data=data.dt)
We begin by fitting a linear SVM to the subtrain data in the squared feature space.
library(kernlab)
Attaching package: 'kernlab'
The following object is masked from 'package:animint2':
alpha
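The code that prepares the subtrain data and fits the model is not shown above; below is a minimal sketch, assuming the default cost C=1 and the vanilladot (linear) kernel reported in the model summary that follows (the names subtrain.dt, squared.mat, and y.vec match their usage later in the chapter).
subtrain.dt <- data.dt[set=="subtrain"]  ## subtrain rows only
squared.mat <- subtrain.dt[, cbind(x1.sq, x2.sq)]  ## squared feature matrix
y.vec <- subtrain.dt$y  ## factor labels
(fit <- ksvm(squared.mat, y.vec, kernel="vanilladot"))  ## linear SVM, default C=1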
Setting default kernel parameters
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Linear (vanilla) kernel function.
Number of Support Vectors : 23
Objective Function Value : -20.7722
Training error : 0.18
The output above indicates the SVM was fit using a linear kernel with cost=1. Next, we visualize the true labels y along with the predicted labels pred.y.
subtrain.dt$pred.y <- predict(fit)
ggplot()+
geom_point(aes(
x1.sq, x2.sq, color=pred.y, fill=y),
size=4,
stroke=2,
data=subtrain.dt)
It is clear from the plot above that there are several mis-classified subtrain data points. Next, we want to visualize the decision boundary and margin. To do that, we first define a function which demonstrates how to compute predicted scores for the SVM.
predF <- function(fit, X){
fit.sc <- scaling(fit)$x.scale
if(is.null(fit.sc)){
fit.sc <- list(
"scaled:center"=c(0,0),
"scaled:scale"=c(1,1))
}
mu <- fit.sc[["scaled:center"]]
sigma <- fit.sc[["scaled:scale"]]
X.sc <- scale(X, mu, sigma)
kernelMult(
kernelf(fit),
X.sc,
xmatrix(fit)[[1]],
coef(fit)[[1]])-b(fit)
}
data.table(
predict=predict(fit, squared.mat, type="dec"),
predF=predF(fit, squared.mat))
predict.V1 predF.V1
1: -0.3733772 -0.3733772
2: -1.2297201 -1.2297201
---
49: -0.6735054 -0.6735054
50: -0.9999999 -0.9999999
The code above first uses scale() and then kernelMult() to compute predicted values. The output above shows that this computation is consistent with the output from predict(). Next, we use the prediction function on a grid of values in the squared feature space.
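The grid construction is not shown above; below is a minimal sketch, assuming a regular grid with spacing 0.025 on both squared features (which matches the 41 x 41 = 1681 rows in the output below).
grid.sq.dt <- data.table::CJ(  ## all combinations, sorted by x1.sq then x2.sq
  x1.sq=seq(0, 1, by=0.025),
  x2.sq=seq(0, 1, by=0.025))
grid.sq.dt[, pred.f := predF(fit, cbind(x1.sq, x2.sq))]  ## predicted score on the grid
grid.sq.dt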
x1.sq x2.sq pred.f
1: 0 0.000 -1.549034
2: 0 0.025 -1.562162
---
1680: 1 0.975 3.512834
1681: 1 1.000 3.499706
The output above is a table of grid points with corresponding predicted values, which we use as input to geom_contour() below.
subtrain.dt[, train.error := ifelse(y==pred.y, "correct", "error")]
(gg.contour <- ggplot()+
theme_bw()+
scale_color_manual(values=c(error="black", correct=NA))+
geom_point(aes(
x1.sq, x2.sq, fill=y, color=train.error),
stroke=1,
size=4,
data=subtrain.dt)+
geom_vline(aes(
xintercept=boundary), color="grey50",
data=x1sq.boundary)+
geom_contour(aes(
x1.sq, x2.sq, z=pred.f),
breaks=0,
color="black",
data=grid.sq.dt)+
geom_contour(aes(
x1.sq, x2.sq, z=pred.f),
breaks=c(-1, 1),
color="black",
linetype="dashed",
data=grid.sq.dt))
The plot above shows the true decision boundary using a grey vline. It also uses geom_contour() to display the decision boundary (solid black line, predicted score 0) and the margin (dashed black line, predicted score -1 and 1). Since the decision boundary and margin are linear in this space, we can also use geom_abline() to display them. To do that we need to do some math, and work out the equations for the slope and intercepts of those lines (as a function of the learned bias b(fit) and weight.vec, as well as the scale parameters mu and sigma). First, we verify that this computation of the linear SVM predicted values is consistent.
fit.sc <- scaling(fit)$x.scale
if(is.null(fit.sc)){
fit.sc <- list(
"scaled:center"=c(0,0),
"scaled:scale"=c(1,1))
}
mu <- fit.sc[["scaled:center"]]
sigma <- fit.sc[["scaled:scale"]]
weight.vec <- colSums(xmatrix(fit)[[1]]*coef(fit)[[1]])
predF.linear <- function(fit, X){
X.sc <- scale(X, mu, sigma)
X.sc %*% weight.vec - b(fit)
}
data.table(
predict=predict(fit, squared.mat, type="dec"),
predF.linear=predF.linear(fit, squared.mat))
predict.V1 predF.linear.V1
1: -0.3733772 -0.3733772
2: -1.2297201 -1.2297201
---
49: -0.6735054 -0.6735054
50: -0.9999999 -0.9999999
Above we see that the two columns have the same values, which indicates that our definition of predF.linear() is correct. To derive the linear equations for the decision boundary and margin lines, we solve the equation f(x)=c for x2 as a function of x1 (since x2 is plotted on the vertical axis, and x1 is on the horizontal). The equation of these lines is therefore x2 = m2 + (s2/w2)*(c + b + w1*m1/s1) - (s2*w1)/(w2*s1)*x1, where
- x is the squared input feature vector,
- m is the mean vector,
- s is the scale vector,
- w is the learned weight vector,
- c is -1 or 1 for the margin lines, and 0 for the decision boundary.
Below we convert that equation to R code.
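The conversion to R code is not shown above; below is a minimal sketch, which mirrors the abline.dt computation used inside the for loop later in this chapter (using the mu, sigma, weight.vec, and b(fit) values computed above).
(abline.dt <- data.table(
  y=factor(c(-1, 0, 1)),  ## level curve value c, as a factor for joining with labels later
  boundary=c("margin", "decision", "margin"),
  intercept=mu[2]+sigma[2]/weight.vec[2]*(
    c(-1, 0, 1)+b(fit)+weight.vec[1]*mu[1]/sigma[1]),
  slope=-weight.vec[1]*sigma[2]/(weight.vec[2]*sigma[1])))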
y boundary intercept slope
1: -1 margin -1.045536 10.61441
2: 0 decision -2.949856 10.61441
3: 1 margin -4.854175 10.61441
The table above has three rows, one for each line that we want to draw for the level curves of the predicted values from the linear SVM. Below we add these to the previous plot using geom_abline().
gg.contour+
scale_linetype_manual(values=c(margin="dashed", decision="solid"))+
geom_abline(aes(
slope=slope, intercept=intercept, linetype=boundary),
color="green",
alpha=0.5,
size=1,
data=abline.dt)
The plot above confirms that our computation of the slope and intercepts (green lines) agrees with the contours (black lines). In the plot below, we show the learned alpha coefficients of the kernel function, and add a geom_segment() to visualize the slack.
subtrain.dt[, alpha := 0]
train.row.vec <- as.integer(rownames(xmatrix(fit)[[1]]))
subtrain.dt[train.row.vec, alpha := kernlab::alpha(fit)[[1]] ]
subtrain.dt[, status := fcase(
alpha==0, "alpha=0",
alpha==1, "alpha=C",
default="0<alpha<C")]
slack.slope <- weight.vec[2]*sigma[1]/(weight.vec[1]*sigma[2])
slack.dt <- subtrain.dt[alpha==1]
slack.join <- abline.dt[slack.dt, on=list(y)]
slack.join[, x1.sq.margin := (
x2.sq-slack.slope*x1.sq-intercept)/(slope-slack.slope)]
slack.join[, x2.sq.margin := slope*x1.sq.margin + intercept]
sv.colors <- c(
"alpha=0"="white",
"0<alpha<C"="black",
"alpha=C"="grey")
ggplot()+
theme_bw()+
scale_linetype_manual(values=c(margin="dashed", decision="solid"))+
geom_vline(aes(
xintercept=boundary), color="violet",
data=x1sq.boundary)+
geom_abline(aes(
slope=slope, intercept=intercept, linetype=boundary),
size=1,
data=abline.dt)+
geom_segment(aes(
x1.sq, x2.sq,
xend=x1.sq.margin, yend=x2.sq.margin),
color="grey",
data=slack.join)+
scale_fill_manual(values=sv.colors, breaks=names(sv.colors))+
geom_point(aes(
x1.sq, x2.sq, color=y, fill=status),
size=4,
data=subtrain.dt)
The plot above shows the slack in grey segments, and the decision and margin lines in black. The Bayes decision boundary is shown in the background as a vertical violet line. The support vectors are the points with non-zero alpha coefficients. Black filled support vectors are on the margin, and grey support vectors are on the wrong side of the margin (and have non-zero slack). The plot below shows the model that was learned in the original feature space.
n.grid <- 41
x.vec <- seq(-1, 1, l=n.grid)
grid.dt <- data.table(expand.grid(
x1=x.vec,
x2=x.vec))
getBoundaryDF <- function(score.vec, level.vec=c(-1, 0, 1)){
stopifnot(length(score.vec) == n.grid * n.grid)
several.paths <- contourLines(
x.vec, x.vec,
matrix(score.vec, n.grid, n.grid),
levels=level.vec)
contour.list <- list()
for(path.i in seq_along(several.paths)){
contour.list[[path.i]] <- with(several.paths[[path.i]], data.table(
path.i,
level.num=as.numeric(level),
level.fac=factor(level, level.vec),
boundary=ifelse(level==0, "decision", "margin"),
x1=x, x2=y))
}
do.call(rbind, contour.list)
}
grid.dt[, pred.f := predF(fit, cbind(x1^2, x2^2))]
boundaries <- grid.dt[, getBoundaryDF(pred.f)]
ggplot()+
scale_linetype_manual(values=c(margin="dashed", decision="solid"))+
geom_vline(aes(
xintercept=boundary),
color="violet",
data=x1.boundaries)+
geom_path(aes(
x1, x2, group=path.i, linetype=boundary),
size=1,
data=boundaries)+
scale_fill_manual(values=sv.colors, breaks=names(sv.colors))+
scale_size_manual(values=c(correct=2, error=4))+
geom_point(aes(
x1, x2, color=y,
size=train.error,
fill=status),
data=subtrain.dt)
The figure above shows the SVM in the original feature space. The level curves in black show that the learned decision function is non-linear in the original space (even though it was a linear SVM in the squared feature space). This is not a typical usage of SVM, since we would typically want to fit a non-linear kernel to the original space (which implicitly computes a linear function in a higher dimensional space that we do not need to explicitly compute). We fit the linear SVM in the squared feature space in order to demonstrate how it generates a non-linear function in the original space.
The goal of this section is to make an animint that shows how the decision boundary, margin, and slack change as a function of the cost parameter. The code below is a for loop over cost parameter values; in each iteration, we re-use the code from the previous section, with the list of data tables idiom.
modelInfo.list <- list()
predictions.list <- list()
slackSegs.list <- list()
modelLines.list <- list()
inputBoundaries.list <- list()
setErrors.list <- list()
cost.by <- 0.2
for(cost.param in round(10^seq(-1, 1, by=cost.by),1)){
fit <- ksvm(
squared.mat, y.vec, kernel="vanilladot", scaled=FALSE, C=cost.param)
fit.sc <- scaling(fit)$x.scale
if(is.null(fit.sc)){
fit.sc <- list(
"scaled:center"=c(0,0),
"scaled:scale"=c(1,1))
}
mu <- fit.sc[["scaled:center"]]
sigma <- fit.sc[["scaled:scale"]]
weight.vec <- colSums(xmatrix(fit)[[1]]*coef(fit)[[1]])
grid.sq.dt[, pred.f := predF(fit, cbind(x1.sq, x2.sq))]
data.dt[, pred.y := predict(fit, cbind(x1.sq, x2.sq))]
one.error <- data.dt[, list(errors=sum(y!=pred.y)), by=set]
setErrors.list[[paste(cost.param)]] <- data.table(
cost.param, one.error)
subtrain.dt[, pred.f := predF(fit, cbind(x1^2, x2^2))]
grid.dt[, pred.f := predF(fit, cbind(x1^2, x2^2))]
boundaries <- getBoundaryDF(grid.dt$pred.f)
inputBoundaries.list[[paste(cost.param)]] <- data.table(
cost.param, boundaries)
subtrain.dt$alpha <- 0
train.row.vec <- as.integer(rownames(xmatrix(fit)[[1]]))
subtrain.dt[train.row.vec, alpha := kernlab::alpha(fit)[[1]] ]
subtrain.dt[, status := ifelse(
alpha==0, "alpha=0",
ifelse(alpha==cost.param, "alpha=C", "0<alpha<C"))]
slack.slope <- weight.vec[2]*sigma[1]/(weight.vec[1]*sigma[2])
abline.dt <- data.table(
y=factor(c(-1,0,1)),
boundary=c("margin", "decision", "margin"),
intercept=mu[2]+sigma[2]/weight.vec[2]*(
c(-1, 0, 1)+b(fit)+weight.vec[1]*mu[1]/sigma[1]),
slope=-weight.vec[1]*sigma[2]/(weight.vec[2]*sigma[1]))
slack.dt <- subtrain.dt[alpha==cost.param]
slack.join <- abline.dt[slack.dt, on=list(y)]
slack.join[, x1.sq.margin := (
x2.sq-slack.slope*x1.sq-intercept)/(slope-slack.slope)]
slack.join[, x2.sq.margin := slope*x1.sq.margin + intercept]
norm.weights <- as.numeric(weight.vec %*% weight.vec)
modelInfo.list[[paste(cost.param)]] <- data.table(
cost.param,
slack=slack.join[, sum(1-pred.f*y.num)],
norm=norm.weights,
margin=2/sqrt(norm.weights))
predictions.list[[paste(cost.param)]] <- data.table(
cost.param, subtrain.dt)
slackSegs.list[[paste(cost.param)]] <- data.table(
cost.param, slack.join)
modelLines.list[[paste(cost.param)]] <- data.table(
cost.param, abline.dt)
}
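The code that combines the lists of data tables into single tables is not shown above; below is a minimal sketch, with one do.call(rbind, ...) per list (the resulting table names match their usage in the plotting code that follows).
modelInfo <- do.call(rbind, modelInfo.list)
setErrors <- do.call(rbind, setErrors.list)
predictions <- do.call(rbind, predictions.list)
slackSegs <- do.call(rbind, slackSegs.list)
modelLines <- do.call(rbind, modelLines.list)
inputBoundaries <- do.call(rbind, inputBoundaries.list)
modelInfo  ## print the combined model info table (output below)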
cost.param slack norm margin
1: 0.1 37.63794 0.6362061 2.5074431
2: 0.2 31.27588 2.5448243 1.2537216
---
10: 6.3 19.91634 22.5760602 0.4209262
11: 10.0 19.91682 22.5660064 0.4210199
The output above is a table with one row per cost parameter considered, and columns for different properties of the model:
- slack is the amount of hinge error.
- norm is the norm of the learned weight vector.
- margin is the width of the tube between the two margin lines.
Below we convert these data into a form more useful for plotting.
modelInfo.tall <- melt(modelInfo, id.vars="cost.param")
grid.sq.dt$boundary <- "true"
setErrors$variable <- "errors"
inputBoundaries[, boundary := ifelse(
level.num==0, "decision", "margin")]
slackSegs$boundary <- "margin"
set.label.select <- data.table(
cost.param=range(setErrors$cost.param),
set=c("validation", "subtrain"),
hjust=c(1, 0))
set.labels <- setErrors[set.label.select, on=list(cost.param, set)]
We begin the data visualization with a plot for selecting the regularization.
(gg.select.reg <- ggplot()+
ggtitle("Select regularization parameter")+
scale_x_continuous(limits=c(-1.5, 1.5))+
theme_bw()+
facet_grid(variable ~ ., scales="free")+
geom_line(aes(
log10(cost.param), errors,
group=set, color=set),
data=setErrors)+
geom_text(aes(
log10(cost.param), errors-1, label=set,
hjust=hjust,
color=set),
data=set.labels)+
guides(color="none")+
geom_line(aes(
log10(cost.param), log10(value)),
data=modelInfo.tall)+
geom_tallrect(aes(
xmin=log10(cost.param)-cost.by/2,
xmax=log10(cost.param)+cost.by/2),
clickSelects="cost.param",
alpha=0.5,
data=modelInfo))
The figure above shows a plot of various properties on the Y axis, versus the cost parameter on the X axis. Next, we create an input space plot in the code below.
gg.linear.input <- ggplot()+
theme_bw()+
ggtitle("Input space features")+
scale_fill_manual(values=sv.colors, breaks=names(sv.colors))+
geom_vline(aes(
xintercept=boundary),
color="violet",
data=x1.boundaries)+
guides(color="none", fill="none", linetype="none")+
scale_linetype_manual(values=c(
"-1"="dashed",
"0"="solid",
"1"="dashed"))+
geom_path(aes(
x1, x2,
group=path.i,
linetype=level.fac),
showSelected=c("boundary", "cost.param"),
color="black",
data=inputBoundaries)+
geom_point(aes(
x1, x2, fill=status),
showSelected=c("status", "y", "data.i", "cost.param"),
size=5,
color="black",
data=predictions)+
geom_point(aes(
x1, x2, color=y, fill=status),
showSelected=c("cost.param", "status", "y"),
clickSelects="data.i",
size=3,
data=predictions)
gg.linear.input+facet_wrap("cost.param")
The static visualization above shows different cost values in different panels (whereas in the animint they will be shown using interactivity). We create a similar plot for the squared feature space in the code below.
gg.linear.square <- ggplot()+
theme_bw()+
ggtitle("Kernel space features")+
geom_vline(aes(
xintercept=boundary), color="violet",
data=x1sq.boundary)+
##coord_equal(xlim=c(0,1), ylim=c(0,1))+
geom_abline(aes(
slope=slope, intercept=intercept, linetype=boundary),
showSelected="cost.param",
color="black",
data=modelLines)+
scale_linetype_manual(values=c(
decision="solid",
margin="dashed",
true="solid"))+
geom_segment(aes(
x1.sq, x2.sq,
xend=x1.sq.margin, yend=x2.sq.margin),
showSelected=c("cost.param", "boundary"),
color="grey",
data=slackSegs)+
geom_point(aes(
x1.sq, x2.sq, fill=status),
showSelected=c("data.i", "cost.param"),
size=5,
color="black",
data=predictions)+
geom_point(aes(
x1.sq, x2.sq, color=y, fill=status),
clickSelects="data.i",
showSelected="cost.param",
size=3,
data=predictions)+
scale_fill_manual(values=sv.colors, breaks=names(sv.colors))
gg.linear.square+facet_wrap("cost.param")
Above we see plots of the squared feature space, with one panel per cost value. Below we combine the previous ggplots into an animint.
(viz.linear.svm <- animint(
selectModel=gg.select.reg,
inputSpace=gg.linear.input+
theme_animint(last_in_row=TRUE),
kernelSpace=gg.linear.square+
theme_animint(width=700, colspan=2)))
In the previous section we fit a linear kernel in the squared feature space, which resulted in learning a function that is non-linear in the original feature space. In this section we directly fit a non-linear polynomial kernel in the original space. The code below includes for loops over two hyper-parameters: cost and polynomial kernel degree.
predictions.list <- list()
inputBoundaries.list <- list()
setErrors.list <- list()
cost.by <- 0.2
orig.mat <- subtrain.dt[, cbind(x1, x2)]
for(cost.param in 10^seq(-1, 3, by=cost.by)){
for(degree.num in seq(1, 6, by=1)){
k <- polydot(degree.num, offset=0)
fit <- ksvm(
orig.mat, y.vec, kernel=k, scaled=FALSE, C=cost.param)
grid.dt[, pred.f := predF(fit, cbind(x1, x2))]
grid.dt[, pred.y := predict(fit, cbind(x1, x2))]
grid.dt[, stopifnot(sign(pred.f) == pred.y)]
data.dt[, pred.y := predict(fit, cbind(x1, x2))]
one.error <- data.dt[, list(errors=sum(y != pred.y)), by=set]
setErrors.list[[paste(cost.param, degree.num)]] <- data.table(
cost.param, degree.num, one.error)
boundaries <- getBoundaryDF(grid.dt$pred.f)
if(is.data.frame(boundaries) && nrow(boundaries)){
cost.deg <- paste(cost.param, degree.num)
inputBoundaries.list[[cost.deg]] <- data.table(
cost.param, degree.num, boundaries)
}
subtrain.dt[, alpha := 0]
train.row.vec <- as.integer(rownames(xmatrix(fit)[[1]]))
subtrain.dt[train.row.vec, alpha := kernlab::alpha(fit)[[1]] ]
subtrain.dt[, status := ifelse(
alpha==0, "alpha=0",
ifelse(alpha==cost.param, "alpha=C", "0<alpha<C"))]
predictions.list[[paste(cost.param, degree.num)]] <- data.table(
cost.param, degree.num, subtrain.dt)
}
}
inputBoundaries <- do.call(rbind, inputBoundaries.list)
predictions <- do.call(rbind, predictions.list)
(setErrors <- do.call(rbind, setErrors.list))
cost.param degree.num set errors
1: 1e-01 1 subtrain 22
2: 1e-01 1 validation 22
---
251: 1e+03 6 subtrain 9
252: 1e+03 6 validation 9
The output above shows a table with one row per hyper-parameter combination and set (subtrain or validation). Below we visualize these data using two linked plots.
validationErrors <- setErrors[set=="validation"]
validationErrors$select <- "degree"
setErrors$select <- "cost"
animint(
selectModel=ggplot()+
ggtitle("Select hyper parameters")+
geom_tallrect(aes(
xmin=log10(cost.param)-cost.by/2,
xmax=log10(cost.param)+cost.by/2),
clickSelects="cost.param",
alpha=0.5,
data=setErrors[degree.num==1 & set=="subtrain",])+
theme_bw()+
theme(panel.margin=grid::unit(0, "lines"))+
theme_animint(width=350, rowspan=1)+
facet_grid(select ~ ., scales="free")+
ylab("")+
geom_line(aes(
log10(cost.param), errors,
key=set,
group=set,
color=set),
showSelected="degree.num",
data=setErrors)+
scale_fill_gradient("validErr", low="white", high="red")+
geom_tile(aes(
log10(cost.param), degree.num, fill=errors),
clickSelects="degree.num",
data=validationErrors),
inputSpace=ggplot()+
theme_bw()+
ggtitle("Input space features")+
scale_fill_manual(values=sv.colors, breaks=names(sv.colors))+
geom_vline(aes(
xintercept=boundary),
color="violet",
data=x1.boundaries)+
scale_linetype_manual(values=c(
margin="dashed",
decision="solid"))+
geom_path(aes(
x1, x2,
group=path.i,
linetype=boundary),
showSelected=c("degree.num", "cost.param"),
color="black",
data=inputBoundaries)+
geom_point(aes(
x1, x2, color=y, fill=status),
showSelected=c("cost.param", "degree.num"),
size=3,
data=predictions))
Above we see two linked plots: one for selecting the cost and degree hyper-parameters, and another showing the data and the learned decision boundary in the input feature space.
We used ggplots to visualize the Support Vector Machine model for binary classification. We used animint and interactivity to show how the SVM decision boundary changes as a function of the model hyper-parameters.
Exercises:
- In viz.correspondence, add X scales with breaks that emphasize the correspondence between the original and squared feature space (-0.5 and 0.5 squared make 0.25, etc).
- Re-arrange viz.linear.svm so that the two feature space plots appear beside each other, and the “Select regularization parameter” plot appears above or below.
- In viz.linear.svm, change color scales so that subtrain/validation in the Select regularization parameter plot have different colors from y in the other plots.
- In viz.linear.svm, add coord_equal(xlim=c(0,1), ylim=c(0,1)) and observe that the grey slack line segments are no longer correctly represented, because some values beyond the limits are clipped to the limits. Fix this by doing the clipping yourself, in a new data set for the geom_segment() in which the values are inside the limits, and correctly represented. Hint: use linear equations, solve for x=0 or y=0 to find the (x,y) coordinates of the points on the limits.
- Fit another model using rbfdot as the kernel function. Compute subtrain and validation error, then add a new panel to the “select hyper parameters” plot.
- Use the color and color_off parameters to change the appearance of the geom_tile() when selected or not, as explained in Chapter 6, section Specifying how selection state is displayed.
Next, Chapter 13 explains how to visualize the Poisson regression model.