16 Supervised change-point detection

In this chapter we will explore several data visualizations of supervised changepoint detection models.

Chapter outline:

We begin by making several static visualizations of the intreg data set.
We then create an interactive visualization in which one plot can be click to select the number of changepoints/segments, and the other plot shows the corresponding model.
We end by showing a static visualization of the max margin linear regression model, and suggesting exercises about creating an interactive version.

16.1 Static figures

We begin by loading the intreg data set.

library(animint2)
data(intreg)
library(data.table)
lapply(intreg, function(df)data.table(df)[1:2])

$model
         line      min.L    max.L min.feature max.feature
1: regression -1.4001088 2.139100   -2.476391   -1.584656
2:      limit -0.1493744 3.389835   -2.476391   -1.584656

$annotations
   signal first.base last.base   annotation  logratio
1:   11.2   55103411 161558770 0breakpoints 0.9847716
2:    4.2  140080934 201712984  1breakpoint 0.9847716

$intervals
   signal   feature       min.L    max.L
1:    4.2 -2.152421 -1.36503599 1.136433
2:   11.2 -1.797948  0.04183316      Inf

$selection
   signal     min.L     max.L segments cost
1:    4.2      -Inf -3.566634       20    2
2:    4.2 -3.566634 -3.301154       19    2

$segments
   signal segments first.base last.base        mean
1:    4.2        1    1472476 242801018 -0.02092153
2:    4.2        2    1472476  45164626  0.35123108

$breaks
   signal      base segments
1:    4.2  45164626        2
2:    4.2 114042112        3

$signals
   signal    base  logratio
1:    4.2 1472476 0.4404207
2:    4.2 2063049 0.4594316

As shown above, it is a named list of 7 related data.frames. We begin our exploration of these data by plotting the signals in separate facets.

data.color <- "grey50"
gg.signals <- ggplot()+
  theme_bw()+
  facet_grid(signal ~ ., scales="free")+
  geom_point(aes(
    base/1e6, logratio,
    showSelected="signal"),
    color=data.color,
    data=intreg$signals)
gg.signals

Each data point plotted above shows an approximate measurement of DNA copy number (logratio), as a function of base position on a chromosome. Such data come from high-throughput assays which are important for diagnosing certain types of cancer such as neuroblastoma.

An important part of the diagnosis is detecting “breakpoints” or abrupt changes, within a given chromosome (panel). It is clear from the plot above that there are several breakpoint in these data. In particular signal 4.2 appears to have three breakpoints, signal 4.3 appears to have one, etc. In fact these data come from medical doctors at the Institute Curie (Paris, France) who have visually annotated regions with and without breakpoints. These data are available as intreg$annotations and are plotted below.

breakpoint.colors <- c("1breakpoint"="#ff7d7d", "0breakpoints"='#f6f4bf')
gg.ann <- gg.signals+
  scale_fill_manual(values=breakpoint.colors)+
  geom_tallrect(aes(
    xmin=first.base/1e6, xmax=last.base/1e6,
    fill=annotation),
    color="grey",
    alpha=0.5,
    data=intreg$annotations)
gg.ann

The plot above shows yellow regions where the doctors have determined that there are no significant breakpoints, and red regions where there is one breakpoint. The goal in analyzing these data is to learn from the limited labeled data (colored regions) and provide consistent breakpoint predictions throughout (even in un-labeled regions).

In order to detect these breakpoints we have fit some maximum likelihood segmentation models, using the efficient algorithm implemented in jointseg::Fpsn. The segment means are available in intreg$segments and the predicted breakpoints are available in intreg$breaks. For each signal there is a sequence of models from 1 to 20 segments. First let’s zoom in on one signal:

sig.name <- "4.2"
show.segs <- 7
sig.labels <- subset(intreg$annotations, signal==sig.name)
gg.one <- ggplot()+
  theme_bw()+
  theme(panel.margin=grid::unit(0, "lines"))+
  geom_tallrect(aes(
    xmin=first.base/1e6, xmax=last.base/1e6,
    fill=annotation),
    color="grey",
    alpha=0.5,
    data=sig.labels)+
  geom_point(aes(
    base/1e6, logratio),
    color=data.color,
    data=subset(intreg$signals, signal==sig.name))+
  scale_fill_manual(values=breakpoint.colors)
gg.one

We plot some of these models for one of the signals below:

sig.segs <- data.table(
  intreg$segments)[signal == sig.name & segments <= show.segs]
sig.breaks <- data.table(
  intreg$breaks)[signal == sig.name & segments <= show.segs]
model.color <- "green"
gg.models <- gg.one+
  facet_grid(segments ~ .)+
  geom_segment(aes(
    first.base/1e6, mean,
    xend=last.base/1e6, yend=mean),
    color=model.color,
    data=sig.segs)+
  geom_vline(aes(
    xintercept=base/1e6),
    color=model.color,
    linetype="dashed",
    data=sig.breaks)
gg.models

The plot above shows the maximum likelihood segmentation models in green (from one to six segments). Below we use the penaltyLearning::labelError function to compute the label error, which quantifies which models agree with which labels.

sig.models <- data.table(segments=1:show.segs, signal=sig.name)
sig.errors <- penaltyLearning::labelError(
  sig.models, sig.labels, sig.breaks,
  change.var="base",
  label.vars=c("first.base", "last.base"),
  model.vars="segments",
  problem.vars="signal")

Registered S3 methods overwritten by 'ggplot2':
  method                   from    
  drawDetails.zeroGrob     animint2
  grobHeight.absoluteGrob  animint2
  grobHeight.zeroGrob      animint2
  grobWidth.absoluteGrob   animint2
  grobWidth.zeroGrob       animint2
  grobX.absoluteGrob       animint2
  grobY.absoluteGrob       animint2
  heightDetails.titleGrob  animint2
  heightDetails.zeroGrob   animint2
  makeContext.dotstackGrob animint2
  print.ggplot2_bins       animint2
  print.rel                animint2
  widthDetails.titleGrob   animint2
  widthDetails.zeroGrob    animint2

The sig.errors$label.errors data.table contains one row for every (model,label) combination. The status column can be used to show the label error: false negative for too few changes, false positive for too many changes, or correct for the right number of changes.

gg.models+
  geom_tallrect(aes(
    xmin=first.base/1e6, xmax=last.base/1e6,
    linetype=status),
    data=sig.errors$label.errors,
    color="black",
    size=1,
    fill=NA)+
  scale_linetype_manual(
    "error type",
    values=c(
      correct=0,
      "false negative"=3,
      "false positive"=1))

Looking at the label error plot above, it is clear that the model with four segments should be selected, because it achieves zero label errors. There are a number of criteria that can be used to select which one of these models is best. One way to do that is by selecting the model with $s$ segments is $S^*(\lambda)=L_s + \lambda*s$, where $L_s$ is the total loss of the model with $s$ segments, and $\lambda$ is a non-negative penalty. In the plot below we show the model selection function $S^*(\lambda)$ for this data set:

sig.selection <- data.table(
  intreg$selection)[signal == sig.name & segments <= show.segs]
gg.selection <- ggplot()+
  theme_bw()+
  geom_segment(aes(
    min.L, segments,
    xend=max.L, yend=segments),
    data=sig.selection)+
  xlab("log(lambda)")
gg.selection

It is clear from the plot above that the model selection function is decreasing. In the next section we make an interactive version of these two plots where we can actually click on the model selection plot in order to select the model.

16.2 Interactive figures for one signal

We will create an interactive figure for one signal by adding a geom_tallrect with clickSelects=segments to the plot above:

interactive.selection <- gg.selection+
  geom_tallrect(aes(
    xmin=min.L, xmax=max.L),
    clickSelects="segments",
    data=sig.selection,
    color=NA,
    fill="black",
    alpha=0.5)
interactive.selection

We will combine that with the non-facetted version of the data/models plot below, in which we have added showSelected=segments to the model geoms:

interactive.models <- gg.one+
  geom_segment(aes(
    first.base/1e6, mean,
    xend=last.base/1e6, yend=mean),
    showSelected="segments",
    color=model.color,
    data=sig.segs)+
  geom_vline(aes(
    xintercept=base/1e6),
    showSelected="segments",
    color=model.color,
    linetype="dashed",
    data=sig.breaks)+
  geom_tallrect(aes(
    xmin=first.base/1e6, xmax=last.base/1e6,
    linetype=status),
    showSelected="segments",
    data=sig.errors$label.errors,
    size=2,
    color="black",
    fill=NA)+
  scale_linetype_manual(
    "error type",
    values=c(
      correct=0,
      "false negative"=3,
      "false positive"=1))
interactive.models

Of course the plot above is not very informative because it is not interactive. Below we combine the two interactive ggplots in a single linked animint:

animint(
  models=interactive.models+
    ggtitle("Selected model"),
  selection=interactive.selection+
    ggtitle("Click to select number of segments"))

Note that in the data viz above the model with 6 segments is not selectable for any value of lambda, so there is no way to click on the plot to select that model. However it is possible to select the model using the segments selection menu (click “Show selection menus” at the bottom of the data viz).

16.3 Static max margin regression plot

Another part of this data set is intreg$intervals which has one row for every signal. The columns min.L and max.L indicate the min/max values of the target interval, which is the largest range of log(penalty) values with minimum label errors. Below we plot this interval as a function of a feature of the data (log number of data points):

gg.intervals <- ggplot()+
  geom_segment(aes(
    feature, min.L,
    xend=feature, yend=max.L),
    size=2,
    data=intreg$intervals)+
  geom_text(aes(
    feature, min.L, label=signal,
    color=ifelse(signal==sig.name, "black", "grey50")),
    vjust=1,
    data=intreg$intervals)+
  scale_color_identity()+
  ylab("output log(lambda)")+
  xlab("input feature x")
gg.intervals

The target intervals in the plot above denote the region of log(lambda) space that will select a model with minimum label errors. There is one interval for each signal; we made an animint in the previous section for the signal indicated in black text. Machine learning algorithms can be used to find a penalty function that intersects each of the intervals, and maximizes the margin (the distance between the regression function and the nearest interval limit). Data for the linear max margin regression function are in intreg$model which is shown in the plot below:

gg.mm <- gg.intervals+
  geom_segment(aes(
    min.feature, min.L,
    xend=max.feature, yend=max.L,
    linetype=line),
    color="red",
    size=1,
    data=intreg$model)+
  scale_linetype_manual(
    values=c(
      regression="solid",
      margin="dotted",
      limit="dashed"))
gg.mm

The plot above shows the linear max margin regression function f(x) as the solid red line. It is clear that it intersects each of the black target intervals, and maximizes the margin (red vertical dotted lines). For more information on the subject of supervised changepoint detection, please see my useR 2017 tutorial.

Now that you know how to visualize each of the seven parts of the intreg data set, the rest of the chapter is devoted to exercises.

16.4 Chapter summary and exercises

Exercises:

Add a geom_text which shows the currently selected signal name at the top of the plot, in interactive.models in the first animint above.
Make an animint with two plots that shows the data set that corresponds to each interval on the max margin regression plot. One plot should show an interactive version of the max margin regression plot where you can click on an interval to select a signal. The other plot should show the data set for the currently selected signal.
In the animint you created in the previous exercise, add a third plot with the model selection function for the currently selected signal.
Re-design the previous animint so that instead of using a third plot, add a facet to the max margin regression regression plot such that the log(lambda) axes are aligned. Add another facet that shows the number of incorrect labels (intreg$selection$cost) for each log(lambda) value.
Add geoms for selecting the number of segments. Clicking the model selection plot should select the number of segments, which should update the displayed model and label errors on the plot of the data for the currently selected signal. Furthermore add a visual indication of the selected model to the max margin regression plot. The result should look something like this.
Make another data viz by starting with the facetted gg.signals plot in the beginning of this chapter. Add a plot that can be used to select the number of segments for each signal. For each signal in the facetted plot of the data, show the currently selected model for that signal (there should be a separate selection variable for each signal – you can use named clickSelects/showSelected as explained in Chapter 14). The result should look something like this.

Next, Chapter 17 explains how to visualize the K-means clustering algorithm.

# Supervised change-point detection ```{r setup, echo=FALSE} knitr::opts_chunk$set(fig.path="Ch16-figures/") ``` In this chapter we will explore several data visualizations of supervised changepoint detection models. Chapter outline: * We begin by making several static visualizations of the `intreg` data set. * We then create an interactive visualization in which one plot can be click to select the number of changepoints/segments, and the other plot shows the corresponding model. * We end by showing a static visualization of the max margin linear regression model, and suggesting exercises about creating an interactive version. ## Static figures {#static-change-point} We begin by loading the `intreg` data set. ```{r intreg} library(animint2) data(intreg) library(data.table) lapply(intreg, function(df)data.table(df)[1:2]) ``` As shown above, it is a named list of 7 related data.frames. We begin our exploration of these data by plotting the signals in separate facets. ```{r signals} data.color <- "grey50" gg.signals <- ggplot()+ theme_bw()+ facet_grid(signal ~ ., scales="free")+ geom_point(aes( base/1e6, logratio, showSelected="signal"), color=data.color, data=intreg$signals) gg.signals ``` Each data point plotted above shows an approximate measurement of DNA copy number (logratio), as a function of base position on a chromosome. Such data come from high-throughput assays which are important for diagnosing certain types of cancer such as neuroblastoma. An important part of the diagnosis is detecting "breakpoints" or abrupt changes, within a given chromosome (panel). It is clear from the plot above that there are several breakpoint in these data. In particular signal 4.2 appears to have three breakpoints, signal 4.3 appears to have one, etc. In fact these data come from medical doctors at the Institute Curie (Paris, France) who have visually annotated regions with and without breakpoints. These data are available as `intreg$annotations` and are plotted below. ```{r annotations} breakpoint.colors <- c("1breakpoint"="#ff7d7d", "0breakpoints"='#f6f4bf') gg.ann <- gg.signals+ scale_fill_manual(values=breakpoint.colors)+ geom_tallrect(aes( xmin=first.base/1e6, xmax=last.base/1e6, fill=annotation), color="grey", alpha=0.5, data=intreg$annotations) gg.ann ``` The plot above shows yellow regions where the doctors have determined that there are no significant breakpoints, and red regions where there is one breakpoint. The goal in analyzing these data is to learn from the limited labeled data (colored regions) and provide consistent breakpoint predictions throughout (even in un-labeled regions). In order to detect these breakpoints we have fit some maximum likelihood segmentation models, using the efficient algorithm implemented in `jointseg::Fpsn`. The segment means are available in `intreg$segments` and the predicted breakpoints are available in `intreg$breaks`. For each signal there is a sequence of models from 1 to 20 segments. First let's zoom in on one signal: ```{r one} sig.name <- "4.2" show.segs <- 7 sig.labels <- subset(intreg$annotations, signal==sig.name) gg.one <- ggplot()+ theme_bw()+ theme(panel.margin=grid::unit(0, "lines"))+ geom_tallrect(aes( xmin=first.base/1e6, xmax=last.base/1e6, fill=annotation), color="grey", alpha=0.5, data=sig.labels)+ geom_point(aes( base/1e6, logratio), color=data.color, data=subset(intreg$signals, signal==sig.name))+ scale_fill_manual(values=breakpoint.colors) gg.one ``` We plot some of these models for one of the signals below: ```{r models} sig.segs <- data.table( intreg$segments)[signal == sig.name & segments <= show.segs] sig.breaks <- data.table( intreg$breaks)[signal == sig.name & segments <= show.segs] model.color <- "green" gg.models <- gg.one+ facet_grid(segments ~ .)+ geom_segment(aes( first.base/1e6, mean, xend=last.base/1e6, yend=mean), color=model.color, data=sig.segs)+ geom_vline(aes( xintercept=base/1e6), color=model.color, linetype="dashed", data=sig.breaks) gg.models ``` The plot above shows the maximum likelihood segmentation models in green (from one to six segments). Below we use the `penaltyLearning::labelError` function to compute the label error, which quantifies which models agree with which labels. ```{r labelerr} sig.models <- data.table(segments=1:show.segs, signal=sig.name) sig.errors <- penaltyLearning::labelError( sig.models, sig.labels, sig.breaks, change.var="base", label.vars=c("first.base", "last.base"), model.vars="segments", problem.vars="signal") ``` The `sig.errors$label.errors` data.table contains one row for every (model,label) combination. The `status` column can be used to show the label error: `false negative` for too few changes, `false positive` for too many changes, or `correct` for the right number of changes. ```{r plotlabelerr} gg.models+ geom_tallrect(aes( xmin=first.base/1e6, xmax=last.base/1e6, linetype=status), data=sig.errors$label.errors, color="black", size=1, fill=NA)+ scale_linetype_manual( "error type", values=c( correct=0, "false negative"=3, "false positive"=1)) ``` Looking at the label error plot above, it is clear that the model with four segments should be selected, because it achieves zero label errors. There are a number of criteria that can be used to select which one of these models is best. One way to do that is by selecting the model with $s$ segments is $S^*(\lambda)=L_s + \lambda*s$, where $L_s$ is the total loss of the model with $s$ segments, and $\lambda$ is a non-negative penalty. In the plot below we show the model selection function $S^*(\lambda)$ for this data set: ```{r selection} sig.selection <- data.table( intreg$selection)[signal == sig.name & segments <= show.segs] gg.selection <- ggplot()+ theme_bw()+ geom_segment(aes( min.L, segments, xend=max.L, yend=segments), data=sig.selection)+ xlab("log(lambda)") gg.selection ``` It is clear from the plot above that the model selection function is decreasing. In the next section we make an interactive version of these two plots where we can actually click on the model selection plot in order to select the model. ## Interactive figures for one signal {#interactive-one} We will create an interactive figure for one signal by adding a `geom_tallrect` with `clickSelects=segments` to the plot above: ```{r selectionClick} interactive.selection <- gg.selection+ geom_tallrect(aes( xmin=min.L, xmax=max.L), clickSelects="segments", data=sig.selection, color=NA, fill="black", alpha=0.5) interactive.selection ``` We will combine that with the non-facetted version of the data/models plot below, in which we have added `showSelected=segments` to the model geoms: ```{r selectionData} interactive.models <- gg.one+ geom_segment(aes( first.base/1e6, mean, xend=last.base/1e6, yend=mean), showSelected="segments", color=model.color, data=sig.segs)+ geom_vline(aes( xintercept=base/1e6), showSelected="segments", color=model.color, linetype="dashed", data=sig.breaks)+ geom_tallrect(aes( xmin=first.base/1e6, xmax=last.base/1e6, linetype=status), showSelected="segments", data=sig.errors$label.errors, size=2, color="black", fill=NA)+ scale_linetype_manual( "error type", values=c( correct=0, "false negative"=3, "false positive"=1)) interactive.models ``` Of course the plot above is not very informative because it is not interactive. Below we combine the two interactive ggplots in a single linked animint: ```{r interactiveOne} animint( models=interactive.models+ ggtitle("Selected model"), selection=interactive.selection+ ggtitle("Click to select number of segments")) ``` Note that in the data viz above the model with 6 segments is not selectable for any value of lambda, so there is no way to click on the plot to select that model. However it is possible to select the model using the segments selection menu (click "Show selection menus" at the bottom of the data viz). ## Static max margin regression plot {#max-margin} Another part of this data set is `intreg$intervals` which has one row for every signal. The columns `min.L` and `max.L` indicate the min/max values of the target interval, which is the largest range of log(penalty) values with minimum label errors. Below we plot this interval as a function of a feature of the data (log number of data points): ```{r intervals} gg.intervals <- ggplot()+ geom_segment(aes( feature, min.L, xend=feature, yend=max.L), size=2, data=intreg$intervals)+ geom_text(aes( feature, min.L, label=signal, color=ifelse(signal==sig.name, "black", "grey50")), vjust=1, data=intreg$intervals)+ scale_color_identity()+ ylab("output log(lambda)")+ xlab("input feature x") gg.intervals ``` The target intervals in the plot above denote the region of log(lambda) space that will select a model with minimum label errors. There is one interval for each signal; we made an animint in the previous section for the signal indicated in black text. Machine learning algorithms can be used to find a penalty function that intersects each of the intervals, and maximizes the margin (the distance between the regression function and the nearest interval limit). Data for the linear max margin regression function are in `intreg$model` which is shown in the plot below: ```{r maxMargin} gg.mm <- gg.intervals+ geom_segment(aes( min.feature, min.L, xend=max.feature, yend=max.L, linetype=line), color="red", size=1, data=intreg$model)+ scale_linetype_manual( values=c( regression="solid", margin="dotted", limit="dashed")) gg.mm ``` The plot above shows the linear max margin regression function f(x) as the solid red line. It is clear that it intersects each of the black target intervals, and maximizes the margin (red vertical dotted lines). For more information on the subject of supervised changepoint detection, please see [my useR 2017 tutorial](https://github.com/tdhock/change-tutorial). Now that you know how to visualize each of the seven parts of the `intreg` data set, the rest of the chapter is devoted to exercises. ## Chapter summary and exercises {#Ch16-exercises} Exercises: * Add a `geom_text` which shows the currently selected signal name at the top of the plot, in `interactive.models` in the first animint above. * Make an animint with two plots that shows the data set that corresponds to each interval on the max margin regression plot. One plot should show an interactive version of the max margin regression plot where you can click on an interval to select a signal. The other plot should show the data set for the currently selected signal. * In the animint you created in the previous exercise, add a third plot with the model selection function for the currently selected signal. * Re-design the previous animint so that instead of using a third plot, add a facet to the max margin regression regression plot such that the log(lambda) axes are aligned. Add another facet that shows the number of incorrect labels (`intreg$selection$cost`) for each log(lambda) value. * Add geoms for selecting the number of segments. Clicking the model selection plot should select the number of segments, which should update the displayed model and label errors on the plot of the data for the currently selected signal. Furthermore add a visual indication of the selected model to the max margin regression plot. The result should look something [like this](https://rcdata.nau.edu/genomic-ml/animint-gallery/2016-01-28-Max-margin-interval-regression-for-supervised-segmentation-model-selection/index.html). * Make another data viz by starting with the facetted `gg.signals` plot in the beginning of this chapter. Add a plot that can be used to select the number of segments for each signal. For each signal in the facetted plot of the data, show the currently selected model for that signal (there should be a separate selection variable for each signal -- you can use named clickSelects/showSelected as explained in [Chapter 14](../Ch14/Ch14-PeakSegJoint.html)). The result should look something [like this](https://rcdata.nau.edu/genomic-ml/animint-gallery/2016-11-10-Max-margin-supervised-penalty-learning-for-peak-detection-in-ChIP-seq-data/index.html). Next, [Chapter 17](../Ch17/Ch17-k-means-clustering.html) explains how to visualize the K-means clustering algorithm.