14  Named clickSelects and showSelected

This chapter explains how to use named clickSelects and showSelected variables for creating data-driven selector names. This feature makes it easier to write animint code, and makes it faster to compile.

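Chapter outline:

  • We begin by downloading the PSJ data set.
  • We explore the data with static ggplots.
  • We combine the ggplots into an incomplete interactive data visualization.
  • We add the remaining interactivity using for loops.
  • We add the same interactivity using named clickSelects and showSelected.
  • We compare the disk usage of the two approaches.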

14.1 Download data set

The example data come from the PeakSegJoint package, which is for peak detection in genomic data sequences. The code below downloads the data set.

if(!requireNamespace("animint2data"))
  remotes::install_github("animint/animint2data")
Loading required namespace: animint2data
data(PSJ, package="animint2data")
sapply(PSJ, class)
           problem.labels                  problems            filled.regions 
             "data.frame"              "data.frame"              "data.frame" 
                 coverage                     first modelSelection.by.problem 
             "data.frame"                    "list"                    "list" 
       regions.by.problem          peaks.by.problem         error.total.chunk 
                   "list"                    "list"              "data.frame" 
          error.total.all 
             "data.frame" 

Above we see that PSJ is a list of several lists and data frames.

14.2 Explore PSJ data with static ggplots

We begin with a plot of some genomic ChIP-seq data, which are sequential data that take large values when there is an active area (typically actively transcribed genes). In the code below we show each sample in a separate panel.

library(animint2)
ann.colors <- c(
  noPeaks="#f6f4bf",
  peakStart="#ffafaf",
  peakEnd="#ff4c4c",
  peaks="#a445ee")
(gg.cov <- ggplot()+
  scale_y_continuous(
    "aligned read coverage",
    breaks=function(limits){
      floor(limits[2])
    })+
  scale_x_continuous(
    "position on chr11 (kilo bases = kb)")+
  coord_cartesian(xlim=c(118167.406, 118238.833))+
  geom_tallrect(aes(
    xmin=chromStart/1e3, xmax=chromEnd/1e3,
    fill=annotation),
    alpha=0.5,
    color="grey",
    data=PSJ$filled.regions)+
  scale_fill_manual(values=ann.colors)+
  theme_bw()+
  theme(panel.margin=grid::unit(0, "cm"))+
  facet_grid(sample.id ~ ., labeller=function(df){
    df$sample.id <- sub("McGill0", "", sub(" ", "\n", df$sample.id))
    df
  }, scales="free")+
  geom_line(aes(
    base/1e3, count),
    data=PSJ$coverage,
    color="grey50"))

The figure above shows the raw noisy data as a grey geom_line(). Colored rectangles represent labels that indicate whether or not a peak start or end should be predicted in a given region and sample. Next, we add a panel for segmentation problems, in which an algorithm looked for a common peak across samples. The algorithm predicts start/end positions for a peak of large data values, in each problem. The code below computes a table with one row for each such problem.

library(data.table)
(show.problems <- data.table(PSJ$problems)[
, y := problem.i/max(problem.i), by=bases.per.problem][])
     sample.id bases.per.problem problem.i              problem.name
  1:  problems              1629         1 chr11:118172963-118174591
  2:  problems              1629         2 chr11:118175144-118176570
 ---                                                                
203:  problems            104267         1 chr11:118125367-118194817
204:  problems            104267         2 chr11:118194818-118266077
     problemStart problemEnd          y
  1:    118172963  118174591 0.01818182
  2:    118175144  118176570 0.03636364
 ---                                   
203:    118125367  118194817 0.50000000
204:    118194818  118266077 1.00000000

The code above added the y column which is used to display the problems in the code below.

(gg.cov.prob <- gg.cov+
  ggtitle("select problem")+
  geom_text(aes(
    chromStart/1e3, 0.9,
    label=sprintf(
      "%d problems mean size %.1f kb",
      problems, mean.bases/1e3)),
    showSelected="bases.per.problem",
    data=PSJ$problem.labels,
    hjust=0)+
  geom_segment(aes(
    problemStart/1e3, y,
    xend=problemEnd/1e3, yend=y),
    showSelected="bases.per.problem",
    clickSelects="problem.name",
    size=5,
    data=show.problems))

Above we see that a problems panel was added at the bottom of the previous plot. The static graphic above is overplotted; the interactive version will be readable because it will only show one value of bases.per.problem at a time. To select the different values of bases.per.problem (problem size), we will use another plot, which shows the best error rate for each problem size, as in the data below.

(res.error <- data.table(PSJ$error.total.chunk))
    bases.per.problem fp fn errors regions min.bases.per.problem
 1:              1629  1  4      5      36              1349.501
 2:              2304  0  2      2      36              1908.686
---                                                             
12:             73728  0 12     12      36             61077.960
13:            104267  0 12     12      36             86377.166
    max.bases.per.problem
 1:              1966.387
 2:              2781.188
---                      
12:             88998.027
13:            125862.051

The table above has one row per value of bases.per.problem, a sliding window size parameter that we will explore with interactivity. We use these data to draw the plot below.

(gg.res.error <- ggplot()+
  ggtitle("select problem size")+
  ylab("minimum incorrectly predicted labels")+
  geom_line(aes(
    bases.per.problem, errors),
    data=res.error)+
  geom_tallrect(aes(
    xmin=min.bases.per.problem,
    xmax=max.bases.per.problem),
    clickSelects="bases.per.problem",
    alpha=0.5,
    data=res.error)+
  scale_x_log10())

The figure above shows the minimum number of label errors as a function of problem size. Grey rectangles will be used to select the problem size.
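To make the roles of the two parameters concrete, here is a minimal base R sketch, which is not how animint2 is implemented. The idea is that clicking a rectangle sets the value of the bases.per.problem selector, and every layer with showSelected="bases.per.problem" then draws only the rows that match that value. The helper show_selected() and the toy table are hypothetical, and ave() stands in for the grouped data.table assignment used earlier to compute y.

```r
# Toy problems table (made-up values): two problem sizes.
toy.problems <- data.frame(
  bases.per.problem=c(1000, 1000, 1000, 5000, 5000),
  problem.i=c(1, 2, 3, 1, 2))
# Vertical position within each problem size, as in show.problems.
toy.problems$y <- ave(
  toy.problems$problem.i, toy.problems$bases.per.problem,
  FUN=function(i) i/max(i))
# Hypothetical helper (not part of animint2): keep the rows that match
# the current value of every selector in the selection list.
show_selected <- function(data, selection){
  keep <- rep(TRUE, nrow(data))
  for(selector.name in names(selection)){
    keep <- keep & data[[selector.name]] == selection[[selector.name]]
  }
  data[keep, , drop=FALSE]
}
# Simulate a click on the rectangle for problem size 5000.
shown <- show_selected(toy.problems, list(bases.per.problem=5000))
print(shown)
```

Only the two rows for problem size 5000 remain, which is why the interactive version of the problems panel is not overplotted.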

There is a penalty parameter which controls the number of samples with a common peak, as defined in the model selection data table in the code below.

pdot <- function(L){
  out_list <- list()
  for(i in seq_along(L)){
    out_list[[i]] <- data.table(
      problem.dot=names(L)[[i]], L[[i]])
  }
  rbindlist(out_list)
}
(all.modelSelection <- pdot(PSJ$modelSelection.by.problem))
                        problem.dot bases.per.problem problem.i
  1: chr11.118172963.118174591peaks              1629         1
  2: chr11.118172963.118174591peaks              1629         1
 ---                                                           
979: chr11.118194818.118266077peaks            104267         2
980: chr11.118194818.118266077peaks            104267         2
                  problem.name problemStart problemEnd min.log.lambda
  1: chr11:118172963-118174591    118172963  118174591           -Inf
  2: chr11:118172963-118174591    118172963  118174591        4.56662
 ---                                                                 
979: chr11:118194818-118266077    118194818  118266077       12.13670
980: chr11:118194818-118266077    118194818  118266077       12.71684
     max.log.lambda model.complexity peaks   min.lambda   max.lambda errors
  1:        4.56662                4     4      0.00000     96.21835     NA
  2:        4.68868                3     3     96.21835    108.70956     NA
 ---                                                                       
979:       12.71684                1     1 186596.24134 333312.50992     14
980:            Inf                0     0 333312.50992          Inf     16

The code above uses the pdot() function, which uses the list of data tables idiom to add a column named problem.dot, which will be used below to define selectors in the interactive visualization. Below we plot the number of peaks and label errors, as a function of the algorithm's penalty parameter.

long.modelSelection <- melt(
  data.table(all.modelSelection)[, errors := as.numeric(errors)],
  measure.vars=c("peaks","errors"))
log.lambda.range <- all.modelSelection[, c(
  min(max.log.lambda), max(min.log.lambda))]
modelSelection.labels <- unique(all.modelSelection[, data.table(
  problem.name,
  bases.per.problem,
  problemStart,
  problemEnd,
  log.lambda=mean(log.lambda.range),
  peaks=max(peaks)+0.5)])
(gg.model.selection <- ggplot()+
   scale_x_continuous("log(penalty)")+
   geom_segment(aes(
     min.log.lambda, value,
     xend=max.log.lambda, yend=value),
     showSelected=c("bases.per.problem", "problem.name"),
     data=long.modelSelection,
     size=5)+
   geom_text(aes(
     log.lambda, peaks,
     label=sprintf(
       "%.1f kb in problem %s",
       (problemEnd-problemStart)/1e3, problem.name)),
     showSelected=c("bases.per.problem", "problem.name"),
     data=data.frame(modelSelection.labels, variable="peaks"))+
   ylab("")+
   facet_grid(variable ~ ., scales="free"))
Warning: Removed 717 rows containing missing values (geom_segment).

The figure above shows errors (top panel) and peaks (bottom panel) as a function of log(penalty). Again this static version is overplotted; interactivity will be used so that this figure is readable (only shows the subset of data corresponding to the selected values of bases.per.problem and problem.name).
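One detail of the plotting code above deserves a closer look: log.lambda.range is built from the smallest max.log.lambda and the largest min.log.lambda. In the table printed earlier, the first penalty interval starts at -Inf and the last one ends at Inf, so this construction keeps only finite end points and gives the text labels a finite x position. The sketch below uses made-up interval limits to show the difference.

```r
# Made-up penalty intervals: (-Inf, 4.5], (4.5, 9], (9, Inf).
min.log.lambda <- c(-Inf, 4.5, 9)
max.log.lambda <- c(4.5, 9, Inf)
# Same construction as in the chapter: finite end points only.
log.lambda.range <- c(min(max.log.lambda), max(min.log.lambda))
label.x <- mean(log.lambda.range)
# The full range is infinite, so its midpoint is not a number.
naive.x <- mean(range(c(min.log.lambda, max.log.lambda)))
print(log.lambda.range)
print(c(label.x=label.x, naive.x=naive.x))
```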

14.3 Interactive data visualization (incomplete)

In this section, we combine the ggplots from the previous section into a linked interactive data visualization. The code below uses theme_animint() to attach some display options to the previous coverage plot, and adds the first option to specify what data subsets should be displayed first.

timing.incomplete.construct <- system.time({
  coverage.counts <- table(PSJ$coverage$sample.id)
  facet.rows <- length(coverage.counts)+1
  viz.incomplete <- animint(
    first=list(
      bases.per.problem=6516,
      problem.name="chr11:118174946-118177139"),
    coverage=gg.cov.prob+theme_animint(
      last_in_row=TRUE, colspan=2,
      width=800, height=facet.rows*100),
    resError=gg.res.error,
    modelSelection=gg.model.selection)
})

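The code above saves the construction time in timing.incomplete.construct, which is reported in the summary table at the end of this chapter; the initial definition is fast. Rendering this preliminary (incomplete) data viz in the code below takes a few seconds.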

before.incomplete <- Sys.time()
viz.incomplete
elapsed.incomplete <- difftime(
  Sys.time(), before.incomplete, units="secs")
cat(elapsed.incomplete, "seconds\n")
7.648319 seconds

We see a data visualization above with three plots.

  • On top, four ChIP-seq data profiles are shown, along with a problems panel which divides the X axis into problems in which the segmentation algorithm runs.
  • Clicking the bottom left plot selects problem size, which updates the problems displayed on top.
  • The bottom right plot shows the number of peaks and errors as a function of penalty (larger for fewer peaks).
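The compile times in this chapter are measured by subtracting two Sys.time() values, which gives a difftime object. A detail to keep in mind, sketched below with made-up time stamps, is that the subtraction picks its units automatically (seconds below one minute, minutes below one hour), whereas difftime() with units="secs" always reports seconds.

```r
# Made-up time stamps that are 68.54 seconds apart.
start <- as.POSIXct("2024-01-01 00:00:00", tz="UTC")
end <- start + 68.54
# Subtraction chooses units automatically: minutes in this case.
auto.diff <- end - start
# Requesting seconds explicitly avoids surprises when printing.
secs.diff <- difftime(end, start, units="secs")
print(units(auto.diff))
print(round(as.numeric(secs.diff), 2))
```

Printing a difftime with cat() shows only the number, so fixing the units is a simple way to keep a hard-coded "seconds" label accurate.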

14.4 Add interactivity using for loops

In this section, we add layers to the previous ggplots using a for loop, which is sub-optimal, but we show it for comparison with the better approach (named clickSelects and showSelected) which is presented in the next section. The visualization in the previous section is incomplete. We would like to add

  • rectangles in the bottom left plot which would allow us to select the number of peaks predicted in the given problem.
  • segments and rectangles in the top plot which would show the predicted peaks and label errors.

One (inefficient) way of adding those would be via a for loop, which is coded below. For every problem there is a selector (called problem.dot) for the number of peaks in that problem. So in this for loop we add a few layers with clickSelects="problem.dot" or showSelected="problem.dot" to the coverage and modelSelection plots.

viz.first <- viz.incomplete
viz.first$first <- c(viz.incomplete$first, PSJ$first)
viz.first$modelSelection <- viz.first$modelSelection+
  ggtitle("select number of samples with peak in problem")
print(timing.for.construct <- system.time({
  viz.for <- viz.first
  viz.for$title <- "PSJ with for loops"
  for(problem.dot in names(PSJ$modelSelection.by.problem)){
    if(problem.dot %in% names(PSJ$peaks.by.problem)){
      peaks <- PSJ$peaks.by.problem[[problem.dot]]
      peaks[[problem.dot]] <- peaks$peaks
      prob.peaks.names <- c(
        "bases.per.problem", "problem.i", "problem.name",
        "chromStart", "chromEnd", problem.dot)
      prob.peaks <- unique(data.frame(peaks)[, prob.peaks.names])
      prob.peaks$sample.id <- "problems"
      viz.for$coverage <- viz.for$coverage +
        geom_segment(aes(
          chromStart/1e3, 0,
          xend=chromEnd/1e3, yend=0),
          clickSelects="problem.name",
          showSelected=c(problem.dot, "bases.per.problem"),
          data=peaks, size=7, color="deepskyblue")
    }
    modelSelection.dt <- PSJ$modelSelection.by.problem[[problem.dot]]
    modelSelection.dt[[problem.dot]] <- modelSelection.dt$peaks
    viz.for$modelSelection <- viz.for$modelSelection+
      geom_tallrect(aes(
        xmin=min.log.lambda, 
        xmax=max.log.lambda), 
        clickSelects=problem.dot,
        showSelected=c("problem.name", "bases.per.problem"),
        data=modelSelection.dt, alpha=0.5)
  }
}))
   user  system elapsed 
  0.946   0.000   0.946 

Note the timing of the code above (about 0.9 seconds elapsed), which only evaluates the R code that defines this data viz. Next, we compile the data visualization.

before.for <- Sys.time()
viz.for
Warning in checkSingleShowSelectedValue(meta$selectors): showSelected variables
with only 1 level: chr11.118184422.118184700peaks,
chr11.118192951.118193582peaks, chr11.118203893.118204314peaks
elapsed.for <- difftime(Sys.time(), before.for, units="secs")
cat(elapsed.for, "seconds\n")
68.54037 seconds

Note that the compilation takes more than a minute, since there are so many geoms (click Show download status table to see all of them). Compared to the data visualization from the previous section, this one has

  • blue segments that appear in the top coverage data plot, to indicate predicted peaks.
  • selection rectangles that can be clicked in the bottom right plot, to change the number of samples with a peak in the selected problem.

In the next section we will create the same data viz, but more efficiently.
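The reason the loop creates so many geoms can be seen in a toy base R sketch (made-up problem names and values, not the PSJ data). In the loop style, each problem gets its own data frame with a selector column named after the problem, so each one needs its own layer. The alternative is a single long table in which the selector name is stored in a column.

```r
toy.by.problem <- list(
  probA=data.frame(peaks=c(0, 1)),
  probB=data.frame(peaks=c(0, 1, 2)))
# Loop style: one data frame per problem, each with a column named after
# the problem, so each one needs its own layer.
loop.data <- list()
for(problem.dot in names(toy.by.problem)){
  one.problem <- toy.by.problem[[problem.dot]]
  one.problem[[problem.dot]] <- one.problem$peaks
  loop.data[[problem.dot]] <- one.problem
}
# Long style: a single table, which a single layer can display.
long.list <- lapply(names(toy.by.problem), function(problem.dot){
  data.frame(problem.dot=problem.dot, toy.by.problem[[problem.dot]])
})
long.data <- do.call(rbind, long.list)
cat(length(loop.data), "tables versus 1 table with",
    nrow(long.data), "rows\n")
```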

14.5 Add interactivity using named clickSelects and showSelected

In this section we use named clickSelects and showSelected to create a more efficient version of the previous data visualization. In general, any data visualization defined using for loops in R code can be made more efficient by instead using this method. First, we define some common data.

(sample.peaks <- pdot(PSJ$peaks.by.problem))
                         problem.dot bases.per.problem problem.i
   1: chr11.118172963.118174591peaks              1629         1
   2: chr11.118172963.118174591peaks              1629         1
  ---                                                           
1998: chr11.118194818.118266077peaks            104267         2
1999: chr11.118194818.118266077peaks            104267         2
                   problem.name problemStart problemEnd peaks  sample.id
   1: chr11:118172963-118174591    118172963  118174591     1 McGill0091
   2: chr11:118172963-118174591    118172963  118174591     2 McGill0091
  ---                                                                   
1998: chr11:118194818-118266077    118194818  118266077     4 McGill0091
1999: chr11:118194818-118266077    118194818  118266077     4 McGill0322
      chromStart  chromEnd     mean
   1:  118173535 118173819 3.802817
   2:  118173535 118173819 3.802817
  ---                              
1998:  118209753 118218118 2.190317
1999:  118209753 118218118 2.810879

The output above shows a table with one row per peak that can be displayed, for different samples, problems, and interactive choices of bases.per.problem and peaks parameters. Note the problem.dot column which defines the name of the selector that will store the currently selected number of peaks for that problem.
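The pdot() function used above is an instance of the list of data tables idiom. As a sketch on a made-up list (not the PSJ data), the same result can likely be obtained in one call with the idcol argument of data.table::rbindlist(), which stores the list names in a new column.

```r
library(data.table)
# Same definition as earlier in this chapter.
pdot <- function(L){
  out_list <- list()
  for(i in seq_along(L)){
    out_list[[i]] <- data.table(
      problem.dot=names(L)[[i]], L[[i]])
  }
  rbindlist(out_list)
}
# Made-up list with one small table per problem.
toy.list <- list(
  probA=data.table(peaks=0:1, errors=c(2, 0)),
  probB=data.table(peaks=0:2, errors=c(3, 1, 1)))
by.loop <- pdot(toy.list)
# idcol puts the list names in a first column named problem.dot.
by.idcol <- rbindlist(toy.list, idcol="problem.dot")
print(by.idcol)
```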

In the code below, the main idea is that for every problem, there is a selector defined by the problem.dot column, for the number of peaks in that problem. We use showSelected=c(problem.dot="peaks") and clickSelects=c(problem.dot="peaks") to indicate that the selector name is found in the problem.dot column, and the selection value is found in the peaks column. The animint2dir() compiler creates a selection variable for every unique value of problem.dot (and it uses corresponding values in peaks to set/update the selected value/geoms).

print(timing.named.construct <- system.time({
  viz.named <- viz.first
  viz.named$title <- "PSJ named clickSelects and showSelected"
  viz.named$coverage <- viz.named$coverage+
    geom_segment(aes(
      chromStart/1e3, 0,
      xend=chromEnd/1e3, yend=0),
      clickSelects="problem.name",
      showSelected=c(problem.dot="peaks", "bases.per.problem"),
      data=sample.peaks, size=7, color="deepskyblue")
  viz.named$modelSelection <- viz.named$modelSelection+
    geom_tallrect(aes(
      xmin=min.log.lambda,
      xmax=max.log.lambda),
      clickSelects=c(problem.dot="peaks"),
      showSelected=c("problem.name", "bases.per.problem"),
      data=all.modelSelection, alpha=0.5)
}))
   user  system elapsed 
  0.003   0.000   0.002 

The timing above shows that it takes much less time (0.002 versus 0.946 seconds elapsed) to evaluate the R code above, which uses named clickSelects and showSelected. We compile it below.

before.named <- Sys.time()
viz.named
elapsed.named <- difftime(Sys.time(), before.named, units="secs")
cat(elapsed.named, "seconds\n")
7.336688 seconds

The animint produced above should appear to be the same as the other data viz from the previous section. The timings above show that named clickSelects and showSelected are much faster than for loops, in both the definition and compilation steps.
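As a rough mental model of what the named syntax asks the compiler to do (a base R sketch on made-up data, not the animint2 implementation), each unique value of problem.dot becomes a selector, the peaks column gives the values that selector can take, and a row is shown when its peaks value equals the current value of its own selector.

```r
toy.models <- data.frame(
  problem.dot=c("probA", "probA", "probB", "probB", "probB"),
  peaks=c(0, 1, 0, 1, 2),
  stringsAsFactors=FALSE)
# One selector per unique problem.dot, with its possible peaks values.
selector.values <- lapply(
  split(toy.models$peaks, toy.models$problem.dot), unique)
# Hypothetical current selection: 1 peak in probA, 2 peaks in probB.
selection <- c(probA=1, probB=2)
# Analog of showSelected=c(problem.dot="peaks"): compare each row's peaks
# with the selection of the selector named in its problem.dot column.
is.shown <- toy.models$peaks == unname(selection[toy.models$problem.dot])
print(toy.models[is.shown, ])
```

A single layer and a single table are enough here, however many problems there are, which is the source of the savings measured in this section.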

14.6 Disk usage comparison

In this section we compute the disk usage of the three data visualizations created above.

viz.dirs.vec <- c("ch14incomplete", "ch14for", "ch14named")
viz.dirs.text <- paste(viz.dirs.vec, collapse=" ")
(cmd <- paste("du -ks", viz.dirs.text))
[1] "du -ks ch14incomplete ch14for ch14named"
(kb.dt <- fread(cmd=cmd, col.names=c("kilobytes", "path")))
   kilobytes           path
1:       980 ch14incomplete
2:      3112        ch14for
3:      1400      ch14named

The table above shows that the data viz defined using for loops takes more than twice as much disk space (3112 kilobytes) as the data viz that used named clickSelects and showSelected (1400 kilobytes).
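The du command is not available on every system. A portable alternative, sketched here with a hypothetical helper dir_kb() that is not part of animint2, sums file sizes with base R. Its numbers can differ somewhat from du, which counts disk blocks rather than bytes.

```r
# Hypothetical helper: total size in kilobytes of all files under path.
dir_kb <- function(path){
  files <- list.files(path, recursive=TRUE, full.names=TRUE)
  sum(file.size(files))/1024
}
# Demonstration on a temporary directory with one 2048 byte file.
toy.dir <- tempfile("toy_viz")
dir.create(toy.dir)
writeBin(raw(2048), file.path(toy.dir, "data.tsv"))
print(dir_kb(toy.dir))
```

For this chapter, sapply(viz.dirs.vec, dir_kb) would give one value per directory, assuming the three directories exist in the working directory.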

14.7 Chapter summary and exercises

The table below summarizes the disk usage and timings presented in this chapter.

data.table(
  kb.dt,
  construct.seconds=c(
    timing.incomplete.construct[["elapsed"]],
    timing.for.construct[["elapsed"]],
    timing.named.construct[["elapsed"]]),
  compile.seconds=as.numeric(c(
    elapsed.incomplete,
    elapsed.for,
    elapsed.named)))
   kilobytes           path construct.seconds compile.seconds
1:       980 ch14incomplete             0.002        7.648319
2:      3112        ch14for             0.946       68.540370
3:      1400      ch14named             0.002        7.336688

It is clear from the table above that named clickSelects and showSelected are more efficient in terms of disk usage, construction time, and compilation time, and should be used instead of for loops.
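To quantify the comparison, the sketch below copies the for loop and named rows of the table above into a data frame and divides one by the other. The construction ratio should be read loosely, since 0.002 seconds is close to the resolution of system.time().

```r
# Values copied from the summary table above.
summary.df <- data.frame(
  kilobytes=c(3112, 1400),
  construct.seconds=c(0.946, 0.002),
  compile.seconds=c(68.540370, 7.336688),
  row.names=c("ch14for", "ch14named"))
# How many times larger or slower the for loop version is.
for.over.named <- summary.df["ch14for", ]/summary.df["ch14named", ]
print(round(for.over.named, 1))
```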

Exercises:

  • Use named clickSelects and showSelected to create a visualization which demonstrates over- and under-fitting, as in this visualization of linear model and nearest neighbors.
  • Use named clickSelects and showSelected to create a visualization of some data from your domain of expertise.

Next, Chapter 15 explains how to visualize root-finding algorithms.