# Named `clickSelects` and `showSelected`
```{r setup, echo=FALSE}
knitr::opts_chunk$set(fig.path="ch14-figures/")
```
This chapter explains how to use
[named `clickSelects` and `showSelected` variables](/ch06#data-driven-selectors)
for creating data-driven selector names.
This feature makes animint code easier to write, and faster to compile.
Chapter outline:
* We begin by attaching the `PSJ` data set and computing the data to plot.
* We show one method of defining an animint with many selectors, using
for loops. This method is technically correct, but computationally
inefficient.
* We then explain the preferred method for defining an animint with
many selectors, using named `clickSelects` and `showSelected`. This
method is more computationally efficient, and easier to code.
## Download data set {#download}
The example data come from the [PeakSegJoint package](https://github.com/tdhock/PeakSegJoint), which is for peak detection in genomic data sequences.
The code below downloads the data set.
```{r}
if(!requireNamespace("animint2data")){
  remotes::install_github("animint/animint2data")
}
data(PSJ, package="animint2data")
sapply(PSJ, class)
```
Above we see that `PSJ` is a list of several lists and data frames.
## Explore `PSJ` data with static ggplots {#explore-data-with-static-ggplots}
We begin with a plot of some genomic ChIP-seq data, which are sequential data that take large values in active regions (typically actively transcribed genes).
In the code below, we show each sample in a separate panel.
```{r}
library(animint2)
ann.colors <- c(
noPeaks="#f6f4bf",
peakStart="#ffafaf",
peakEnd="#ff4c4c",
peaks="#a445ee")
(gg.cov <- ggplot()+
scale_y_continuous(
"aligned read coverage",
breaks=function(limits){
floor(limits[2])
})+
scale_x_continuous(
"position on chr11 (kilo bases = kb)")+
coord_cartesian(xlim=c(118167.406, 118238.833))+
geom_tallrect(aes(
xmin=chromStart/1e3, xmax=chromEnd/1e3,
fill=annotation),
alpha=0.5,
color="grey",
data=PSJ$filled.regions)+
scale_fill_manual(values=ann.colors)+
theme_bw()+
theme(panel.margin=grid::unit(0, "cm"))+
facet_grid(sample.id ~ ., labeller=function(df){
df$sample.id <- sub("McGill0", "", sub(" ", "\n", df$sample.id))
df
}, scales="free")+
geom_line(aes(
base/1e3, count),
data=PSJ$coverage,
color="grey50"))
```
The figure above shows the raw noisy data as a grey `geom_line()`.
Colored rectangles represent labels that indicate whether or not a peak start or end should be predicted in a given region and sample.
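The two helper functions passed to the scale and the facets above can be tried in isolation. The sketch below is a minimal illustration, assuming a made-up sample identifier and made-up axis limits; neither value is read from `PSJ`.

```r
# Hypothetical sample identifier, for illustration only (not read from PSJ).
toy.id <- "bcell McGill0091"
# Same transformation as in the labeller above: the first space becomes
# a line break, then the "McGill0" prefix is removed.
toy.label <- sub("McGill0", "", sub(" ", "\n", toy.id))
cat(toy.label, "\n")
# Same rule as the breaks function above: a single tick, at the largest
# whole number which is not greater than the upper axis limit.
toy.limits <- c(0, 37.6) # hypothetical y axis limits.
toy.break <- floor(toy.limits[2])
toy.break
```

Short facet labels and a single tick per panel leave more vertical room for the data, which is useful when several samples are stacked.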
Next, we add a panel for segmentation problems, in which an algorithm looked for a common peak across samples.
The algorithm predicts start/end positions for a peak of large data values, in each problem.
The code below computes a table with one row for each such problem.
```{r}
library(data.table)
(show.problems <- data.table(PSJ$problems)[
, y := problem.i/max(problem.i), by=bases.per.problem][])
```
The code above added the `y` column which is used to display the problems in the code below.
```{r}
(gg.cov.prob <- gg.cov+
ggtitle("select problem")+
geom_text(aes(
chromStart/1e3, 0.9,
label=sprintf(
"%d problems mean size %.1f kb",
problems, mean.bases/1e3)),
showSelected="bases.per.problem",
data=PSJ$problem.labels,
hjust=0)+
geom_segment(aes(
problemStart/1e3, y,
xend=problemEnd/1e3, yend=y),
showSelected="bases.per.problem",
clickSelects="problem.name",
size=5,
data=show.problems))
```
Above we see that a `problems` panel was added at the bottom of the previous plot.
The static graphic above is overplotted; the interactive version will be readable because it will only show one value of `bases.per.problem` at a time.
To select the different values of `bases.per.problem` (problem size), we will use another plot, which shows the best error rate for each problem size, as in the data below.
```{r}
(res.error <- data.table(PSJ$error.total.chunk))
```
The table above has one row per value of `bases.per.problem`, a sliding window size parameter that we will explore interactively.
We use these data to draw the plot below.
```{r}
(gg.res.error <- ggplot()+
ggtitle("select problem size")+
ylab("minimum incorrectly predicted labels")+
geom_line(aes(
bases.per.problem, errors),
data=res.error)+
geom_tallrect(aes(
xmin=min.bases.per.problem,
xmax=max.bases.per.problem),
clickSelects="bases.per.problem",
alpha=0.5,
data=res.error)+
scale_x_log10())
```
The figure above shows the minimum number of label errors as a function of problem size.
Grey rectangles will be used to select the problem size.
There is a penalty parameter which controls the number of samples with a common peak, as defined in the model selection data table in the code below.
```{r}
pdot <- function(L){
  out_list <- list()
  for(i in seq_along(L)){
    out_list[[i]] <- data.table(
      problem.dot=names(L)[[i]], L[[i]])
  }
  rbindlist(out_list)
}
(all.modelSelection <- pdot(PSJ$modelSelection.by.problem))
```
The code above uses the `pdot()` function, which uses the [list of data tables idiom](/ch99#list-of-data-tables) to add a column named `problem.dot`, which will be used below to define selectors in the interactive visualization.
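The idiom used in `pdot()` can be checked on a small example. The sketch below uses a made-up list of two data tables (the names `toy.list`, `probA`, and `probB` are hypothetical), and compares a hand-written loop with the `idcol` argument of `rbindlist()`, which should give an equivalent result when the list is named.

```r
library(data.table)
# Hypothetical named list of data tables, one per problem.
toy.list <- list(
  probA=data.table(peaks=0:1, errors=c(2, 0)),
  probB=data.table(peaks=0:2, errors=c(1, 0, 3)))
# Loop version, same logic as pdot() above.
toy.out <- list()
for(i in seq_along(toy.list)){
  toy.out[[i]] <- data.table(
    problem.dot=names(toy.list)[[i]], toy.list[[i]])
}
toy.loop <- rbindlist(toy.out)
# Alternative: rbindlist stores the list names in the idcol column.
toy.idcol <- rbindlist(toy.list, idcol="problem.dot")
all.equal(toy.loop, toy.idcol)
```

Either way, the result has one row per element of the original tables, plus a `problem.dot` column which records where each row came from.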
Below we plot the number of peaks and label errors, as a function of the penalty parameter of the algorithm.
```{r}
long.modelSelection <- melt(
data.table(all.modelSelection)[, errors := as.numeric(errors)],
measure.vars=c("peaks","errors"))
log.lambda.range <- all.modelSelection[, c(
min(max.log.lambda), max(min.log.lambda))]
modelSelection.labels <- unique(all.modelSelection[, data.table(
problem.name,
bases.per.problem,
problemStart,
problemEnd,
log.lambda=mean(log.lambda.range),
peaks=max(peaks)+0.5)])
(gg.model.selection <- ggplot()+
scale_x_continuous("log(penalty)")+
geom_segment(aes(
min.log.lambda, value,
xend=max.log.lambda, yend=value),
showSelected=c("bases.per.problem", "problem.name"),
data=long.modelSelection,
size=5)+
geom_text(aes(
log.lambda, peaks,
label=sprintf(
"%.1f kb in problem %s",
(problemEnd-problemStart)/1e3, problem.name)),
showSelected=c("bases.per.problem", "problem.name"),
data=data.frame(modelSelection.labels, variable="peaks"))+
ylab("")+
facet_grid(variable ~ ., scales="free"))
```
The figure above shows `errors` (top panel) and `peaks` (bottom panel) as a function of `log(penalty)`. Again, this static version is overplotted; the interactive version will be readable, because it will show only the subset of data corresponding to the selected values of `bases.per.problem` and `problem.name`.
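The reshaping step done by `melt()` above may be easier to follow on a small table. The sketch below uses made-up numbers (the table `toy.wide` is hypothetical); it converts the `peaks` and `errors` columns into a `variable` column and a `value` column, which is what allows `facet_grid(variable ~ .)` to draw one panel per measurement.

```r
library(data.table)
# Hypothetical wide table: one row per penalty interval.
# Both measure columns are numeric, so melt() does not need to coerce types.
toy.wide <- data.table(
  min.log.lambda=c(-Inf, 1.5),
  max.log.lambda=c(1.5, Inf),
  peaks=c(2, 0),
  errors=c(0, 1))
# Long table: one row per combination of penalty interval and measurement.
toy.long <- melt(toy.wide, measure.vars=c("peaks", "errors"))
toy.long
```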
## Interactive data visualization (incomplete) {#incomplete}
In this section, we combine the ggplots from the previous section into a linked interactive data visualization.
The code below uses `theme_animint()` to attach some display options to the previous coverage plot, and adds the `first` option to specify what data subsets should be displayed first.
```{r}
timing.incomplete.construct <- system.time({
coverage.counts <- table(PSJ$coverage$sample.id)
facet.rows <- length(coverage.counts)+1
viz.incomplete <- animint(
first=list(
bases.per.problem=6516,
problem.name="chr11:118174946-118177139"),
coverage=gg.cov.prob+theme_animint(
last_in_row=TRUE, colspan=2,
width=800, height=facet.rows*100),
resError=gg.res.error,
modelSelection=gg.model.selection)
})
```
The code above is fast to evaluate; its timing is saved in `timing.incomplete.construct`, and is reported in the summary table at the end of this chapter.
Rendering this preliminary (incomplete) data viz in the code below is also fast.
```{r ch14incomplete}
before.incomplete <- Sys.time()
viz.incomplete
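# units="secs" ensures seconds, even if rendering takes more than a minute.
cat(elapsed.incomplete <- difftime(Sys.time(), before.incomplete, units="secs"), "seconds\n")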
```
We see a data visualization above with three plots.
* On top, four ChIP-seq data profiles are shown, along with a problems panel which divides the X axis into problems in which the segmentation algorithm runs.
* Clicking the bottom left plot selects problem size, which updates the problems displayed on top.
* The bottom right plot shows the number of peaks and errors as a function of penalty (larger penalty values yield fewer peaks).
## Add interactivity using for loops {#define-using-for-loops}
In this section, we add layers to the previous ggplots using a for loop, which is sub-optimal, but we show it for comparison with the better approach (named `clickSelects` and `showSelected`) which is presented in the next section.
The visualization in the previous section is incomplete. We would like to add:
* rectangles in the bottom right plot, which would allow us to select the number of peaks predicted in the given problem.
* segments and rectangles in the top plot, which would show the predicted peaks and label errors.
One (inefficient) way of adding those would be via a for loop, which is coded below.
For every problem there is a selector for the number of peaks in that problem; the name of that selector is stored in the loop variable `problem.dot`.
So in this for loop we add a few layers with `clickSelects=problem.dot` or `showSelected=problem.dot` (without quotes, so that the value of the loop variable is used) to the `coverage` and `modelSelection` plots.
```{r}
viz.first <- viz.incomplete
viz.first$first <- c(viz.incomplete$first, PSJ$first)
viz.first$modelSelection <- viz.first$modelSelection+
ggtitle("select number of samples with peak in problem")
print(timing.for.construct <- system.time({
viz.for <- viz.first
viz.for$title <- "PSJ with for loops"
for(problem.dot in names(PSJ$modelSelection.by.problem)){
if(problem.dot %in% names(PSJ$peaks.by.problem)){
peaks <- PSJ$peaks.by.problem[[problem.dot]]
peaks[[problem.dot]] <- peaks$peaks
prob.peaks.names <- c(
"bases.per.problem", "problem.i", "problem.name",
"chromStart", "chromEnd", problem.dot)
prob.peaks <- unique(data.frame(peaks)[, prob.peaks.names])
prob.peaks$sample.id <- "problems"
viz.for$coverage <- viz.for$coverage +
geom_segment(aes(
chromStart/1e3, 0,
xend=chromEnd/1e3, yend=0),
clickSelects="problem.name",
showSelected=c(problem.dot, "bases.per.problem"),
data=peaks, size=7, color="deepskyblue")
}
modelSelection.dt <- PSJ$modelSelection.by.problem[[problem.dot]]
modelSelection.dt[[problem.dot]] <- modelSelection.dt$peaks
viz.for$modelSelection <- viz.for$modelSelection+
geom_tallrect(aes(
xmin=min.log.lambda,
xmax=max.log.lambda),
clickSelects=problem.dot,
showSelected=c("problem.name", "bases.per.problem"),
data=modelSelection.dt, alpha=0.5)
}
}))
```
Note the timing of the code above, which just evaluates the R code that defines this data viz. Next, we compile the data visualization.
```{r ch14for}
before.for <- Sys.time()
viz.for
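# units="secs" ensures seconds, even if rendering takes more than a minute.
cat(elapsed.for <- difftime(Sys.time(), before.for, units="secs"), "seconds\n")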
```
Note that the compilation takes several seconds, since there are so many geoms (click "Show download status table" to see all of them).
Compared to the data visualization from the previous section, this one has
* blue segments that appear in the top coverage data plot, to indicate predicted peaks.
* selection rectangles that can be clicked in the bottom right plot, to change the number of samples with a peak in the selected problem.
In the next section we will create the same data viz, but more efficiently.
## Add interactivity using named `clickSelects` and `showSelected` {#define-using-named}
In this section we use named `clickSelects` and `showSelected` to create a
more efficient version of the previous data visualization. In general,
any data visualization defined using for loops in R code can be made
more efficient by instead using this method.
First, we define some common data.
```{r}
(sample.peaks <- pdot(PSJ$peaks.by.problem))
```
The output above shows a table with one row per peak that can be displayed, for different samples, problems, and interactive choices of `bases.per.problem` and `peaks` parameters.
Note the `problem.dot` column which defines the name of the selector that will store the currently selected number of peaks for that problem.
In the code below, the main idea is that for every problem, there is a selector defined by the `problem.dot` column, for the number of peaks in that problem.
We use `showSelected=c(problem.dot="peaks")` and `clickSelects=c(problem.dot="peaks")` to indicate that the selector name is found in the `problem.dot` column, and the selection value is found in the `peaks` column. The `animint2dir()` compiler creates a selection variable for every unique value of `problem.dot` (and it uses corresponding values in `peaks` to set/update the selected value/geoms).
```{r}
print(timing.named.construct <- system.time({
viz.named <- viz.first
viz.named$title <- "PSJ named clickSelects and showSelected"
viz.named$coverage <- viz.named$coverage+
geom_segment(aes(
chromStart/1e3, 0,
xend=chromEnd/1e3, yend=0),
clickSelects="problem.name",
showSelected=c(problem.dot="peaks", "bases.per.problem"),
data=sample.peaks, size=7, color="deepskyblue")
viz.named$modelSelection <- viz.named$modelSelection+
geom_tallrect(aes(
xmin=min.log.lambda,
xmax=max.log.lambda),
clickSelects=c(problem.dot="peaks"),
showSelected=c("problem.name", "bases.per.problem"),
data=all.modelSelection, alpha=0.5)
}))
```
The timing above shows that the R code which uses named `clickSelects` and `showSelected` takes much less time to evaluate than the for loop version. We compile it below.
```{r ch14named}
before.named <- Sys.time()
viz.named
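# units="secs" ensures seconds, even if rendering takes more than a minute.
cat(elapsed.named <- difftime(Sys.time(), before.named, units="secs"), "seconds\n")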
```
The animint produced above should look the same as the data viz from the previous section.
The timings above show that named `clickSelects` and `showSelected` are much faster than for loops, in both the definition and compilation steps.
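To make the naming rule concrete, the sketch below lists, for a made-up table, which selectors a named `showSelected=c(problem.dot="peaks")` would refer to, and which values each selector could take. This is only a conceptual illustration in base R (the table `toy.peaks` and its values are hypothetical), and is not the code that the animint2 compiler runs.

```r
# Hypothetical data: two problems, each with its own set of peak counts.
toy.peaks <- data.frame(
  problem.dot=c("probA.peaks", "probA.peaks", "probB.peaks"),
  peaks=c(1, 2, 1))
# One selector per unique value of the name column (problem.dot);
# possible selection values are taken from the value column (peaks).
toy.selectors <- lapply(
  split(toy.peaks$peaks, toy.peaks$problem.dot), unique)
str(toy.selectors)
```

With the for loop approach, each of these selectors needs its own layer and its own data subset; with the named approach, one layer covers all of them.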
## Disk usage comparison {#disk-usage}
In this section we compute the disk usage of the three data visualizations rendered above, in order to compare the two methods.
```{r}
viz.dirs.vec <- c("ch14incomplete", "ch14for", "ch14named")
viz.dirs.text <- paste(viz.dirs.vec, collapse=" ")
(cmd <- paste("du -ks", viz.dirs.text))
(kb.dt <- fread(cmd=cmd, col.names=c("kilobytes", "path")))
```
The table above shows that the data viz defined using for loops takes
about twice as much disk space as the data viz that used named
`clickSelects` and `showSelected`.
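The `du` command is typically only available on Unix-like systems. As a possible alternative, the sketch below sums file sizes using base R; the helper name `dir_kb` is hypothetical. Note that it adds up file sizes in bytes, whereas `du` reports allocated disk blocks, so the two numbers would generally differ somewhat.

```r
# Hypothetical helper: total size of all files under a directory, in kilobytes.
dir_kb <- function(path){
  files <- list.files(
    path, recursive=TRUE, full.names=TRUE, all.files=TRUE)
  sum(file.size(files))/1024
}
# Demonstration on a temporary directory with one file of 2048 bytes.
toy.dir <- file.path(tempdir(), "toy-viz")
dir.create(toy.dir, showWarnings=FALSE)
writeBin(raw(2048), file.path(toy.dir, "data.bin"))
dir_kb(toy.dir)
```

Assuming the three directories in `viz.dirs.vec` exist, `sapply(viz.dirs.vec, dir_kb)` would return one size per data viz.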
## Chapter summary and exercises {#ch14-exercises}
The table below summarizes the disk usage and timings presented in
this chapter.
```{r}
data.table(
kb.dt,
construct.seconds=c(
timing.incomplete.construct[["elapsed"]],
timing.for.construct[["elapsed"]],
timing.named.construct[["elapsed"]]),
compile.seconds=as.numeric(c(
elapsed.incomplete,
elapsed.for,
elapsed.named)))
```
It is clear from the table above that named `clickSelects` and `showSelected` are more efficient in both respects, and should be used instead of for loops.
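As a possible starting point for the exercises below, the following sketch compares the two approaches on a made-up data set (the names `toy.df`, `selector`, and `value` are hypothetical). It only constructs ggplot objects, and assumes that animint2 is installed; passing either plot to `animint()` would be the next step.

```r
library(animint2)
# Hypothetical data: three selectors (s1, s2, s3), each with two values.
toy.df <- data.frame(
  selector=rep(c("s1", "s2", "s3"), each=2),
  value=rep(1:2, times=3),
  x=1:6,
  y=c(2, 4, 1, 3, 5, 6))
# For loop approach: one layer (and one data subset) per selector.
gg.toy.for <- ggplot()
for(selector.name in unique(toy.df$selector)){
  one.df <- toy.df[toy.df$selector == selector.name, ]
  # Column named after the selector, as in the for loop earlier in this chapter.
  one.df[[selector.name]] <- one.df$value
  gg.toy.for <- gg.toy.for+
    geom_point(aes(
      x, y),
      showSelected=selector.name,
      data=one.df)
}
# Named approach: one layer; selector names are read from the selector
# column, and selection values from the value column.
gg.toy.named <- ggplot()+
  geom_point(aes(
    x, y),
    showSelected=c(selector="value"),
    data=toy.df)
c(
  for.layers=length(gg.toy.for$layers),
  named.layers=length(gg.toy.named$layers))
```

The number of layers in the for loop version grows with the number of selectors, whereas the named version always uses a single layer.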
Exercises:
* Use named `clickSelects` and `showSelected` to create a visualization which demonstrates over- and under-fitting, as in [this visualization of linear model and nearest neighbors](https://tdhock.github.io/2023-12-04-degree-neighbors/).
* Use named `clickSelects` and `showSelected` to create a visualization of some data from your domain of expertise.
Next, [Chapter 15](/ch15) explains how to visualize root-finding algorithms.