High-throughput genomics data are commonly summarised in a feature-by-sample matrix or higher-dimensional array. In R, these arrays have traditionally been held in memory, but this is becoming prohibitive for large contemporary datasets, such as those generated by single-cell RNA-seq and other new genomics technologies. Instead, the arrays can be stored on disk, for example in Hierarchical Data Format 5 (HDF5) files.
The Bioconductor project has developed the DelayedArray framework, in which interchangeable ‘backends’ wrap an in-memory, on-disk, or remotely served representation of an array. Whatever the backend, a DelayedArray presents a unified interface to the data that is familiar to users of ordinary R arrays. In this sense, a DelayedArray is to an array as a tibble is to a data frame.
I will provide an overview of the DelayedArray framework, explain the requirements for developing a new DelayedArray backend, and highlight example backends for on-disk and remotely served data. I will also demonstrate how user-created packages can extend the capabilities of the DelayedArray, and how this has enabled analyses of large genomics datasets in R that were previously infeasible.