Variance models for RNA-seq

Update (21/02/2014)

The voom method has now been published in Genome Biology

Original post

Gordon Smyth is well known for his development of the limma Bioconductor package for the analysis of differential gene expression using microarrays. More recently his group has led the way in the development of software for the statistical analysis of gene expression using RNA-seq with the edgeR Bioconductor package and the voom() method in limma. Today, Gordon spoke about modeling the variance in RNA-seq data for studying gene expression, in particular using the voom() method, and contrasted this approach with that taken by edgeR (and other Poisson/Negative binomial-like methods).

Mean-variance relationship is basically quadratic for RNA-seq counts

Gordon started by showing that the observed variance in RNA-seq experiments is comprised of two components - biological variance (i.e. the interesting stuff) and technical variance (e.g. measurement error). By comparing two biologically identical, but separately sequenced samples, Gordon showed that the technical variation is very well modelled by Poisson variation. However, the variation observed between two different biological samples is greater than Poisson variation; in statistical terminology the biological variation displays overdispersion.

There have been repeated (and misguided) claims in the literature that RNA-seq allows experiments to be performed without biological replicates. I think the reasons why this misguided notion is appealing is because RNA-seq promises to provide more precise estimates of gene expression than was possible with microarrays and that is interpreted as “more precise measurements means less replicates required”. This simply isn’t true - the technical variation may be less in RNA-seq compared to microarrays (debatable) but the large biological variation between samples still remains. This is why it is essential to have biological replicates in order to derive robust inferences from the data.

The overdispersion of the variance is observed as a quadratic mean-variance relationship, as opposed to the linear mean-variance relationship implicit in a Poisson model. All the data analysed by Gordon’s group clearly displays a quadratic mean-variance structure, reinforcing that methods based on a Poisson model will not suffice. This led to the development of the negative binomial model used by edgeR (similar models of the variance are used by DESeq and others).

The overdispersion parameter for gene $g $, $\varphi_g$, must be estimated from the data and there are several ways to do this. Once $\varphi_g$ is estimated the data can be modelled using a generalised linear model (more specifically, a log linear model). One estimator of $\varphi_g$ is the so-called common dispersion model where each gene has the same dispersion value $\varphi_g := \varphi $. A second estimator uses empirical Bayesian methods to “squeeze” the gene wise $\varphi_g$ towards the common value $\varphi$. This sort of “squeezing” is necessary for the small sample sizes common to RNA-seq experiments. A third estimator is the trend estimator employed in DESeq. The estimation of the dispersion parameter, and which estimator to use, is a field of ongoing research.

Modelling the variation is more important than getting the distribution right

The edgeR work emphasised the development of exact conditional tests based on the negative binomial distribution. Gordon noted that he now considers this less important than doing a good job of the mean-variance relationship. The voom() method is being developed to use this alternate angle to tackle differential gene expression analysis from RNA-seq data.

The read counts of gene $g$ in sample $i$, $y_{gi}$, do not follow a Normal distribution, however ,the transformed response variable $log(y_{gi} + 0.5)$ converges quickly to normality. voom() models the log counts per million and fits a loess trend line to the scatterplot of variance vs. mean to create weights that are then fed into a standard limma analysis.

The variance model is at the observation level and the loess trend is robust against highly variable genes. Gordon presented the results of a simulation study where voom() bettered edgeR, DESeq and several other popular RNA-seq analysis methods. Of course, authors always display simulation results that favour their method, but it seemed like the simulation parameters were designed to favour the negative binomial-based models, such as edgeR, so I found the simulation results quite interesting.

Gene-specific variation must be accounted for in the model

The idea of gene-specific variances has long been accepted in the microarray world, but is still not widely accepted for the analysis of RNA-seq data. Gordon presented several plots showing that tagwise estimates of dispersion had better goodness of fit statistics than either the common or trend estimates of dispersion.

In short, gene-specific variance exists and must be accounted for in the analysis.

Conclusions

To conclude, Gordon summarised the pros and cons of edgeR and voom().

edgeR

Pros

  • Fits an intuitive model
  • The biological coefficient of variation (the biological variance divided by the mean expression) is interpretable
  • Excellent statistical power

Cons

  • It treats the dispersion as known (once estimated) and so can be a little liberal
  • Can’t estimate the optimal prior weight (the prior weight is used in the empirical Bayes shrinking of the dispersion estimates)
  • Computationally challenging to program (e.g. fitting $\approx 30,000$ GLMs, one per gene)

voom()

Pros

  • More agnostic to the mean-variance relationship
  • A seemingly natural although somewhat ad-hoc fold change shrinkage
  • Easily estimates the prior weight
  • Holds its size since it tracks the uncertainty of the empirical Bayes estimates throughout the model
  • Feeds into many existing limma tools
  • Wins all comparisons with other methods so far…

Cons

  • Still determining…