# Update (21/02/2014)

The `voom`

method has now been
published in Genome Biology

# Original post

Gordon Smyth
is well known for his development of the **limma** Bioconductor package for
the analysis of differential gene expression using microarrays. More recently
his group has led the way in the development of software for the statistical
analysis of gene expression using RNA-seq with the `edgeR`

Bioconductor package
and the `voom()`

method in **limma**. Today, Gordon spoke about modeling the
variance in RNA-seq data for studying gene expression, in particular using the
`voom()`

method, and contrasted this approach with that taken by `edgeR`

(and
other Poisson/Negative binomial-like methods).

## Mean-variance relationship is basically quadratic for RNA-seq counts

Gordon started by showing that the observed variance in RNA-seq experiments is
comprised of two components - biological variance (i.e. the interesting stuff)
and technical variance (e.g. measurement error). By comparing two biologically
identical, but separately sequenced samples, Gordon showed that the technical
variation is very well modelled by Poisson variation. However, the variation
observed between two different biological samples is greater than Poisson
variation; in statistical terminology the biological variation displays
*overdispersion*.

There have been repeated (and misguided) claims in the literature that RNA-seq
allows experiments to be performed without biological replicates. I think the
reasons why this misguided notion is appealing is because RNA-seq promises to
provide more *precise* estimates of gene expression than was possible with
microarrays and that is interpreted as “more precise measurements means less
replicates required”. This simply isn’t true - the technical variation may be
less in RNA-seq compared to microarrays (debatable) but the large biological
variation between samples still remains. This is why it is **essential** to
have biological replicates in order to derive robust inferences from the data.

The overdispersion of the variance is observed as a quadratic mean-variance
relationship, as opposed to the linear mean-variance relationship implicit in
a Poisson model. All the data analysed by Gordon’s group clearly displays a
quadratic mean-variance structure, reinforcing that methods based on a Poisson
model will not suffice. This led to the development of the negative binomial
model used by `edgeR`

(similar models of the variance are used by `DESeq`

and
others).

The overdispersion parameter for gene $g $, $\varphi_g$, must be
estimated from the data and there are several ways to do this. Once
$\varphi_g$ is estimated the data can be modelled using a generalised
linear model (more specifically, a log linear model). One estimator of
$\varphi_g$ is the so-called common dispersion model where each gene has
the same dispersion value $\varphi_g := \varphi $. A second estimator uses
empirical Bayesian methods to “squeeze” the gene wise $\varphi_g$ towards
the common value $\varphi$. This sort of “squeezing” is necessary for the
small sample sizes common to RNA-seq experiments. A third estimator is the
*trend* estimator employed in `DESeq`

. The estimation of the dispersion
parameter, and which estimator to use, is a field of ongoing research.

## Modelling the variation is more important than getting the distribution right

The `edgeR`

work emphasised the development of *exact conditional tests* based
on the negative binomial distribution. Gordon noted that he now considers this
less important than doing a good job of the mean-variance relationship. The
`voom()`

method is being developed to use this alternate angle to tackle
differential gene expression analysis from RNA-seq data.

The read counts of gene $g$ in sample $i$, $y_{gi}$, do not follow
a Normal distribution, however ,the transformed response variable
$log(y_{gi} + 0.5)$ converges quickly to normality. `voom()`

models the
log counts per million and fits a loess trend line to the scatterplot of
variance vs. mean to create weights that are then fed into a standard
**limma** analysis.

The variance model is at the observation level and the loess trend is robust
against highly variable genes. Gordon presented the results of a simulation
study where `voom()`

bettered `edgeR`

, `DESeq`

and several other popular
RNA-seq analysis methods. Of course, authors always display simulation results
that favour their method, but it seemed like the simulation parameters were
designed to favour the negative binomial-based models, such as `edgeR`

, so I
found the simulation results quite interesting.

## Gene-specific variation must be accounted for in the model

The idea of gene-specific variances has long been accepted in the microarray
world, but is still not widely accepted for the analysis of RNA-seq data.
Gordon presented several plots showing that *tagwise* estimates of dispersion
had better goodness of fit statistics than either the common or trend estimates
of dispersion.

In short,
**gene-specific variance exists and must be accounted for in the analysis**.

## Conclusions

To conclude, Gordon summarised the pros and cons of `edgeR`

and `voom()`

.

`edgeR`

#### Pros

- Fits an intuitive model
- The biological coefficient of variation (the biological variance divided by the mean expression) is interpretable
- Excellent statistical power

#### Cons

- It treats the dispersion as known (once estimated) and so can be a little liberal
- Can’t estimate the optimal prior weight (the prior weight is used in the empirical Bayes shrinking of the dispersion estimates)
- Computationally challenging to program (e.g. fitting $\approx 30,000$ GLMs, one per gene)

`voom()`

#### Pros

- More agnostic to the mean-variance relationship
- A seemingly natural although somewhat ad-hoc fold change shrinkage
- Easily estimates the prior weight
- Holds its size since it tracks the uncertainty of the empirical Bayes estimates throughout the model
- Feeds into many existing
**limma**tools - Wins all comparisons with other methods so far…

#### Cons

- Still determining…