Analysing Illumina Infinium 450k methylation arrays

Dr Alicia Oshlack, head of Bioinformatics at the Murdoch Children’s Research Institute and formerly of WEHI Bioinformatics, gave today’s Bioinformatics seminar at WEHI. Her topic was “Analysing Human Infinium 450k Methylation Arrays”, in particular the normalisation and quality control issues associated with them. I’ve fleshed out the notes I made during the seminar, below.

Alicia credited Jovana Maksimovic and Lavinia Gordon (members of her lab) with most of the work she was to present today.

DNA methylation

DNA methylation is the earliest discovered, and most widely studied, epigenetic mark (histone methylation is another example of an epigenetic mark). DNA methylation is a dynamic mark in that it can be added and removed during different stages of the cell life-cycle, but importantly it is preserved during mitosis (cell division).

Traditionally (dogmatically?) DNA methylation in a gene’s promoter region results in repression of that gene while DNA methylation across a gene’s body results in gene expression. DNA methylation ‘going awry’ is a hallmark of cancer.

Illumina methylation arrays

There are several experiments/methods to study DNA methylation but all measure the proportion of methylated cells in the study population (this may change with single-molecule sequencing). The methods for studying DNA methylation can be broadly categorised as having either region-level resolution or single base-pair resolution. Some examples of region-level resolution technologies are MeDIP-seq and MBD capture-seq, while whole-genome bisulfite sequencing, reduced representation bisulfite sequencing (RRBS) and Illumina Infinium methylation arrays (such as the 450k BeadChip) are examples of single base-pair resolution assays.

The recently released and published Infinium HumanMethylation450 BeadChip replaces the old 27k Infinium methylation array and provides a cheap way to study DNA methylation at 450,000 CpGs across tens of human samples. I mean ‘cheap’ in comparison to either whole-genome bisulfite sequencing or even RRBS, which are still prohibitively expensive for the study of a large number of samples at this time. I believe Illumina are also developing a version of the 450k chip for mouse. The CpGs targeted by the chip are gene-biased (i.e. favouring CpGs within or close to genes) and Illumina provides extensive annotation for each target. This is a BeadChip assay like Illumina’s genotyping assays.

The 450k chip features two types of probes - Infinium I and Infinium II style probes. Infinium I probes (25% of probes on the chip) assume that every CpG in the probe body is methylated or that every CpG is unmethylated, while Infinium II probes contain degenerate bases (R’s in IUPAC notation) at the G nucleotide in CpGs within the probe body. The Infinium I probes will obviously have binding problems at positions that are partially methylated since they assume uniform methylation status across the probe body. The Infinium probes can accommodate at most three CpGs in the probe body, so there are obviously CpGs that the 450k isn’t able to assay.
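
To make the IUPAC notation concrete: R stands for a purine, i.e. either A or G, so a probe containing R bases matches more than one concrete sequence. The sketch below is only an illustration of that notation; the function name and the short sequences are made up for the example and are not real 450k probe sequences.

```python
from itertools import product

# IUPAC code R denotes a purine: either A or G.
# Only the codes needed for this illustration are listed.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}


def expand_probe(seq):
    """Return every concrete sequence matched by a probe containing R bases.

    Hypothetical helper for illustration; not part of any Illumina tool.
    """
    choices = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    # One R gives two concrete sequences, two R's give four, and so on.
    print(expand_probe("ACRT"))
    print(len(expand_probe("CRACR")))
```

Each additional R doubles the number of sequences the probe has to cover, which gives some intuition for why the number of CpGs allowed in a probe body is capped.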

There are in fact two kinds of Infinium I probes - methylated (M) probes and unmethylated (U) probes. The relative intensity of the M and U signals, typically expressed as M/(M + U), is used as the basis for estimating the methylation status at each CpG. There is only a single kind of Infinium II probe, with one colour reported for methylated CpG (green, I think) and another colour reported for unmethylated CpG (red, I think). These colours are fixed and so no dye swaps are possible (a common technique for estimating technical artefacts in microarrays).
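
A common way to turn the M and U intensities into a single methylation estimate is the 'beta value', the methylated signal as a fraction of the total signal. This wasn't spelled out in the seminar, so treat the following as a minimal sketch: the function name is mine, and the offset of 100 (added to stabilise the estimate when both intensities are low) is the conventional default rather than something Alicia specified.

```python
def beta_value(meth, unmeth, offset=100.0):
    """Estimate the proportion of methylation at one CpG.

    meth, unmeth: non-negative M and U signal intensities.
    offset: small constant that damps the estimate when both signals are weak
            (100 is assumed here as the conventional default).
    """
    if meth < 0 or unmeth < 0:
        raise ValueError("intensities must be non-negative")
    total = meth + unmeth + offset
    if total == 0:
        raise ValueError("total intensity is zero; beta is undefined")
    return meth / total


if __name__ == "__main__":
    print(beta_value(900, 0))      # strongly methylated CpG
    print(beta_value(0, 500))      # unmethylated CpG
    print(beta_value(450, 450))    # roughly half the cells methylated
```

The result always lies between 0 (unmethylated) and 1 (fully methylated), which matches the 'proportion of methylated cells' interpretation above.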

Basically, the 450k chip comprises two very different classes of probes and these probes target different regions of the genome (owing to the assumptions/limitations required when designing the probes). Alicia then went on to discuss the normalisation of these probes.

SWAN normalisation

Normalisation and quality control procedures are essential when analysing genomics data to separate out the technical variation from the biological variation (the interesting stuff). The normalisation procedure proposed by Alicia is named Subset quantile Within Array Normalisation (SWAN). It uses quantile normalisation on probe sets that contain the same number of CpGs, i.e. quantile normalise all probes containing one CpG in the probe body, quantile normalise all probes containing two CpGs in the probe body, etc. Within each probe set, the Infinium I and Infinium II probes are quantile normalised against each other to bring the two classes of probes into line, and this is done separately for the U and M channels. The quantile normalisation is the within-array part of the procedure; it is followed by an across-array normalisation.
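
To illustrate the idea of quantile normalising within subsets, here is a rough sketch in plain Python. It is not the SWAN code from Alicia's lab: it assumes each subset has the same number of Infinium I and Infinium II probes (which a real array would not give you), it handles a single channel, and all function names and the toy intensities are invented for the example.

```python
def quantile_normalise_pair(type1, type2):
    """Quantile normalise two equal-length intensity lists to a shared target.

    The target distribution is the mean of the two sorted lists at each rank.
    Each value is then replaced by the target value at its own rank.
    Ties are broken by original position (Python's sort is stable).
    """
    if len(type1) != len(type2):
        raise ValueError("this simplified sketch needs equal-sized groups")
    target = [(a + b) / 2 for a, b in zip(sorted(type1), sorted(type2))]

    def map_to_target(values):
        # Indices of the values, ordered from smallest to largest intensity.
        order = sorted(range(len(values)), key=lambda i: values[i])
        out = [0.0] * len(values)
        for rank, idx in enumerate(order):
            out[idx] = target[rank]
        return out

    return map_to_target(type1), map_to_target(type2)


def normalise_by_cpg_count(subsets):
    """Apply the pairwise normalisation separately to each probe set.

    subsets maps the number of CpGs in the probe body (1, 2, 3) to a
    (Infinium I intensities, Infinium II intensities) tuple.
    """
    return {n: quantile_normalise_pair(t1, t2) for n, (t1, t2) in subsets.items()}


if __name__ == "__main__":
    toy = {
        1: ([10, 30, 20], [100, 300, 200]),
        2: ([5, 15], [50, 70]),
    }
    for n_cpgs, (norm1, norm2) in normalise_by_cpg_count(toy).items():
        print(n_cpgs, norm1, norm2)
```

After normalisation the Infinium I and Infinium II values in each subset share the same distribution while each probe keeps its rank within its own class. In practice this would be run once for the M channel and once for the U channel.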

Alicia presented several graphs showing that SWAN improved the analysis of 450k chip data compared to either no normalisation or only within-array or across-array normalisation. SWAN is an R function that should soon be available in the minfi Bioconductor R package. minfi and methylumi are two packages for the downstream analysis of Illumina methylation arrays, though Alicia noted that many functions in limma are also useful for analysing these arrays.