Data Quality Metrics


Download all QC metrics through Data Release 5


The Epigenome Roadmap consortium has undertaken to quantify ChIP-seq, DNaseI and MeDIP short-read sequencing signal quality in a uniform way across experiments and mapping centers. The goals in calculating these quality metrics across all consortium data are to

  1. provide feedback to the centers and potentially flag data sets of poor quality, and
  2. provide downstream users with the quality scores so that they may make informed decisions about integrating and using consortium data.

EDACC is computing data quality metrics on all submitted and published ChIP-seq, DNaseI and MeDIP data, and is providing the metrics to mapping centers and downstream users. The format for providing this information is still being discussed.

Below we give a brief description of each of the methods currently being implemented by EDACC.

The metrics are motivated by the following illustration, which shows short-read sequence tags (each represented by a tiny rectangle) mapped to a 36kb stretch of the human genome for four separate DNaseI experiments: the top two in one cell type, the bottom two in another. Total genome-wide sequencing depth is similar in all four cases, but the apparent signal-to-noise ratio differs between them, in terms of the degree to which tags are concentrated in peaks versus the background. Each metric discussed below assigns a single number to a genome-wide data set to gauge its signal enrichment, an assessment that has typically, up to this point, been made by eye. The values for one of these metrics, SPOT, are displayed for each of the four experiments below.

Methods

Three of the four methods are of a similar flavor: call regions of significant tag enrichment and then measure the percentage of all mapped tags that fall in the enriched regions. The premise is that datasets that are of higher signal quality should have a higher percentage of tags falling in the enriched regions. The first three methods, Simple Poisson, Hotspot/SPOT and FindPeaks, differ only in the methods used to call enriched regions. A fourth method, iROC, aims to quantify the separation in the distribution of signal and noise reads.
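The quantity shared by the three enrichment-based methods can be sketched in a few lines of Python. This is only an illustration, not EDACC's code: the function name, the single-chromosome coordinates, and the half-open interval convention are all assumptions made for the example.

```python
from bisect import bisect_right

def percent_tags_in_regions(tag_positions, regions):
    """Percentage of tags that fall inside enriched regions.

    Assumptions for this sketch: all positions are on one chromosome,
    and `regions` is a sorted list of non-overlapping half-open
    (start, end) intervals.
    """
    if not tag_positions:
        raise ValueError("no tags supplied")
    starts = [start for start, _ in regions]
    inside = 0
    for pos in tag_positions:
        # Index of the last region starting at or before this tag.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < regions[i][1]:
            inside += 1
    return 100.0 * inside / len(tag_positions)
```

Whichever caller produces the regions (fixed windows, hotspots, or peaks), the quality value is this same percentage; only the region-calling step changes.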

Simple Poisson

In this method, the genome is tiled into 1kb non-overlapping windows, and a Poisson distribution is fit to the counts of mapped tags falling in each 1kb window. Windows whose counts are significant at a p-value threshold of 0.01 under the fitted Poisson (that is, whose counts are higher than would be expected by chance at that level) are deemed to be enriched. The quality value is the percentage of all tags falling in the enriched windows.
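A minimal sketch of this procedure follows. It assumes a single chromosome with 0-based tag positions, fits the Poisson rate as the mean count per window, and uses a one-sided upper-tail p-value; the source does not spell out these details, so treat them (and the function names) as assumptions.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation.

    Adequate for the small per-window rates assumed here; a very large
    lam would underflow math.exp(-lam).
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def simple_poisson_quality(tag_positions, genome_length, window=1000, alpha=0.01):
    """Percentage of tags in windows enriched at p < alpha."""
    if not tag_positions:
        raise ValueError("no tags supplied")
    n_windows = math.ceil(genome_length / window)
    counts = [0] * n_windows
    for pos in tag_positions:
        counts[pos // window] += 1
    lam = sum(counts) / n_windows   # maximum-likelihood Poisson rate
    enriched = sum(c for c in counts
                   if c > 0 and poisson_upper_tail(c, lam) < alpha)
    return 100.0 * enriched / len(tag_positions)
```

For example, 50 tags piled into one window of a ten-window toy genome, with one tag in each remaining window, would give a score of 100 * 50 / 59, since only the first window passes the threshold.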

Hotspot/SPOT

The Hotspot algorithm is a scan statistic that gauges enrichment with a z-score based on the binomial distribution. The method assigns significance based on a local background estimate, thus correcting for local elevation of tag levels due to segmental duplications, copy number events, etc. Regions of significant enrichment are called "hotspots," and can range in size from 10bp to several kb. A description of the method, as well as downloadable software can be found here.
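The text above does not give the exact statistic, so the following is only a generic illustration of a binomial z-score against a local background, not the Hotspot implementation. It assumes that each of the tags in a large background window lands in a small target window with probability equal to the ratio of the window sizes; the window sizes and the function name are hypothetical.

```python
import math

def binomial_z(n_small, n_background, small_bp=250, background_bp=50000):
    """z-score for n_small tags in a small window, given n_background
    tags in the surrounding background window.

    Window sizes are illustrative assumptions, not values from the source.
    """
    p = small_bp / background_bp                    # chance a background tag hits the small window
    expected = n_background * p                     # binomial mean
    sd = math.sqrt(n_background * p * (1.0 - p))    # binomial standard deviation
    return (n_small - expected) / sd
```

Because the expectation is taken from the local background rather than the genome-wide rate, a region with uniformly elevated tag density (for instance from a copy number gain) raises both the observed and the expected count, which is the correction the paragraph describes.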

For a given dataset, hotspots are called and the quality value is calculated as the percentage of all tags in hotspots (SPOT, or Signal Portion Of Tags).

FindPeaks

FindPeaks is an algorithm for calling peaks in ChIP-seq data. It includes FDR estimates for thresholding peaks. The software is maintained in SourceForge as part of the Vancouver Short Read Analysis Package.

For a given dataset, peaks are called at an FDR threshold of 1% and the quality value returned in this case is the percentage of all tags falling in those peaks.

iROC

This method is based on the assumption that mapped reads will fall into two distinct populations, signal and noise, that can each be described by a Poisson distribution. As with the Simple Poisson method, the genome is binned into 1kb windows. Joint Poissons are then fit to the tag counts per window.

The quality value returned by this method is the quantification of the separation of the two distributions, as follows.

A given threshold (dashed vertical line, below) applied to the continuous signal and noise distributions defines four theoretical populations, representing true and false positives, and true and false negatives.

Thus a theoretical value of sensitivity and specificity can be plotted on an ROC curve for the continuum of thresholds. The greater the separation between the two distributions, the greater the integral of the ROC curve (iROC), which is the quality value thus returned.
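The idea can be sketched as follows. This is a hedged illustration, not the EDACC implementation: it assumes that "joint Poissons" means a two-component Poisson mixture, fits that mixture with a basic EM loop (the source does not name the fitting procedure), and evaluates the area under the theoretical ROC curve directly on the discrete counts. All function names are hypothetical.

```python
import math
from collections import Counter

def log_poisson_pmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def fit_two_poisson_mixture(counts, n_iter=200):
    """EM fit; returns (signal_weight, lam_noise, lam_signal)."""
    hist = Counter(counts)          # work on distinct count values
    n = len(counts)
    mean = sum(counts) / n
    w, lam_noise, lam_signal = 0.1, 0.5 * mean + 1e-3, 2.0 * mean + 1.0
    for _ in range(n_iter):
        sum_r = sum_rc = sum_c = 0.0
        for c, m in hist.items():
            # E-step: log-odds of noise versus signal for this count.
            d = (math.log(1.0 - w) + log_poisson_pmf(c, lam_noise)
                 - math.log(w) - log_poisson_pmf(c, lam_signal))
            r = 1.0 / (1.0 + math.exp(min(d, 700.0)))  # P(signal | c)
            sum_r += m * r
            sum_rc += m * r * c
            sum_c += m * c
        # M-step, clamped so the logarithms above stay defined.
        w = min(max(sum_r / n, 1e-9), 1.0 - 1e-9)
        lam_signal = max(sum_rc / max(sum_r, 1e-12), 1e-9)
        lam_noise = max((sum_c - sum_rc) / max(n - sum_r, 1e-12), 1e-9)
    if lam_signal < lam_noise:
        lam_noise, lam_signal, w = lam_signal, lam_noise, 1.0 - w
    return w, lam_noise, lam_signal

def roc_integral(lam_noise, lam_signal):
    """Area under the ROC curve for two Poissons:
    P(S > N) + 0.5 * P(S = N), summed over a truncated support."""
    hi = max(lam_noise, lam_signal)
    k_max = int(hi + 10.0 * math.sqrt(hi) + 20.0)
    area = 0.0
    noise_below = 0.0               # P(N < k)
    for k in range(k_max + 1):
        p_noise = math.exp(log_poisson_pmf(k, lam_noise))
        p_signal = math.exp(log_poisson_pmf(k, lam_signal))
        area += p_signal * (noise_below + 0.5 * p_noise)
        noise_below += p_noise
    return area
```

With identical rates the two distributions coincide and the area is 0.5; as the fitted signal rate pulls away from the noise rate, the area approaches 1.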

Preliminary Metric Comparisons

Below is a scatter plot showing the relationship between the Poisson, FindPeaks and SPOT metrics on a sample of 36 different DNaseI experiments. Overall these metrics track each other in a very consistent manner.

Below is a similar plot relating iROC scores to each of the other three metrics. As mentioned above, iROC is based on a fundamentally different concept than the other three metrics, which are all very similar in spirit. The plot below shows that the experiments that score highest in each of the enrichment-based metrics also score near the top in iROC. But there are also high-iROC experiments that score low in the other metrics. We are currently investigating further to determine what features of the data iROC is capturing relative to the others.