Showing data distribution using Gene Plot


Friday, February 3, 2012


The Gene Plot function has been reworked: new features have been added, known bugs have been fixed, and the interface is more informative.

Enhancements/changes on the four graph types are summarized here:

Type 1 - previously only the "average value curve" was drawn. Now the value distribution at each data point is displayed.

Type 2 - same as before, but now named "Spaghetti plot".

Type 3 - same as before: five functional parts of genes are plotted individually. As in Type 1, the value distribution at each data point is displayed.

Type 4 - hierarchical or k-means clustering. Negative values are now visualized correctly.

In Types 1 and 3, the value distribution is presented either as boxplots (using R) or as "quartiles & extremes" curves (using the Google Chart service). The details follow.

To run Gene Plot as a standalone app, click Apps » and select Gene Plot from the apps list:

Showing data distribution using Gene Plot 0 1.png

The Gene Plot panel is then displayed:

Showing data distribution using Gene Plot 1 2.png

At Step 0, enter a list of genes or coordinates into the text area. The sample list of cytochrome P450 genes is used here:

CYP4Z1

CYP2A7

CYP2A6

CYP3A4

CYP1A1

CYP4V2

CYP51A1

CYP2C19

CYP26B1

CYP11B2

CYP24A1

CYP4B1

CYP2C8

At Step 1, a heatmap track is selected for this demo. Click Select a heatmap track » and select the track named "H3K9me3 vHMEC":

Showing data distribution using Gene Plot 2 3.png

Then go to Step 2 and look at its interface. By default, the first plot type is chosen (quartiles & extremes):

Showing data distribution using Gene Plot 3 1.png

Check the "plot average values" checkbox at the bottom, then press the button to generate the plot:

Showing data distribution using Gene Plot 4 3.png

In this graph, the histone data over each P450 gene body is summarized into the same number of data points (50). The distribution of the histone data at each sampling point is presented as 6 curves: min, max, lower quartile, upper quartile, median, and average. The average curve can be removed by unchecking the checkbox.
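The summarization described above can be sketched in Python. This is only an illustration, not the browser's actual code: the function names (resample, summarize, distribution_curves) are hypothetical, and averaging the values inside each bin is an assumption, since the page only says the data is "summarized" into 50 points.

```python
import statistics


def resample(values, n_points):
    """Average a variable-length signal into n_points equal-width bins.

    Assumption: each data point is the mean of the raw values in its bin.
    """
    length = len(values)
    out = []
    for i in range(n_points):
        start = i * length // n_points
        # Guarantee at least one value per bin, even for short signals.
        end = max(start + 1, (i + 1) * length // n_points)
        chunk = values[start:end]
        out.append(sum(chunk) / len(chunk))
    return out


def summarize(point_values):
    """Return the six statistics drawn as curves for one data point."""
    q1, median, q3 = statistics.quantiles(point_values, n=4, method="inclusive")
    return {
        "min": min(point_values),
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "max": max(point_values),
        "average": statistics.fmean(point_values),
    }


def distribution_curves(gene_signals, n_points=50):
    """Resample every gene to n_points, then summarize across genes per point.

    Needs at least two genes, because quartiles are undefined for one value.
    """
    resampled = [resample(signal, n_points) for signal in gene_signals]
    return [summarize([row[i] for row in resampled]) for i in range(n_points)]
```

Calling distribution_curves with one list of values per gene returns 50 dictionaries, one per data point. Reading the same key across all 50 gives one curve.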

This graph is interactive: move the cursor over it to get details (data point #, curve type, and value). The lower and upper quartiles represent the 25th and 75th percentiles, so 50% of the data values lie between them. In this example the average and median curves don't differ much, but when outliers are present the median and average can differ greatly.
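The effect of outliers on the average and median can be seen with a small Python sketch. The numbers are made up for illustration and are not taken from the H3K9me3 track. The helper name mean_median_gap is hypothetical.

```python
import statistics

# Made-up values at one data point across several genes.
typical = [1.0, 1.2, 0.9, 1.1, 1.0]
# The same values plus a single extreme gene.
with_outlier = typical + [25.0]


def mean_median_gap(values):
    """Absolute difference between the average and the median."""
    return abs(statistics.fmean(values) - statistics.median(values))


print("gap without outlier:", mean_median_gap(typical))
print("gap with outlier:   ", mean_median_gap(with_outlier))
```

The single extreme value pulls the average far upward while the median barely moves, which is why the two curves separate when outliers are present.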

This graph is generated by the Google Chart service, which is polished and interactive. Rendering with R used to be merely a fallback mechanism, but it now has some features of its own.

At Step 3, select R rendering from the drop-down menu. Notice how the Step 2 panel updates:

Showing data distribution using Gene Plot 5 4.png

Two new options show up when using R for graph Type 1. Generate the plot with the configuration shown above:

Showing data distribution using Gene Plot 6 859440836.png

Boxplots are now used instead of quartile curves. The curve for the average value is still there, but you can turn it off by unchecking the checkbox. A graph generated by R is a still image and is not interactive.
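For readers unfamiliar with boxplots, the numbers behind one box can be sketched as follows. This assumes the conventional Tukey layout (box at the quartiles, whiskers reaching the most extreme values within 1.5 times the interquartile range). The page does not state which options Gene Plot passes to R, and R's own quartile convention may differ slightly from the one used here. The function name boxplot_stats is hypothetical.

```python
import statistics


def boxplot_stats(values, whisker_coef=1.5):
    """Compute box, whisker, and outlier values for one data point.

    Assumption: Tukey-style whiskers. This is not confirmed to match
    the settings used by Gene Plot's R rendering.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_coef * iqr
    high_fence = q3 + whisker_coef * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }
```

Unlike the min/max curves of the quartiles & extremes plot, whiskers defined this way stop short of extreme values and report them separately as outliers.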

Graph Type 3 is similar to Type 1 in using quartile curves or boxplots to represent the data distribution. As an example, select the genomic feature track "vertebrate PhyloP" (sequence conservation data of the human genome against vertebrate genomes), and generate a Type 3 plot for the short list of P450 genes:

Showing data distribution using Gene Plot 7 1193131242.png

The plot shows that gene exons have higher scores, which agrees well with the idea that coding regions tend to be conserved. Average value curves are not shown here. The graph looks like the following when generated by Google Chart:

Showing data distribution using Gene Plot 8 3.png