Metadata

From wubrowse wiki
Revision as of 15:37, 20 August 2015 by Dli (Talk | contribs)


Features

  1. Metadata terms are used to annotate browser tracks. Each term has the following properties:
    1. a unique integer as ID, optional but recommended
    2. name in the form of string
    3. free text description, html allowed, optional
    4. color in the form of string, optional
  2. Vocabulary is a structured collection of metadata terms.
    1. Vocabulary structure is hierarchical and can be as complex as a directed acyclic graph, in which one child node can have multiple parent nodes.
    2. Metadata terms (those used for track annotation) can only be leaf nodes.
    3. Non-leaf nodes are not considered metadata terms: they have no associated IDs and cannot be used to annotate tracks.
    4. Both leaf and non-leaf nodes can be shown in the metadata colormap.
    5. Vocabularies always work in the context of a datahub.
      1. Shared vocabulary: independently defined, used as common track attributes by multiple datahubs
      2. Private vocabulary: defined inside one datahub and cannot be used to annotate tracks from other datahubs

Defining metadata vocabulary

A metadata vocabulary is defined as JSON text. The following is a basic example:

{
vocabulary:{
    "Cell lines":["H1 ES cell","IMR90 cell","PBMPC",]
}
}

There is just one level of hierarchy, containing three terms. See the following sections for how to include this metadata in a datahub for track annotation.

This is not a good way to create and use metadata, because terms are identified by their names (in this example, cell line names). Each name must be written in exactly the same form wherever it is used, which is inefficient and error-prone.

A better way:

{
vocabulary:{
    "Cell lines":[1,2,3]
},
terms:{
   1:["H1","H1 ES cells"],
   2:["IMR90","human lung fibroblast cells (IMR90)"],
   3:["PBMPC","peripheral blood mononuclear primary cells"],
},
}

In this new definition, terms are identified by integer IDs (1, 2, 3) and are fully described in the "terms" section. Each value in "terms" is an array: the first element is the term name, and the optional second element is the description. In this way, simple numerical IDs can be used instead of term names for tasks such as track annotation.
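The ID-to-term lookup described above can be sketched in Python. This is only an illustration: the helper name describe_term is hypothetical and is not part of the browser, and the dictionary simply copies the "terms" section of the example.

```python
# Copy of the "terms" section from the example above.
terms = {
    1: ["H1", "H1 ES cells"],
    2: ["IMR90", "human lung fibroblast cells (IMR90)"],
    3: ["PBMPC", "peripheral blood mononuclear primary cells"],
}

def describe_term(term_id, terms):
    """Return (name, description) for a term ID.

    Hypothetical helper: the description is optional in the hub
    format, so None is returned when the array has only a name.
    """
    entry = terms[term_id]
    name = entry[0]
    description = entry[1] if len(entry) > 1 else None
    return name, description

print(describe_term(2, terms))
```

A term defined with a name only, such as 4:["h3K27ac"], would yield None as its description.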

Example with more than one level of hierarchy:

vocabulary:{
    'Human samples':{
        "ES/iPS cell":{
            "ES cell":[1,2],
            "Derived cell":[3],
            },
        "Primary cells":{
            "Blood":[4],
            "Breast":[5],
            },
    }
},
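To make the hierarchy concrete, the following Python sketch walks a nested vocabulary and records, for every leaf term, the chain of non-leaf nodes above it. The function names are hypothetical. The sketch assumes that objects are non-leaf nodes and that array items are leaf terms, as in the example above. A term listed under several parents (the directed acyclic graph case) ends up with several paths.

```python
from collections import defaultdict

# The multi-level vocabulary from the example above, as a Python dict.
vocabulary = {
    "Human samples": {
        "ES/iPS cell": {
            "ES cell": [1, 2],
            "Derived cell": [3],
        },
        "Primary cells": {
            "Blood": [4],
            "Breast": [5],
        },
    }
}

def leaf_paths(node, path=()):
    """Yield (term, path) pairs; dicts are non-leaf nodes, list items are leaf terms."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from leaf_paths(child, path + (name,))
    else:
        for term in node:
            yield term, path

def parents_by_term(vocabulary):
    """Map each leaf term to every ancestor path under which it is listed."""
    result = defaultdict(list)
    for term, path in leaf_paths(vocabulary):
        result[term].append(path)
    return dict(result)

print(parents_by_term(vocabulary)[4])
```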

Shared metadata

Shared metadata is defined in separate files, one file for each vocabulary. See "Publicly shared metadata vocabularies" below for examples from this browser.

Using just one shared metadata vocabulary in the datahub

First, a shared metadata vocabulary must be included in the datahub with the following JSON definition:

{type:"metadata",
vocabulary_file_url:"http://vizhub.wustl.edu/hubSample/hg19/t/metadata",
},

To annotate tracks in the same datahub, use the "metadata" attribute in the track definition. Example of annotation using two terms (addressed by IDs 1 and 2):

metadata:[1,2],

In case of just one term, use:

metadata:3,
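Because the "metadata" attribute accepts either a single ID or an array of IDs, any script that reads hub files has to handle both forms. A minimal Python sketch follows; as_term_list is a hypothetical helper and not something the browser provides.

```python
def as_term_list(value):
    """Return the track's metadata value as a list of term IDs.

    Accepts both forms shown above: a single ID (metadata:3)
    or an array of IDs (metadata:[1,2]).
    """
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]

print(as_term_list(3))       # [3]
print(as_term_list([1, 2]))  # [1, 2]
```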

Using multiple shared metadata vocabularies in the datahub

It is useful to prepare separate metadata vocabularies for independent aspects. This is especially useful for multi-organism data set curation, where the experimental assay types are common across organisms (e.g. both mouse and human have histone modification ChIP-seq assays), while each organism uses its own tissue/cell line vocabulary.

To include multiple metadata vocabularies in the datahub, use the following:

{
type:"metadata",
vocabulary_set:{
        sample:'http://vizhub.wustl.edu/hubSample/hg19/temp/metadata_sample_human',
        assay:'http://vizhub.wustl.edu/hubSample/hg19/temp/metadata_assay',
    },
},

The "vocabulary_set" provides a set of shared vocabularies, indexed by a word (sample and assay in this example).

To annotate a track with terms from either or both vocabularies, use the "metadata" attribute in the following way:

metadata:{sample:1, assay:2},

Note that the value is now a "hash", keyed by the same "sample" and "assay" indexes. The value for each index is a term, or a list of terms, from the respective vocabulary. This "index" notation combines track annotations from multiple vocabularies.

Arrays can be used to hold multiple terms from one vocabulary:

metadata:{sample:[1,10],assay:[2,20]},

Note that the indexes are strictly internal and are not shown anywhere in the browser display, so indexes as short as one letter can be used.
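Since the indexes in a track's "metadata" hash must match those declared in "vocabulary_set", a quick consistency check can catch typos before a hub is loaded. The Python sketch below is only an illustration: unknown_indexes is a hypothetical helper, and how the browser itself reacts to an undeclared index is not described here.

```python
# The vocabulary_set from the example above, as a Python dict.
vocabulary_set = {
    "sample": "http://vizhub.wustl.edu/hubSample/hg19/temp/metadata_sample_human",
    "assay": "http://vizhub.wustl.edu/hubSample/hg19/temp/metadata_assay",
}

def unknown_indexes(track_metadata, vocabulary_set):
    """Return indexes used by a track that vocabulary_set does not declare."""
    return sorted(set(track_metadata) - set(vocabulary_set))

print(unknown_indexes({"sample": [1, 10], "assay": [2, 20]}, vocabulary_set))  # []
print(unknown_indexes({"sample": 1, "tissue": 4}, vocabulary_set))  # ['tissue']
```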

Publicly shared metadata vocabularies

These metadata are built by splitting independent aspects into separate vocabularies. As a result, common concepts can be used across species. The most prominent example is *experimental assays*.

  1. Experimental assays: http://vizhub.wustl.edu/metadata/Experimental_assays
  2. Human samples: http://vizhub.wustl.edu/metadata/human/Samples

Additional vocabularies will be added soon.

Using publicly shared metadata

A datahub with one track annotated by terms from two publicly shared metadata vocabularies that are currently in use by the WashU Browser:

[

{
type:"bedgraph",
url:"http://vizhub.wustl.edu/hubSample/hg19/GSM432686.gz",
name:"bedGraph track A",
mode:"show",
metadata:{sample:[11101],assay:[21004]},
colorpositive:"#ff33cc",
barplot_bg:'#cccccc',
height:40,
},

{
type:"metadata",
vocabulary_set:{
    sample:'http://vizhub.wustl.edu/metadata/human/Samples',
    assay:'http://vizhub.wustl.edu/metadata/Experimental_assays',
},
show_terms: {sample: ["Sample"], assay: ["Assay"]}
},

]

The full example for hg19 genome: http://vizhub.wustl.edu/hubSample/hg19/temp/hubwithPublicMd

Private metadata

The *shared* metadata vocabularies above are defined in separate files, one file for each vocabulary.

A private metadata vocabulary is defined inside a datahub, and can only be used to annotate tracks of that hub.

The following example shows a minimal datahub with a private metadata vocabulary and the annotation of its track:

[

{
type:"bedgraph",
url:"http://vizhub.wustl.edu/hubSample/hg19/GSM469970.gz",
name:"bedGraph track A",
mode:"show",
metadata:[1,5],
},

{
type:"metadata",vocabulary:{
    "epigenetic mark":{
        "dna methylation":[1,2],
        "histone mark":[3,4],
        },
    samples:{
        "fetal sample":[5,6],
        "cell lines":[7,8],
        },
     },
     terms:{
        1:["medip-seq"],
        2:["mre-seq"],
        3:["h3k4me1"],
        4:["h3K27ac"],
        5:["fetal skin"],
        6:["fetal heart"],
        7:["H1ES"],
        8:["IMR90"],
     },
     show_terms:{sample:["samples"]}
},

]
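With a private vocabulary, the track annotations, the vocabulary, and the "terms" section all live in one hub, so they can be checked against each other. The Python sketch below is a hypothetical check and not part of the browser. It assumes the ID-based form shown above, uses a trimmed copy of the hub (display attributes omitted), and reports any track whose term IDs are missing from the vocabulary or from "terms".

```python
# Trimmed copy of the example hub above (URL and display attributes omitted).
hub = [
    {"type": "bedgraph", "name": "bedGraph track A", "metadata": [1, 5]},
    {
        "type": "metadata",
        "vocabulary": {
            "epigenetic mark": {"dna methylation": [1, 2], "histone mark": [3, 4]},
            "samples": {"fetal sample": [5, 6], "cell lines": [7, 8]},
        },
        "terms": {
            1: ["medip-seq"], 2: ["mre-seq"], 3: ["h3k4me1"], 4: ["h3K27ac"],
            5: ["fetal skin"], 6: ["fetal heart"], 7: ["H1ES"], 8: ["IMR90"],
        },
    },
]

def leaf_ids(node):
    """Collect every leaf term found anywhere below a vocabulary node."""
    if isinstance(node, dict):
        ids = set()
        for child in node.values():
            ids |= leaf_ids(child)
        return ids
    return set(node)

def undefined_terms(hub):
    """Map track name to the term IDs it uses that the hub does not define."""
    md = next(entry for entry in hub if entry.get("type") == "metadata")
    # A usable ID must be a leaf of the vocabulary and described in "terms".
    known = leaf_ids(md["vocabulary"]) & set(md["terms"])
    problems = {}
    for entry in hub:
        if entry.get("type") == "metadata" or "metadata" not in entry:
            continue
        used = entry["metadata"]
        used = used if isinstance(used, list) else [used]
        missing = [term for term in used if term not in known]
        if missing:
            problems[entry["name"]] = missing
    return problems

print(undefined_terms(hub))  # {}
```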

Using both shared and private metadata

The following is a minimal datahub that uses shared and private metadata vocabularies together.

[

{
type:"bedgraph",
url:"http://vizhub.wustl.edu/hubSample/hg19/GSM469970.gz",
name:"bedGraph track A",
mode:"show",
metadata:{assay:21001,sample:5},
},

{
type:"metadata",
    vocabulary_set:{
         assay:"http://vizhub.wustl.edu/metadata/Experimental_assays",
         sample:{
                vocabulary:{
                     samples:{
                         "fetal sample":[5,6],
                         "cell lines":[7,8],
                     },
                },
                terms:{
                     5:["fetal skin"],
                     6:["fetal heart"],
                     7:["H1ES"],
                     8:["IMR90"],
                },
          },
     },
     show_terms:{sample:["samples"]},
},

]
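In this combined form, a "vocabulary_set" entry is either a URL string (shared vocabulary) or an inline object with "vocabulary" and "terms" (private vocabulary). The Python sketch below tells the two apart. It is an assumption based only on the example above, and classify_vocabularies is a hypothetical helper.

```python
# The vocabulary_set from the example above, as a Python dict.
vocabulary_set = {
    "assay": "http://vizhub.wustl.edu/metadata/Experimental_assays",
    "sample": {
        "vocabulary": {
            "samples": {
                "fetal sample": [5, 6],
                "cell lines": [7, 8],
            },
        },
        "terms": {
            5: ["fetal skin"],
            6: ["fetal heart"],
            7: ["H1ES"],
            8: ["IMR90"],
        },
    },
}

def classify_vocabularies(vocabulary_set):
    """Label each index "shared" (URL string) or "private" (inline definition)."""
    return {
        index: "shared" if isinstance(value, str) else "private"
        for index, value in vocabulary_set.items()
    }

print(classify_vocabularies(vocabulary_set))
# {'assay': 'shared', 'sample': 'private'}
```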