Katie L Burnham,
Julian C Knight
Genome Medicine 2016;8:129 (DOI: 10.1186/s13073-016-0384-y)
We introduce XGR (eXploring Genomic Relations), released as an R package (http://cran.r-project.org/package=XGR) and web-app (http://galahad.well.ox.ac.uk/XGR) enabling downstream knowledge discovery from genomic summary data. Biological interpretation of genomic summary data, such as those resulting from GWAS and eQTL mapping, is a major bottleneck in human disease genomics, calling for efficient and integrative tools designed to resolve this problem. In the past, this kind of research has been less appreciated, but it is becoming increasingly important with the growing availability of ontology, annotation, and network data essential for precision interpretation.
XGR is designed to make a user-defined list of genes or SNPs (or genomic regions) more interpretable by comprehensively utilising ontology and network information to generate more informative results than conventional methods. XGR is unique in supporting a broad range of ontologies (including knowledge of biological and molecular functions, pathways, diseases and phenotypes - in both human and mouse) and different types of networks (including functional, physical and pathway interactions).
In this user manual, you will be guided through the steps necessary to use this tool. After going through the manual, particularly the Showcases section, which includes several demos with published data, you will be able to: 1) perform enrichment analysis using either built-in or custom ontologies, 2) calculate semantic similarity between genes (or between SNPs) based on their ontology annotation profiles, 3) identify a gene subnetwork given your query list of (significant) genes, SNPs or genomic regions, and 4) interpret genomic regions using co-localised functional genomic annotations and using nearby gene annotations by ontologies. End-users who are unfamiliar with R should refer to our user-friendly web app.
R (http://www.r-project.org) is a language and environment for statistical computing and graphics. The latest version on different platforms can be installed:
Mac OS X (download), and
Linux (see below).
Assume you have ROOT (sudo) privilege:
Assume you do NOT have ROOT (sudo) privilege and want R installed under your home directory (
For installation of the XGR package, please follow the instructions below:
XGR (the latest stable release version deposited into CRAN from Bioconductor):
We welcome your feedback, particularly bug reports. To help streamline bug reports and fixes, please file an issue here.
All source data are represented uniformly as well-documented RData-formatted files, taking advantage of the R software environment and its infrastructure packages such as igraph (Csardi and Nepusz 2006) and GenomicRanges (Lawrence et al. 2013). These data are subject to regular updates, and are also regularly supplemented to keep pace with the explosive nature of big data in modern genome biology.
|Function||Gene Ontology Molecular Function||GOMF|
|Function||Gene Ontology Biological Process||GOBP|
|Function||Gene Ontology Cellular Component||GOCC|
|Phenotype||Human Phenotype Phenotypic Abnormality||HPPA|
|Phenotype||Human Phenotype Mode of Inheritance||HPMI|
|Phenotype||Human Phenotype Clinical Modifier||HPCM|
|Phenotype||Human Phenotype Mortality Aging||HPMA|
|Trait||Experimental Factor Ontology||EF|
|Druggability||DGI druggable gene categories||DGIdb|
|Domain||SCOP domain superfamilies||SF|
|Domain||Pfam domain families||Pfam|
|MsigDB||Hallmark gene sets||MsigdbH|
|MsigDB||Chromosome and cytogenetic band positional gene sets||MsigdbC1|
|MsigDB||Chemical and genetic perturbation gene sets||MsigdbC2CGP|
|MsigDB||All pathway gene sets||MsigdbC2CPall|
|MsigDB||Canonical pathway gene sets||MsigdbC2CP|
|MsigDB||KEGG pathway gene sets||MsigdbC2KEGG|
|MsigDB||Reactome pathway gene sets||MsigdbC2REACTOME|
|MsigDB||BioCarta pathway gene sets||MsigdbC2BIOCARTA|
|MsigDB||Transcription factor target gene sets||MsigdbC3TFT|
|MsigDB||microRNA target gene sets||MsigdbC3MIR|
|MsigDB||Cancer gene neighborhood gene sets||MsigdbC4CGN|
|MsigDB||Cancer module gene sets||MsigdbC4CM|
|MsigDB||GO biological process gene sets||MsigdbC5BP|
|MsigDB||GO molecular function gene sets||MsigdbC5MF|
|MsigDB||GO cellular component gene sets||MsigdbC5CC|
|MsigDB||Oncogenic signature gene sets||MsigdbC6|
|MsigDB||Immunologic signature gene sets||MsigdbC7|
|Category||Source||Genomic Annotations||Identifier Codes|
|TFBS||ENCODE||Cell-type-specific TFBS uniformly identified||Uniform_TFBS|
|TFBS||ENCODE||Cell-type-specific clustered TFBS||ENCODE_TFBS_ClusteredV3_CellTypes|
|DHS||ENCODE||Cell-type-specific DHS uniformly identified||Uniform_DNaseI_HS|
|DHS||ENCODE||Cell-type-specific clustered DHS||ENCODE_DNaseI_ClusteredV3_CellTypes|
|Histone Modifications||ENCODE||Cell-type-specific histone modifications||Broad_Histone|
|Histone Modifications||ENCODE||Cell-type-specific histone modifications||SYDH_Histone|
|Histone Modifications||ENCODE||Cell-type-specific histone modifications||UW_Histone|
|Expressed Enhancers||FANTOM5||Cell-type-specific expressed enhancers||FANTOM5_Enhancer_Cell|
|Expressed Enhancers||FANTOM5||Tissue-specific expressed enhancers||FANTOM5_Enhancer_Tissue|
|Expressed Enhancers||FANTOM5||Extensive enhancers||FANTOM5_Enhancer_Extensive|
|Expressed Enhancers||FANTOM5||Full collections of enhancers||FANTOM5_Enhancer|
|Genome Segmentations||ENCODE||Combined genome segmentation for GM12878||Segment_Combined_Gm12878|
|Genome Segmentations||ENCODE||Combined genome segmentation for H1-hESC||Segment_Combined_H1hesc|
|Genome Segmentations||ENCODE||Combined genome segmentation for HeLa S3||Segment_Combined_Helas3|
|Genome Segmentations||ENCODE||Combined genome segmentation for HepG2||Segment_Combined_Hepg2|
|Genome Segmentations||ENCODE||Combined genome segmentation for HUVEC||Segment_Combined_Huvec|
|Genome Segmentations||ENCODE||Combined genome segmentation for K562||Segment_Combined_K562|
|Conserved TFBS||TRANSFAC||PWM human/mouse/rat conserved TFBS||TFBS_Conserved|
|miRNA regulatory sites||TargetScan||miRNA regulatory sites||TS_miRNA|
|Cancer mutations||TCGA||Tumor-type-specific exome mutations||TCGA|
SNP annotations are based on the Experimental Factor Ontology (EFO). EFO standardises GWAS traits from the NHGRI GWAS Catalog using well-defined terms (Welter et al. 2014). Knowledge of co-inherited variants is also used to include additional SNPs that are in Linkage Disequilibrium (LD) with GWAS lead SNPs. LD SNPs are calculated based on the 1000 Genomes Project data (1000 Genomes Project Consortium 2012) and are defined as any SNPs having R² > 0.8 with GWAS lead SNPs. LD information is available for the following populations:
|AFR||African||1000 Genomes Project|
|AMR||Admixed American||1000 Genomes Project|
|EAS||East Asian||1000 Genomes Project|
|EUR||European||1000 Genomes Project|
|SAS||South Asian||1000 Genomes Project|
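The R² threshold used to define LD SNPs refers to the standard squared correlation between alleles at two loci. As a minimal illustration (written in Python rather than XGR's R code; the function names and toy haplotypes are hypothetical, and XGR itself relies on precomputed LD data), R² can be computed from phased haplotypes and used to keep the SNPs in LD with a lead SNP:

```python
def ld_r_squared(haplotypes):
    """Squared allelic correlation (r^2) between two biallelic SNPs.

    haplotypes: list of (a, b) pairs, one per phased haplotype, where a and b
    are 0/1 indicators of the alternative allele at SNP A and SNP B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


def ld_snps(lead_alleles, candidates, threshold=0.8):
    """Return ids of candidate SNPs whose r^2 with the lead SNP exceeds threshold.

    lead_alleles: 0/1 alleles of the lead SNP, one per haplotype.
    candidates: dict mapping SNP id -> 0/1 alleles in the same haplotype order.
    """
    return [snp for snp, alleles in candidates.items()
            if ld_r_squared(list(zip(lead_alleles, alleles))) > threshold]


# Toy data (hypothetical): six haplotypes, one lead SNP and two candidates.
lead = [1, 1, 0, 0, 1, 0]
candidates = {"rsA": [1, 1, 0, 0, 1, 0], "rsB": [1, 0, 1, 0, 1, 0]}
print(ld_snps(lead, candidates))
```

In this toy example only the candidate that is perfectly correlated with the lead SNP passes the 0.8 threshold.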
XGR supports networks of different interaction types (functional, physical, and pathway-derived), of varying interaction quality (highest, high, and medium) and of two interaction directions (directed versus undirected). These are mainly sourced from the STRING database (Szklarczyk et al. 2015) and the Pathway Commons database (Cerami et al. 2011). STRING is a meta-integration of undirected interactions from a functional aspect, while Pathway Commons contains both undirected and directed interactions from a physical/pathway aspect. In addition to interaction type, users can choose interactions of varying quality:
|Identifier Code||Interaction (type and quality)||Database|
|STRING_high||Functional interactions (with high confidence scores>=700)||STRING|
|STRING_medium||Functional interactions (with medium confidence scores>=400)||STRING|
|PCommonsUN_high||Physical/undirected interactions (with references & >=2 sources)||Pathway Commons|
|PCommonsUN_medium||Physical/undirected interactions (with references & >=1 sources)||Pathway Commons|
|PCommonsDN_high||Pathway/directed interactions (with references & >=2 sources)||Pathway Commons|
|PCommonsDN_medium||Pathway/directed interactions (with references & >=1 sources)||Pathway Commons|
|Identifier Code||Interaction (source)||Database|
|PCommonsDN_Reactome||Pathway/directed interactions (only from Reactome)||Pathway Commons|
|PCommonsDN_KEGG||Pathway/directed interactions (only from KEGG)||Pathway Commons|
|PCommonsDN_HumanCyc||Pathway/directed interactions (only from HumanCyc)||Pathway Commons|
|PCommonsDN_PID||Pathway/directed interactions (only from PID)||Pathway Commons|
|PCommonsDN_PANTHER||Pathway/directed interactions (only from PANTHER)||Pathway Commons|
|PCommonsDN_ReconX||Pathway/directed interactions (only from ReconX)||Pathway Commons|
|PCommonsDN_PhosphoSite||Pathway/directed interactions (only from PhosphoSite)||Pathway Commons|
|PCommonsDN_CTD||Pathway/directed interactions (only from CTD)||Pathway Commons|
The functions in the package XGR are categorised into five groups according to the tasks they complete. They are summarised below.
Enrichment functions perform enrichment analysis using one of several statistical tests (Fisher’s exact test, the hypergeometric test, or the binomial test). The test estimates the significance of the overlap between, for example, an input group of genes and the group of genes annotated by an ontology term. By default, all annotatable genes are used as the test background, but the background can also be specified by the user. If ontology terms are organised in a tree-like structure, this structure can also be taken into account to produce more informative results. For non-structured ontologies (eg a collection of pathways), a filtering procedure is also provided to generate non-redundant but informative results.
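To make the overlap test concrete, the following is a minimal sketch of a one-sided hypergeometric test, which gives the same p-value as a one-sided Fisher's exact test on the corresponding 2x2 table. It is written in Python for illustration only and is not XGR's R interface; the function name and the toy counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_overlap, n_input, n_term, n_background):
    """P(X >= n_overlap) when n_input genes are drawn without replacement
    from n_background genes, of which n_term are annotated by the term."""
    max_overlap = min(n_input, n_term)
    total = comb(n_background, n_input)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_input - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy counts (hypothetical): 20 background genes, 5 annotated by the term,
# 4 input genes, 3 of which carry the annotation.
p_value = hypergeom_enrichment_pvalue(n_overlap=3, n_input=4, n_term=5, n_background=20)
print(f"{p_value:.4f}")
```

A smaller background makes the same overlap less surprising, which is why the choice of background (all annotatable genes versus a user-defined set) matters for the resulting p-values.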
xEnricherGenes: conducts gene-based enrichment analysis given a list of genes and the ontology in query. It supports two types of ontologies: 1) structured ontologies including Gene Ontology (Ashburner et al. 2000), Disease Ontology (Schriml et al. 2012), and Phenotype Ontologies in human and mouse (Köhler et al. 2013; Smith and Eppig 2009), and 2) non-structured ontologies/categories; for example, a collection of pathways, gene expression signatures, transcription factor targets, and gene druggable categories.
xEnricherSNPs: conducts SNP-based enrichment analysis using GWAS Catalog traits mapped to Experimental Factor Ontology (Welter et al. 2014). Additional SNPs that are in linkage disequilibrium (LD) with input SNPs can also be included in the enrichment analysis.
xEnricherYours: conducts custom-based enrichment analysis provided with an entity file and an annotation file.
xEnricher: acts as a template for enrichment analysis. It is an internal function upon which high-level functions (ie
xEnrichViewer: views enrichment results as a data frame that is also useful for the subsequent file saving.
xEnrichConciser: makes enrichment results more concise by removing redundant terms. A term is considered redundant if its overlap with a more significant term meets both criteria: the overlap covers more than 95% of the redundant term and more than 50% of the more significant term. In doing so, only non-redundant but informative terms are left.
xEnrichBarplot: visualises enrichment results using a barplot.
xEnrichDAGplot: visualises enrichment results using a DAG plot. This function is only useful for tree-like structured ontologies. Significant terms (of interest) are highlighted by box-shaped nodes, and the others by ellipse nodes.
xEnrichCompare: compares enrichment results using side-by-side barplots. This function is useful when comparing enrichment results for different inputs but based on the same ontology.
xEnrichDAGplotAdv: visualises comparative enrichment results using a DAG plot. This function takes as input the output of the function xEnrichCompare to further illustrate differences and commonalities of comparative enrichment results in the context of the ontology tree.
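The redundancy rule described for xEnrichConciser (95% of the less significant term, 50% of the more significant term) can be sketched as follows. This is an illustrative Python sketch, not the package's implementation; the function name, the input layout, and the assumption that each term is represented by a plain set of genes are all hypothetical.

```python
def concise_terms(terms, min_cover_self=0.95, min_cover_other=0.50):
    """Return names of non-redundant terms, most significant first.

    terms: list of (name, p_value, gene_set) tuples.
    A term is dropped when its overlap with any more significant term covers
    more than min_cover_self of its own genes and more than min_cover_other
    of the genes of the more significant term.
    """
    ordered = sorted(terms, key=lambda t: t[1])  # ascending p-value
    kept = []
    for i, (name, _, genes) in enumerate(ordered):
        redundant = False
        for _, _, better_genes in ordered[:i]:
            if not genes or not better_genes:
                continue
            shared = len(genes & better_genes)
            if (shared / len(genes) > min_cover_self
                    and shared / len(better_genes) > min_cover_other):
                redundant = True
                break
        if not redundant:
            kept.append(name)
    return kept


# Toy enrichment results (hypothetical gene sets and p-values).
results = [
    ("A", 1e-6, {f"g{i}" for i in range(1, 11)}),  # g1..g10
    ("B", 1e-4, {f"g{i}" for i in range(1, 7)}),   # subset of A covering 60% of A
    ("C", 1e-3, {"g1", "g2", "g11", "g12"}),       # only half of C is shared with A
    ("D", 1e-2, {"g1"}),                           # inside A but covers only 10% of A
]
print(concise_terms(results))
```

Term B is removed because it is fully contained in A and covers more than half of it, whereas D survives: it is contained in A but is too small to meet the second criterion.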
Similarity functions conduct similarity analysis by calculating semantic similarity, a type of comparison that assesses the degree of relatedness between two entities (eg genes or SNPs) based on their annotation profiles (by ontology terms) (Pesquita et al. 2009). To do so, the information content (IC) of a term is first defined to measure how informative the term is when used for annotating genes: -log10(frequency of genes annotated to this term). Similarity between two terms is then measured based on IC, usually at the most informative common ancestor (MICA). Finally, similarity between two entities (eg genes) is derived from pairwise term similarity using best-matching based methods: average, maximum, and complete.
xCircos: visualises the similarity results using a circos plot. The degree of similarity between SNPs (or genes) is visualised by the colour of links. This function can be used either to visualise the most similar links or to plot links involving an input SNP (or gene).
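The three steps of semantic similarity described above (information content, term similarity at the MICA, and best-matching gene similarity) can be sketched on a toy ontology. The Python below is illustrative only and is not XGR's R interface: the ontology, the annotations and the function names are hypothetical, term similarity is taken as the IC of the MICA (the Resnik measure), and the best-matching average is assumed to be the mean of the two directional averages, which is one common definition.

```python
from math import log10

# Hypothetical toy ontology: each term maps to its parent terms.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
# Hypothetical direct annotations of genes to terms.
DIRECT = {"g1": {"A1"}, "g2": {"A2"}, "g3": {"B"}, "g4": {"A1", "B"}}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(direct):
    """IC(term) = -log10(fraction of genes annotated to the term), after
    propagating each annotation to all ancestors (true-path rule)."""
    counts = {}
    for terms in direct.values():
        for term in set().union(*(ancestors(t) for t in terms)):
            counts[term] = counts.get(term, 0) + 1
    return {term: -log10(n / len(direct)) for term, n in counts.items()}


def term_similarity(term1, term2, ic):
    """IC of the most informative common ancestor (MICA)."""
    return max(ic[t] for t in ancestors(term1) & ancestors(term2))


def gene_similarity(gene1, gene2, ic, direct=DIRECT):
    """Best-matching average over the direct terms of two genes."""
    terms1, terms2 = direct[gene1], direct[gene2]
    best1 = [max(term_similarity(a, b, ic) for b in terms2) for a in terms1]
    best2 = [max(term_similarity(a, b, ic) for a in terms1) for b in terms2]
    return 0.5 * (sum(best1) / len(best1) + sum(best2) / len(best2))


ic = information_content(DIRECT)
print(round(term_similarity("A1", "A2", ic), 4))  # IC of their shared parent A
print(round(gene_similarity("g1", "g4", ic), 4))
```

Terms annotated to fewer genes receive a higher IC, so two genes that share a specific term score higher than two genes that only share a general one.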
Network functions identify a gene subnetwork from a gene interaction network with node/gene significance information. The node/gene information can be provided directly (eg user-defined genes with the significance level; p-values or FDR); see the function xSubneterGenes. It can also be provided indirectly, for example, via nearby genes of user-defined SNPs with the significance level (eg GWAS reported p-values; see the function xSubneterSNPs), or more generally, via nearby genes of user-defined genomic regions with the significance level (eg differentially methylated regions together with FDR; see the function xSubneterGR). Given a gene interaction network with nodes labelled with gene information, the algorithm searching for a maximum-scoring gene subnetwork has been reported in our previous publication (Fang and Gough 2014) and is summarised as follows:
score transformation, that is, given a threshold of tolerable p-value, nodes with p-values below this threshold (nodes of interest) are scored positively, and nodes with p-values above this threshold (intolerable) are scored negatively;
subnetwork identification, that is, to find an interconnected gene subnetwork enriched with positive-score nodes, but allowing for a few negative-score nodes as linkers;
controlling the subnetwork size, that is, an iterative procedure is provided to fine-tune tolerable thresholds for identifying the gene subnetwork with a desired number of nodes.
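The score transformation step can be illustrated with a simple log-ratio transform. This Python sketch is an assumption made for illustration, not the formula used by the package: it only reproduces the stated property that p-values below the threshold receive positive scores and p-values above it receive negative scores. The function name and the toy p-values are hypothetical.

```python
from math import log10


def transform_scores(pvalues, threshold, floor=1e-300):
    """Map p-values to node scores relative to a tolerable threshold.

    score = log10(threshold) - log10(p), so that p < threshold gives a
    positive score, p > threshold a negative one, and p == threshold zero.
    `floor` guards against log10(0) for p-values reported as exactly 0.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    return {
        gene: log10(threshold) - log10(max(p, floor))
        for gene, p in pvalues.items()
    }


# Toy p-values (hypothetical).
scores = transform_scores({"GENE1": 1e-5, "GENE2": 0.01, "GENE3": 0.5}, threshold=0.01)
for gene, score in scores.items():
    print(gene, round(score, 3))
```

Relaxing the threshold turns more nodes positive and tends to enlarge the resulting subnetwork, while tightening it has the opposite effect; this is the lever that the iterative size-control step adjusts.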
xSubneterGenes: takes as input a list of user-defined genes with the significance level (p-values), superposes these genes onto a gene interaction network, and outputs a maximum-scoring gene subnetwork that contains as many of the most significant (highly scored) genes as possible, along with a few less significant (lower scored) genes as linkers.
xSNP2GeneScores: takes as input a list of user-defined SNPs with the significance level (eg GWAS reported p-values), defines nearby genes, and scores them taking into account the distance to and the significance of input SNPs.
xSubneterSNPs: identifies a gene subnetwork that is likely modulated by input SNPs and/or their Linkage Disequilibrium (LD) SNPs, in two major steps. The first step is to use xSNP2GeneScores for defining and scoring nearby genes that are located within a distance window of input and/or LD SNPs. The second step is to use xSubneterGenes for identifying a maximum-scoring gene subnetwork.
xGR2GeneScores: takes as input a list of user-defined genomic regions (GR) with the significance level (eg p-values), defines nearby genes, and scores them taking into account the distance to and the significance of input GR.
xSubneterGR: identifies a gene subnetwork that is likely modulated by input genomic regions (GR), in two major steps. The first step is to use xGR2GeneScores for defining and scoring nearby genes that are located within a distance window of input genomic regions. The second step is to use xSubneterGenes for identifying a maximum-scoring gene subnetwork.
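Scoring nearby genes from SNPs, as described for xSNP2GeneScores, combines two ingredients: how significant a SNP is and how far it lies from the gene. The Python sketch below shows one way these could be combined. It is not the package's scoring scheme: the linear distance-decay weight, the use of -log10(p) as the SNP score, taking the maximum over SNPs, and all names and coordinates are assumptions made for illustration.

```python
from math import log10


def distance_to_gene(pos, start, end):
    """Distance in bases from a SNP position to a gene body (0 if inside)."""
    if start <= pos <= end:
        return 0
    return min(abs(pos - start), abs(pos - end))


def score_genes(snps, genes, window=50000):
    """Score genes by nearby SNPs.

    snps:  dict id -> (chromosome, position, p_value)
    genes: dict symbol -> (chromosome, start, end)
    Each SNP within `window` bases contributes -log10(p) weighted by a
    linear decay (1 at distance 0, 0 at distance == window); a gene keeps
    the maximum contribution. Genes with no SNP in the window are omitted.
    """
    scores = {}
    for symbol, (g_chrom, start, end) in genes.items():
        for s_chrom, pos, pval in snps.values():
            if s_chrom != g_chrom:
                continue
            dist = distance_to_gene(pos, start, end)
            if dist > window:
                continue
            contribution = (1 - dist / window) * -log10(pval)
            scores[symbol] = max(scores.get(symbol, 0.0), contribution)
    return scores


# Toy input (hypothetical SNPs and gene coordinates).
snps = {"rs1": ("chr1", 1000, 1e-8), "rs2": ("chr1", 30000, 1e-4)}
genes = {"G1": ("chr1", 900, 2000), "G2": ("chr1", 26000, 28000), "G3": ("chr2", 1000, 2000)}
print(score_genes(snps, genes, window=10000))
```

The same idea carries over to genomic regions in place of SNPs, with the distance measured between a region and the gene.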
Annotation functions interpret a user-defined list of genomic regions, either via nearby gene annotations by ontologies or via co-localised functional genomic annotations.
xGRviaGeneAnno: conducts region-based enrichment analysis using nearby gene annotations, in two major steps. The first step is to define nearby genes within the maximum distance gap between genomic regions and gene location. The second step is to use xEnricherGenes for enrichment analysis to identify enriched terms.
xGRviaGenomicAnno: conducts region-based enrichment analysis using functional genomic annotations. Enrichment analysis is based on a binomial test estimating the significance of overlaps at one of three levels of resolution. Genomic annotations cover a broad spectrum of genetic and epigenetic knowledge, including functional genomic data experimentally generated by consortia such as ENCODE (Bernstein et al. 2012), FANTOM5 (Forrest et al. 2014), BLUEPRINT Epigenome (Adams et al. 2012), TCGA (Kandoth et al. 2013) and Roadmap Epigenomics (Kundaje et al. 2015), and comparative genomic data predicted by computational methods.
Notably, the resolution of overlaps being tested can be:
bases at the base resolution (by default),
regions at the region resolution,
hybrid at the base-region hybrid resolution (that is, data at the region resolution but annotation/background at the base resolution).
Generally speaking, if genomic annotations are exclusive to each other, the resolution can be either […] or hybrid. If genomic annotations are somehow inclusive to each other, it is better to choose […] bases). If regions being analysed are SNPs, then the results are the same (irrespective of specified resolution).
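At the base resolution, the binomial test described for xGRviaGenomicAnno can be pictured as follows: every base in the input regions is a trial, a base that falls inside the annotation is a success, and the success probability is the fraction of background bases covered by the annotation. The Python sketch below follows that reading. It is not the package's implementation; the function names, the half-open interval convention and the single-chromosome toy data are assumptions.

```python
from math import exp, lgamma, log


def total_bases(intervals):
    """Total length of half-open (start, end) intervals that do not overlap each other."""
    return sum(end - start for start, end in intervals)


def overlap_bases(regions, annotation):
    """Bases shared between two interval lists (each list non-overlapping)."""
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in regions
        for s2, e2 in annotation
    )


def binomial_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p) with 0 < p < 1, summed in log-space."""
    def log_pmf(k):
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                + k * log(p) + (n - k) * log(1 - p))
    return min(1.0, sum(exp(log_pmf(k)) for k in range(x, n + 1)))


def base_resolution_test(data, annotation, background):
    """Return (overlapping bases, data bases, expected fraction, p-value)."""
    n = total_bases(data)
    x = overlap_bases(data, annotation)
    p = overlap_bases(annotation, background) / total_bases(background)
    return x, n, p, binomial_tail(x, n, p)


# Toy intervals on a single chromosome (hypothetical).
background = [(0, 1000)]
annotation = [(100, 200), (500, 600)]
data = [(150, 250), (580, 590)]
x, n, p, p_value = base_resolution_test(data, annotation, background)
print(x, n, p, p_value)
```

Treating each base as an independent trial is what makes the base resolution sensitive to region length; at the region resolution the trials would be whole regions instead.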
xGRviaGenomicAnnoAdv: conducts region-based enrichment analysis using functional genomic annotations. Enrichment analysis is achieved by comparing the observed overlaps against the expected overlaps, which are estimated from a null distribution. The null distribution is generated via sampling, that is, by randomly generating samples for data genomic regions from background genomic regions. Background genomic regions can be provided by the user; by default, the annotatable genomic regions are used. Since sampling is time-consuming, parallel computation is also supported on Unix-like systems.
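The sampling-based comparison of observed against expected overlaps can be sketched as a small permutation-style test. The Python below is an illustration under stated assumptions, not the package's procedure: each input region is re-placed uniformly inside a randomly chosen background interval that is long enough to hold it, the overlap is measured in bases, and the empirical p-value uses the common (1 + exceedances) / (1 + samples) estimate. All names and toy intervals are hypothetical.

```python
import random


def overlap_bases(regions, annotation):
    """Bases shared with the annotation, summed over the given regions."""
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in regions
        for s2, e2 in annotation
    )


def sample_regions(data, background, rng):
    """Re-place each data region (same length) at random within the background.

    Assumes every data region fits inside at least one background interval.
    """
    sampled = []
    for start, end in data:
        length = end - start
        candidates = [(s, e) for s, e in background if e - s >= length]
        s, e = rng.choice(candidates)
        new_start = rng.randint(s, e - length)
        sampled.append((new_start, new_start + length))
    return sampled


def empirical_pvalue(observed, null_values):
    """Fraction of null samples at least as extreme as the observation."""
    exceed = sum(1 for value in null_values if value >= observed)
    return (1 + exceed) / (1 + len(null_values))


def sampling_test(data, annotation, background, n_samples=1000, seed=1):
    """Return (observed overlap, mean null overlap, empirical p-value)."""
    rng = random.Random(seed)
    observed = overlap_bases(data, annotation)
    null = [
        overlap_bases(sample_regions(data, background, rng), annotation)
        for _ in range(n_samples)
    ]
    return observed, sum(null) / len(null), empirical_pvalue(observed, null)


# Toy intervals on a single chromosome (hypothetical).
background = [(0, 1000)]
annotation = [(100, 200), (500, 600)]
data = [(150, 250), (580, 590)]
observed, expected, p_value = sampling_test(data, annotation, background, n_samples=200)
print(observed, round(expected, 1), p_value)
```

Because each sample is independent of the others, the loop that builds the null distribution is the natural part to distribute across cores.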
Infrastructure functions handle the underlying infrastructure, including built-in data loading, ontology annotation propagation, calculation of term-term semantic similarity, graph conversions and visualisations, and defining and scoring genes likely modulated by SNPs.
xRDataLoader: serves as a hub for loading built-in data about genes, SNPs, ontologies and annotations.
xDAGanno: induces annotations to the ontology root according to the true-path rule.
xDAGpropagate: propagates annotations (together with numeric information such as p-values) to the ontology root according to the true-path rule.
xDAGsim: calculates semantic similarity between terms, and returns a network with nodes for terms and edges for pair-wise semantic similarity between terms.
xConverter: converts an object between graph classes.
xVisNet: visualises the graph in different layouts.
xVisKernels: visualises distance kernel functions.
xSNPscores: calculates scores for lead or LD SNPs.
xSNPlocations: extracts genomic locations for SNPs.
xSNP2nGenes: defines nearby genes from SNPs.
xSparseMatrix: creates a sparse matrix for an input file.
xSM2DF: creates a data frame (with three columns) from a (sparse) matrix.
xLiftOver: lifts genomic intervals from one genome build to another.
xGRsampling: generates random samples for data genomic regions from background genomic regions.
xColormap: defines a colormap (color palette).
xGRscores: calculates scores for input genomic regions (GR).
xGR2nGenes: defines nearby genes from input genomic regions (GR).
xGR: creates a GRanges object given a list of genomic regions (GR).
xCheckParallel: checks whether parallel computing should be used and how.
xSymbol2GeneID: converts gene symbols to Entrez GeneIDs.
xGeneID2Symbol: converts Entrez GeneIDs to gene symbols.
xDefineNet: defines a gene network sourced from the STRING database or the Pathway Commons database.
xSimplifyNet: simplifies a network by keeping root-tip shortest paths only.
xHeatmap: draws a heatmap.
Auxiliary functions provide supplementary support during package development, such as code debugging and documentation creation.
An essential step of data analysis is making sense of a gene (or SNP) list in a biologically meaningful way. Genes (and/or SNPs) may be identified from differential expression studies, GWAS and eQTL mappings. We showcase the applications using several published datasets. Users are encouraged to adapt the provided code to analyse their own datasets.
Demo Series 1: using the XGR package to interpret summary data resulting from differential expression studies
Demo Series 2: using the XGR package to interpret GWAS summary data
Demo Series 3: using the XGR package to interpret eQTL summary data
Below is the list of references that XGR stands on:
1000 Genomes Project Consortium. 2012. “An integrated map of genetic variation from 1,092 human genomes.” Nature 491 (7422): 56–65. https://doi.org/10.1038/nature11632.
Adams, D, L Altucci, Se Antonarakis, J Ballesteros, S Beck, A Bird, C Bock, et al. 2012. “BLUEPRINT to decode the epigenetic signature written in blood.” Nature Biotechnology 30 (3): 224–26. https://doi.org/10.1038/nbt.2153.
Ashburner, M, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, et al. 2000. “Gene ontology: tool for the unification of biology.” Nat Genet 25 (1): 25–29. https://doi.org/10.1038/75556.
Bernstein, Bradley E, Ewan Birney, Ian Dunham, Eric D Green, Chris Gunter, and Michael Snyder. 2012. “An integrated encyclopedia of DNA elements in the human genome.” Nature 489 (7414): 57–74. https://doi.org/10.1038/nature11247.
Cerami, E. G., B. E. Gross, E. Demir, I. Rodchenkov, O. Babur, N. Anwar, N. Schultz, G. D. Bader, and C. Sander. 2011. “Pathway Commons, a web resource for biological pathway data.” Nucleic Acids Research 39 (Database): D685–D690. https://doi.org/10.1093/nar/gkq1039.
Csardi, G, and T Nepusz. 2006. “The igraph software package for complex network research.” InterJournal Complex Systems 1695: 1695.
Fang, Hai, and Julian Gough. 2014. “The ’dnet’ approach promotes emerging research on cancer patient survival.” Genome Medicine 6 (8): 64. https://doi.org/10.1186/s13073-014-0064-8.
Forrest, Alistair R. R., Hideya Kawaji, Michael Rehli, J. Kenneth Baillie, Michiel J. L. de Hoon, Vanja Haberle, Timo Lassmann, et al. 2014. “A promoter-level mammalian expression atlas.” Nature 507 (7493): 462–70. https://doi.org/10.1038/nature13182.
Kandoth, Cyriac, Michael D McLellan, Fabio Vandin, Kai Ye, Beifang Niu, Charles Lu, Mingchao Xie, et al. 2013. “Mutational landscape and significance across 12 major cancer types.” Nature 502 (7471): 333–9. https://doi.org/10.1038/nature12634.
Köhler, Sebastian, Sandra C Doelken, Christopher J Mungall, Sebastian Bauer, Helen V Firth, Isabelle Bailleul-Forestier, Graeme C M Black, et al. 2013. “The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data.” Nucleic Acids Research 42 (Database issue): D966–74. https://doi.org/10.1093/nar/gkt1026.
Kundaje, Anshul, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, et al. 2015. “Integrative analysis of 111 reference human epigenomes.” Nature 518: 317–30. https://doi.org/10.1038/nature14248.
Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology 9 (8): 1–10. https://doi.org/10.1371/journal.pcbi.1003118.
Pesquita, C, D Faria, A O Falcao, P Lord, and F M Couto. 2009. “Semantic similarity in biomedical ontologies.” PLoS Comput Biol 5 (7): e1000443. https://doi.org/10.1371/journal.pcbi.1000443.
Schriml, L M, C Arze, S Nadendla, Y W Chang, M Mazaitis, V Felix, G Feng, and W A Kibbe. 2012. “Disease Ontology: a backbone for disease semantic integration.” Nucleic Acids Res 40 (Database issue): D940–6. https://doi.org/10.1093/nar/gkr972.
Smith, C L, and J T Eppig. 2009. “The Mammalian Phenotype Ontology: enabling robust annotation and comparative analysis.” Wiley Interdiscip Rev Syst Biol Med 1 (3): 390–99. https://doi.org/10.1002/wsbm.44.
Szklarczyk, Damian, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Davide Heller, Jaime Huerta-Cepas, Milan Simonovic, et al. 2015. “STRING v10: protein-protein interaction networks, integrated over the tree of life.” Nucleic Acids Res 43 (Database): D447–D452. https://doi.org/10.1093/nar/gku1003.
Welter, Danielle, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, et al. 2014. “The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.” Nucleic Acids Research 42 (D1): D1001–6. https://doi.org/10.1093/nar/gkt1229.