
1. Motivation

Traditional functional relationship networks treat each gene as a single entity and ignore the isoforms that arise from alternative splicing. As a result, they cannot reveal functional relationships between isoforms.

The question we ask in our work is whether we can model functional relationships between isoforms and thereby build networks at the isoform level. As an example (see the figure on the right), both ENSA and FAM120B have two isoforms, but a traditional gene-level network can provide only a single probability for this gene pair (shaded box, left panel). At the isoform level, there are four possible isoform pairs in total, and their functional relationship strengths may differ. We are interested in modeling and predicting the functional relationships between all possible isoform pairs within a gene pair.
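The counting in the example above can be sketched in a few lines of Python. This is only an illustration: the isoform labels are hypothetical placeholders, not the identifiers used in the network.

```python
from itertools import product

def isoform_pairs(isoforms_a, isoforms_b):
    """Return every cross-gene isoform pair for one gene pair."""
    return list(product(isoforms_a, isoforms_b))

# Hypothetical isoform labels, used only for illustration.
ensa = ["ENSA_iso1", "ENSA_iso2"]
fam120b = ["FAM120B_iso1", "FAM120B_iso2"]

pairs = isoform_pairs(ensa, fam120b)
print(len(pairs))  # 2 isoforms x 2 isoforms = 4 candidate pairs
```

A gene-level network stores one probability for the gene pair, whereas an isoform-level network stores one probability for each element of `pairs`.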


The recent development of several high-throughput technologies provides rich isoform-level information, including RNA-seq data, predicted protein-protein docking scores and amino acid compositions. These data can be used to extract isoform-level feature data (x) for learning a model y = f(x). On the other hand, the 'gold standard' (y), which consists of known functional relationships, has historically been defined at the gene level, so traditional supervised learning methods cannot be applied directly to build isoform-level networks. Our solution to this challenge is a multiple instance learning (MIL) strategy that combines isoform-level feature data with existing gene-level functional relationships to build an isoform-level network.

2. Bayesian Network-based Multiple Instance Learning

We developed a Bayesian Network-based multiple instance learning (MIL) algorithm to establish functional relationship models at the isoform level.


Left figure: Illustration of multiple instance learning (MIL). Functional relationships are traditionally defined between gene pairs, without considering the relationships at the isoform pair level. When isoform pairs are taken into account, a gene pair is positive if and only if at least one of its isoform pairs is functionally related, and negative if none of its isoform pairs is functionally related.

A. Illustration of a functionally related gene pair: Gene A (with 3 isoforms) and Gene B (with 2 isoforms). In this case, there are 6 possible isoform pairs in total. Two of these isoform pairs are functionally related (solid red), whereas the other 4 have no functional relationship (dashed light blue).

B. Demonstration of a negative gene pair. None of the isoform pairs is functionally related.

C. In traditional classification problems, the positive examples, defined as known functionally related gene pairs, are separated from the negative examples (unrelated pairs) by a classifier.

D. In our strategy, gene pairs are considered 'bags', each of which contains one or more isoform pairs, defined as 'instances'. A positive bag (a co-functional gene pair) must have at least one instance (isoform pair) that is functionally related. These co-functional isoform pairs are called 'witnesses' (pairs in red). No instance (isoform pair) in a negative bag (an unrelated gene pair) may be functionally related. A classifier is trained under these constraints on positive and negative bags.
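The bag-labeling rule described in this caption can be written as a short Python sketch. The function name and the toy bags are illustrative assumptions; the bags only mirror the counts given for panels A and B.

```python
def gene_pair_is_positive(isoform_pair_related):
    """A bag (gene pair) is positive iff at least one instance
    (isoform pair) is functionally related."""
    return any(isoform_pair_related)

# Panel A: 3 x 2 = 6 isoform pairs, two of which are related (the witnesses).
panel_a = [True, False, False, True, False, False]
# Panel B: no isoform pair is related, so the gene pair is negative.
panel_b = [False] * 6

print(gene_pair_is_positive(panel_a))  # True
print(gene_pair_is_positive(panel_b))  # False
```

During training the instance-level flags are unknown; only the value returned for the whole bag is observed, which is why a standard supervised learner cannot be applied directly.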

Overview of our approach for predicting and validating the isoform-level functional relationship network. We first collected genomic features from different data sources, including RNA-seq, exon array, protein docking and pseudo amino acid composition data. For each dataset, pairwise values were calculated for isoform pairs. We generated the gene-level gold standard, which contains positive gene pairs (co-annotated to the same biological function or pathway) and negative gene pairs (not co-annotated to any function or pathway), using the Gene Ontology (GO), KEGG and BioCyc databases. For model development and validation, we partitioned the gold standard into two disjoint graphs, serving as the training set and the test set, respectively. Our MIL algorithm, which uses a Bayesian network classifier as its base learner, was run on the training data to build a classification model. In each iteration, only the 'witnesses' (red-colored pairs) in positive bags and the highest-scoring instance in each negative bag are used to train the model, so as to maximize its discriminative power. The classification model is therefore established at the isoform level rather than at the gene level. After convergence, the final classifier was used to predict the probability of functional interaction for the independent test set. Finally, we validated the accuracy of our model through simulation, cross-validation and biological examples.
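The iterative training step can be illustrated with a minimal Python sketch. It is not the implementation used for the network: a one-feature Gaussian naive Bayes model stands in for the Bayesian network base learner, the witness of a positive bag is assumed to be its highest-scoring instance, every instance is assumed to inherit its bag's label at initialisation, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_gaussian_nb(xs, ys):
    """Fit a one-feature Gaussian naive Bayes model (stand-in base learner)."""
    params = {}
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        params[label] = (mean(vals), max(pstdev(vals), 1e-3), len(vals) / len(xs))
    return params

def predict_proba(params, x):
    """Posterior probability that feature value x belongs to class 1."""
    def joint(label):
        mu, sd, prior = params[label]
        return prior * math.exp(-0.5 * ((x - mu) / sd) ** 2) / sd
    p0, p1 = joint(0), joint(1)
    return p1 / (p0 + p1) if p0 + p1 > 0 else 0.5

def train_mil(bags, max_iter=20):
    """bags: list of (bag_label, [feature value per instance]).
    Returns the fitted parameters and the selected instance index per bag."""
    # Assumed initialisation: every instance inherits its bag's label.
    xs = [x for _, inst in bags for x in inst]
    ys = [label for label, inst in bags for _ in inst]
    params = fit_gaussian_nb(xs, ys)
    chosen = None
    for _ in range(max_iter):
        # Keep only the highest-scoring instance of each bag.
        new_chosen = [
            max(range(len(inst)), key=lambda i: predict_proba(params, inst[i]))
            for _, inst in bags
        ]
        if new_chosen == chosen:  # selection is stable: treat as converged
            break
        chosen = new_chosen
        xs = [inst[i] for (_, inst), i in zip(bags, chosen)]
        ys = [label for label, _ in bags]
        params = fit_gaussian_nb(xs, ys)
    return params, chosen

# Toy data: each positive bag hides one high-valued witness.
bags = [
    (1, [0.9, 0.1, 0.2]), (1, [0.2, 0.8]), (1, [0.85, 0.15]),
    (0, [0.1, 0.2]), (0, [0.15, 0.05, 0.25]), (0, [0.3]),
]
params, chosen = train_mil(bags)
print(chosen[:3])  # selected witness index in each positive bag
```

Because only one instance per bag is kept in each round, the non-witness instances of positive bags no longer blur the positive class, and the resulting classifier scores individual isoform pairs rather than gene pairs.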


3. FAQ

3.1 Which reference genome was used for building the isoform-level network?

We used the NCBI reference (gene build 37.2) and the corresponding gene annotation file to calculate isoform expression and build the isoform-level network. This reference represents the validated isoforms.

3.2 How can I find out which gene an isoform belongs to in the isoform-level network?

Move your mouse over the isoform node you are interested in, and the gene name will be displayed.

3.3 Why can I identify only a limited number of functionally connected isoforms/genes on the web server?

Because we predicted functional relationships for all possible isoform pairs, the resulting isoform network is very large, with around 400 million pairs. To keep network queries fast, we stored, for each isoform, only its top 25 neighbors with the highest probabilities. For visualization, each query displays only the links between the query and its top 25 neighbors, together with the links among those neighbors.
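The pruning and display rules described above can be sketched as follows. This is a minimal illustration rather than the server's code: the edge dictionary, the function names and the node labels are hypothetical, and for simplicity the between-neighbor links are read from the full edge list.

```python
import heapq

def top_neighbors(edges, k=25):
    """edges: dict mapping an undirected pair (a, b) to a probability.
    Returns, for each node, its k neighbors with the highest probability."""
    adjacency = {}
    for (a, b), p in edges.items():
        adjacency.setdefault(a, []).append((p, b))
        adjacency.setdefault(b, []).append((p, a))
    return {
        node: [name for _, name in heapq.nlargest(k, nbrs)]
        for node, nbrs in adjacency.items()
    }

def query_subnetwork(edges, query, k=25):
    """Links from the query to its top-k neighbors, plus links among them."""
    keep = set(top_neighbors(edges, k).get(query, []))
    query_links = {
        pair: p for pair, p in edges.items()
        if query in pair and (set(pair) - {query}) <= keep
    }
    neighbor_links = {
        pair: p for pair, p in edges.items()
        if pair[0] in keep and pair[1] in keep
    }
    return query_links, neighbor_links

# Toy network with k=2 instead of 25 to keep the example small.
edges = {("q", "a"): 0.9, ("q", "b"): 0.8, ("q", "c"): 0.1,
         ("a", "b"): 0.7, ("a", "c"): 0.6}
query_links, neighbor_links = query_subnetwork(edges, "q", k=2)
print(query_links)
print(neighbor_links)
```

With `k=2`, node `c` falls outside the query's top neighbors, so neither its link to the query nor its link to `a` is displayed.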

3.4 What tools are used for displaying the networks?

The networks are displayed mainly with D3 (Data-Driven Documents), a JavaScript library.

3.5 Why do the links below the threshold still show up when I use the slider in the visualization page to adjust the network?

The probability threshold applies only to links between neighbors. Links between your query and its neighbors are always shown, regardless of the threshold.
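The slider behavior can be summarised with a small Python sketch. The function name and the toy links are hypothetical, and whether a link exactly at the threshold is kept is an assumption.

```python
def visible_links(query_links, neighbor_links, threshold):
    """Query-to-neighbor links are always shown; between-neighbor links
    are shown only if their probability reaches the threshold (assumed >=)."""
    shown = dict(query_links)
    shown.update({pair: p for pair, p in neighbor_links.items() if p >= threshold})
    return shown

query_links = {("q", "a"): 0.9, ("q", "b"): 0.2}
neighbor_links = {("a", "b"): 0.7, ("a", "c"): 0.3}

# The low-probability query link ("q", "b") survives; ("a", "c") is hidden.
print(visible_links(query_links, neighbor_links, threshold=0.5))
```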

3.6 What do the last two columns mean in the .txt file of network data and top connections downloaded from the network visualization page?

They give the number of isoforms of each of the two genes.

3.7 Can I save the isoform network graph as a high-quality figure?

In a browser such as Chrome or Firefox, choose File->Print and then select "Save as PDF" to save the page as a high-quality figure.


4. Contact

If you have questions, please contact Dr. Yuanfang Guan (gyuanfan@umich.edu), Dr. Hongdong Li (hongdong@umich.edu) or Dr. Gilbert Omenn (gomenn@med.umich.edu).


5. Reference

Hong-Dong Li, Rajasree Menon, Ridvan Eksi, Aysam Guerler, Gilbert S. Omenn, Yuanfang Guan. Modeling the functional relationship network at the isoform level through heterogeneous data integration.