RIO | Resampled Inference of Orthologs | Help


Overview

The RIO server performs phylogenomic analyses of protein sequences.
The input consists of an amino acid query sequence, the species of the query, and a Pfam domain of the query.
The output is a list of sequences in the Pfam domain alignment ordered according to a bootstrap confidence value for being orthologous toward the query (default setting). Besides bootstrap values based on orthology, values based on "subtree-neighborings", "super-orthology", and "ultra-paralogy" are calculated as well (see below for definitions of these concepts).
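The default ordering can be pictured as a simple count over bootstrap replicates. The sketch below assumes that, for each resampled gene tree, the set of sequences inferred to be orthologous to the query is already available, and that the confidence value is the percentage of replicates in which a sequence appears in that set. The function name and the sequence names are made up for illustration; this is not the server's code.

```python
from collections import Counter

def orthology_support(replicate_orthologs):
    """Rank sequences by the percentage of bootstrap replicates in which
    they were inferred to be orthologous to the query (highest first)."""
    n = len(replicate_orthologs)
    counts = Counter(seq for orthologs in replicate_orthologs
                     for seq in set(orthologs))
    # Sort by descending count, then by name so ties are ordered stably.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(seq, 100.0 * count / n) for seq, count in ranked]

# Hypothetical result of four bootstrap replicates.
replicates = [{"SEQ_A", "SEQ_B"}, {"SEQ_A"}, {"SEQ_A", "SEQ_C"}, {"SEQ_A", "SEQ_B"}]
ranking = orthology_support(replicates)
```

The same counting would apply to the other three measures, with the per-replicate sets defined by the corresponding relationship instead of orthology.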
In addition, RIO can present the output in the form of an annotated gene tree, using our tree viewer ATV. (The button to start the ATV applet is at the very end of the output.)
Neighbor joining gene trees are calculated from (bootstrap-resampled) precalculated pairwise maximum likelihood (ML) distances together with ML distances to the query sequence.
Duplications are inferred by comparing the gene trees to a trusted species tree using our SDI (Speciation Duplication Inference) algorithm.
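A minimal sketch of how duplications can be inferred by comparing a gene tree to a species tree. It assumes the common last-common-ancestor mapping: each gene tree node is mapped to the smallest species tree clade that contains all species below it, and a node is called a duplication if it maps to the same clade as one of its children. The nested-tuple trees, the "GENE_SPECIES" leaf naming, and the function names are illustrative assumptions and are not taken from the forester code.

```python
# Trees are nested tuples; leaves are strings.
# Species tree leaves are species codes; gene tree leaves are "GENE_SPECIES".

def species_sets(tree, out):
    """Append the set of species below every species-tree node to `out`."""
    if isinstance(tree, str):
        s = frozenset([tree])
    else:
        s = frozenset().union(*(species_sets(c, out) for c in tree))
    out.append(s)
    return s

def lca(clades, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clades if species <= c), key=len)

def infer_events(gene_tree, clades, events):
    """Map a gene-tree node to a species-tree clade and record, in
    post-order, whether each internal node is a duplication or speciation."""
    if isinstance(gene_tree, str):
        return lca(clades, frozenset([gene_tree.split("_")[1]]))
    child_maps = [infer_events(c, clades, events) for c in gene_tree]
    m = lca(clades, frozenset().union(*child_maps))
    # Duplication: the node maps to the same clade as one of its children.
    kind = "duplication" if m in child_maps else "speciation"
    events.append((kind, sorted(m)))
    return m

species_tree = ((("HUMAN", "MOUSE"), "YEAST"), "ECOLI")
gene_tree = (("a_HUMAN", "b_MOUSE"), ("c_HUMAN", "d_YEAST"))  # hypothetical genes

clades = []
species_sets(species_tree, clades)
events = []
infer_events(gene_tree, clades, events)
```

In this example the two inner nodes are speciations, while the root maps to the same clade as its right child and is therefore recorded as a duplication.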

Example

This example shows how to use the RIO server to analyze a protein sequence, in this case the 3-isopropylmalate dehydratase large subunit (EC 4.2.1.33) from Haemophilus influenzae. The following three fields need to be filled in.

"protein sequence query": paste this sequence into the field "protein sequence query" (the fasta ">" line will be ignored):

> sp|P44968|LEU2_HAEIN 3-isopropylmalate dehydratase large subunit (EC 4.2.1.33) (Isopropylmalate isomerase) (Alpha-IPM isomerase) (IPMI) - Haemophilus influenzae.
AKTLYEKLFDSHIVYEAEGETPILYINRHLIHEVTSPQAFDGLRVANRQVRQVNKTFGTM
DHSISTQVRDVNKLEGQAKIQVLELDKNTKATGIKLFDITTKEQGIVHVMGPEQGLTLPG
MTIVCGDSHTATHGAFGALAFGIGTSEVEHVLATQTLKQARAKSMKIEVRGKVASGITAK
DIILAIIGKTTMAGGTGHVVEFCGEAIQDLSMEGRMTVCNMAIEMGAKAGLIAPDETTFA
YLKDRPHAPKGKDWEDAVAYWKTLKSDDDAQFDTVVTLEAKDIAPQVTWGTNPGQVISVN
ETIPNPQEMADPVQRASAEKALHYIGLEAGTNLKDIKVDQVFIGSCTNSRIEDLRAAAAV
MKGRKKADNVKRILVVPGSGLVKEQAEKEGLDKIFIAAGAEWRNPGCSMCLGMNDDRLGE
WERCASTSNRNFEGRQGRNGRTHLVSPAMAAAAGVFGKFVDIRDVTLN
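Because the FASTA ">" line is ignored, the pasted text reduces to the residues alone. A minimal sketch of that clean-up, assuming header lines start with ">" and that whitespace inside the sequence carries no meaning (the function name is hypothetical, and the server's own handling may differ):

```python
def strip_fasta(text):
    """Drop FASTA header lines and join the remaining residues into one string."""
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the ">" description line
        residues.append("".join(line.split()))  # remove any inner whitespace
    return "".join(residues)

example = "> sp|P44968|LEU2_HAEIN example header\nAKTLYEKLFD\nSHIVYEAEGE\n"
sequence = strip_fasta(example)
```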

 

"species of query sequence": RIO uses SWISS-PROT codes to identify species, the code for Haemophilus influenzae is:

HAEIN

 

"pfam domain name": RIO needs to know against which Pfam domain alignment the query sequence should be analyzed. Of course, this domain needs to be present in the query sequence (as determined by a hmmsearch analysis). If this field is left empty, the resulting error message provides a link to run hmmsearch on the query sequence. For the example sequence, the domain is:

Aconitase

 

Definitions

super-orthologs: Given a completely binary and rooted gene tree with duplication or speciation assigned to each of its internal nodes, two sequences are defined as super-orthologous toward each other if and only if each internal node on their connecting path represents a speciation event.

ultra-paralogs: Given a completely binary and rooted gene tree with duplication or speciation assigned to each of its internal nodes, two sequences are defined as ultra-paralogous toward each other if and only if the smallest subtree containing them both contains only internal nodes representing duplications.
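The two definitions above can be checked directly on a small tree. The sketch below is only an illustration: the Node class, the event labels "speciation" and "duplication", and the example tree are assumptions made for this example, not part of RIO.

```python
class Node:
    """Gene-tree node; leaves have a name, internal nodes have an event."""
    def __init__(self, name=None, event=None, children=()):
        self.name, self.event, self.children = name, event, list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(node):
    """Internal nodes from the parent of `node` up to the root."""
    out = []
    while node.parent is not None:
        node = node.parent
        out.append(node)
    return out

def lca(a, b):
    """Last common ancestor of two leaves."""
    anc_a = ancestors(a)
    return next(n for n in ancestors(b) if n in anc_a)

def connecting_path(a, b):
    """Internal nodes on the path between two leaves, including their LCA."""
    top, up_a, up_b = lca(a, b), ancestors(a), ancestors(b)
    return up_a[:up_a.index(top) + 1] + up_b[:up_b.index(top)]

def internal_nodes(root):
    """All internal nodes of the subtree rooted at `root`."""
    if not root.children:
        return []
    return [root] + [n for c in root.children for n in internal_nodes(c)]

def is_super_ortholog(a, b):
    return all(n.event == "speciation" for n in connecting_path(a, b))

def is_ultra_paralog(a, b):
    return all(n.event == "duplication" for n in internal_nodes(lca(a, b)))

# Hypothetical tree: speciation root over (q, x) and a duplicated pair (y1, y2).
q, x, y1, y2 = Node("q"), Node("x"), Node("y1"), Node("y2")
s1 = Node(event="speciation", children=[q, x])
d1 = Node(event="duplication", children=[y1, y2])
root = Node(event="speciation", children=[s1, d1])
```

Here q and x are super-orthologs, and y1 and y2 are ultra-paralogs. The pair q and y1 is separated by a speciation at the root, yet is not super-orthologous, because the duplication node above y1 lies on their connecting path.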

subtree-neighbors: Given a completely binary and rooted gene tree, the k-subtree-neighbors of a sequence q are defined as all sequences derived from the k-level parent node of q, except q itself (the level of q itself is 0, q's parent is 1, and so forth). The default value of k is 2.
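A short sketch of the k-subtree-neighbors definition, using nested tuples as a stand-in for a rooted binary gene tree. The function names are hypothetical. The definition does not say what happens when k exceeds the depth of q, so the sketch assumes the root is used in that case.

```python
def path_to(tree, q):
    """Nodes from the root down to leaf q, or None if q is not in the tree."""
    if tree == q:
        return [tree]
    if isinstance(tree, tuple):
        for child in tree:
            sub = path_to(child, q)
            if sub:
                return [tree] + sub
    return None

def leaves(tree):
    """All leaf names below a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def subtree_neighbors(tree, q, k=2):
    """All sequences derived from the k-level parent of q, except q itself."""
    path = path_to(tree, q)
    if path is None:
        raise ValueError("query %r is not in the tree" % q)
    # path[-1] is q (level 0), path[-2] its parent (level 1), and so forth.
    # Assumption: clamp to the root when k is larger than the depth of q.
    ancestor = path[max(len(path) - 1 - k, 0)]
    return [leaf for leaf in leaves(ancestor) if leaf != q]

tree = ((("q", "a"), "b"), ("c", "d"))
```

With this tree, k = 1 yields only a, the default k = 2 yields a and b, and k = 3 reaches the root and yields every other sequence.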

Species tree

Inspect the currently used species tree with ATV or download it as an NHX file.

Advanced options

The advanced options allow the user to upload a species tree in NHX format. This is useful if the species of the query sequence is not present in the species tree used by the RIO server, or if the user prefers to use a different tree. An example of a species tree in NHX format:

((([&&NHX:S=HUMAN],[&&NHX:S=MOUSE]),[&&NHX:S=YEAST]),[&&NHX:S=ECOLI])
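Before uploading a tree, it may help to confirm that it contains the species code of the query. The sketch below pulls the S= tags out of an NHX string with a regular expression. It is a rough check that assumes tags are written as in the example above, and it is not a full NHX parser. The function name is hypothetical.

```python
import re

def species_codes(nhx):
    """Return the S= species codes found in an NHX string, in order."""
    # Match "S=<code>" inside a [&&NHX:...] block; the code ends at ':' or ']'.
    return re.findall(r"\[&&NHX:[^\]]*?\bS=([^:\]]+)", nhx)

example = "((([&&NHX:S=HUMAN],[&&NHX:S=MOUSE]),[&&NHX:S=YEAST]),[&&NHX:S=ECOLI])"
codes = species_codes(example)
has_query_species = "HAEIN" in codes
```

For the example tree, has_query_species is False, so a query from Haemophilus influenzae (HAEIN) would need a different species tree.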

References

RIO: Zmasek C.M. and Eddy S.R. (2002) RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics, 3:14. Software available at http://www.genetics.wustl.edu/eddy/forester/

Speciation Duplication Inference: Zmasek C.M. and Eddy S.R. (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, 17, 821-828.

ATV: Zmasek C.M. and Eddy S.R. (2001) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics, 17, 383-384.

Background: Zmasek C.M. (2002) Functional analyses of proteomes by phylogenetic methods. Dissertation, Washington University.

Contact

Email: zmasek@genetics.wustl.edu

WWW: http://www.genetics.wustl.edu/eddy/people/zmasek/

Christian Zmasek

last updated 04/19/02