Phylogenomic analysis of a protein sequence
The RIO server performs phylogenomic analyses of
protein sequences.
The input consists of an amino acid query sequence, the species of the query, and
a Pfam domain of the query.
The output is a list of sequences in the Pfam domain alignment ordered according
to a bootstrap confidence value for being orthologous toward the query (default setting).
Besides bootstrap values based on orthology, values based on "subtree-neighborings",
"super-orthology", and "ultra-paralogy" are calculated as well (see below
for definitions of these concepts).
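The ranking described above can be sketched in a few lines. This is only an illustration, assuming that the confidence value of a sequence is the percentage of bootstrap-resampled gene trees in which it was inferred to be orthologous to the query; the function name and the input data are hypothetical and are not part of RIO.

```python
from collections import Counter


def bootstrap_support(ortholog_sets):
    """Rank sequences by the percentage of resampled trees in which
    they were inferred to be orthologous to the query.

    ortholog_sets: one set of sequence names per resampled gene tree.
    """
    counts = Counter()
    for orthologs in ortholog_sets:
        counts.update(set(orthologs))
    n_trees = len(ortholog_sets)
    ranked = [(name, 100.0 * c / n_trees) for name, c in counts.items()]
    # Highest support first; ties broken alphabetically by name.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical result of four resampled trees.
per_tree_orthologs = [{"A", "B"}, {"A"}, {"A", "B"}, {"A", "C"}]
ranking = bootstrap_support(per_tree_orthologs)
for name, support in ranking:
    print(name, support)
```

The same counting would apply to the other three relations, with one set per tree for each relation.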
In addition, RIO can display the output as an annotated gene tree, using our tree
viewer ATV. (The button to start
the ATV applet is at the very end of the output.)
Neighbor joining gene trees are calculated based on (bootstrap resampled) precalculated
pairwise ML distances together with ML distances to the query sequence.
Duplications are inferred by comparing the gene trees to a trusted species tree
using our SDI algorithm.
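As an illustration of how duplications can be inferred from a gene tree and a species tree, here is a minimal sketch of the common mapping approach: each gene tree node is mapped to the lowest common ancestor, in the species tree, of the species found below it, and a node is called a duplication if it maps to the same species tree node as one of its children. This is a simplified sketch under that assumption, not the SDI implementation itself; the `Node` class and all names are hypothetical.

```python
class Node:
    """Node of a rooted binary tree (used for gene and species trees)."""

    def __init__(self, name=None, children=(), species=None):
        self.name = name
        self.children = list(children)
        self.species = species        # species code, gene tree leaves only
        self.parent = None
        self.is_duplication = None    # set on internal gene tree nodes
        for child in self.children:
            child.parent = self


def depth(node):
    d = 0
    while node.parent is not None:
        node = node.parent
        d += 1
    return d


def lca(a, b):
    """Lowest common ancestor of two nodes of the same tree."""
    da, db = depth(a), depth(b)
    while da > db:
        a, da = a.parent, da - 1
    while db > da:
        b, db = b.parent, db - 1
    while a is not b:
        a, b = a.parent, b.parent
    return a


def infer_duplications(gene_node, species_leaf):
    """Post-order traversal; returns the species tree node mapped to
    gene_node and labels every internal gene tree node on the way."""
    if not gene_node.children:
        return species_leaf[gene_node.species]
    left, right = (infer_duplications(c, species_leaf)
                   for c in gene_node.children)
    mapped = lca(left, right)
    gene_node.is_duplication = mapped is left or mapped is right
    return mapped


# Species tree ((HUMAN,MOUSE),YEAST)
human, mouse, yeast = Node("HUMAN"), Node("MOUSE"), Node("YEAST")
species_root = Node(children=[Node(children=[human, mouse]), yeast])
species_leaf = {"HUMAN": human, "MOUSE": mouse, "YEAST": yeast}

# Hypothetical gene tree ((h1,m1),(h2,y1))
g_left = Node(children=[Node("h1", species="HUMAN"),
                        Node("m1", species="MOUSE")])
g_right = Node(children=[Node("h2", species="HUMAN"),
                         Node("y1", species="YEAST")])
g_root = Node(children=[g_left, g_right])
infer_duplications(g_root, species_leaf)
```

In this example the root of the gene tree is labeled a duplication, because HUMAN occurs on both sides of it, while the two inner nodes are labeled speciations.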
This example shows how to use the RIO server to analyze a protein sequence, in this case the 3-isopropylmalate dehydratase large subunit (EC 4.2.1.33) from Haemophilus influenzae. The following three fields need to be filled in.
"protein sequence query": paste this sequence into the field (the FASTA ">" line will be ignored):
> sp|P44968|LEU2_HAEIN 3-isopropylmalate dehydratase large subunit (EC 4.2.1.33) (Isopropylmalate isomerase) (Alpha-IPM isomerase) (IPMI) - Haemophilus influenzae.
AKTLYEKLFDSHIVYEAEGETPILYINRHLIHEVTSPQAFDGLRVANRQVRQVNKTFGTM
DHSISTQVRDVNKLEGQAKIQVLELDKNTKATGIKLFDITTKEQGIVHVMGPEQGLTLPG
MTIVCGDSHTATHGAFGALAFGIGTSEVEHVLATQTLKQARAKSMKIEVRGKVASGITAK
DIILAIIGKTTMAGGTGHVVEFCGEAIQDLSMEGRMTVCNMAIEMGAKAGLIAPDETTFA
YLKDRPHAPKGKDWEDAVAYWKTLKSDDDAQFDTVVTLEAKDIAPQVTWGTNPGQVISVN
ETIPNPQEMADPVQRASAEKALHYIGLEAGTNLKDIKVDQVFIGSCTNSRIEDLRAAAAV
MKGRKKADNVKRILVVPGSGLVKEQAEKEGLDKIFIAAGAEWRNPGCSMCLGMNDDRLGE
WERCASTSNRNFEGRQGRNGRTHLVSPAMAAAAGVFGKFVDIRDVTLN
 
"species of query sequence": RIO uses SWISS-PROT codes to identify species. The code for Haemophilus influenzae is:
HAEIN
 
"pfam domain name": RIO needs to know against which Pfam domain alignment the query sequence should be analyzed. This domain must be present in the query sequence (as determined by an hmmsearch analysis). If this field is left empty, the resulting error message provides a link to run hmmsearch on the query sequence. For the example sequence, the domain is:
Aconitase
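For users who prepare the three inputs with a script before pasting them into the form, the following sketch shows one way to strip the FASTA ">" line and collect the three values. It is an assumption-laden illustration only: the function name and the dictionary keys are hypothetical and do not correspond to any documented interface of the RIO server, and only the first 20 residues of the example sequence are used.

```python
def strip_fasta_header(text):
    """Return the residues of a single FASTA record as one string,
    dropping the '>' description line and all whitespace."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(l for l in lines if l and not l.startswith(">")).upper()


fasta_text = """> sp|P44968|LEU2_HAEIN
AKTLYEKLFD
SHIVYEAEGE
"""

# Hypothetical container for the three form fields.
query = {
    "sequence": strip_fasta_header(fasta_text),
    "species": "HAEIN",          # SWISS-PROT species code
    "pfam_domain": "Aconitase",  # Pfam domain name
}
print(query["sequence"])
```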
 
super-orthologs: Given a completely binary and rooted gene tree with duplication or speciation assigned to each of its internal nodes, two sequences are defined as super-orthologous to each other if and only if each internal node on their connecting path represents a speciation event.
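The definition of super-orthologs can be checked directly on a labeled tree. The sketch below assumes a hypothetical `Node` class with parent pointers and a boolean `duplication` flag on internal nodes, and two distinct leaves of the same tree; it is an illustration of the definition, not code from RIO.

```python
class Node:
    """Gene tree node; duplication is True for a duplication and
    False for a speciation (ignored for leaves)."""

    def __init__(self, name=None, children=(), duplication=False):
        self.name = name
        self.children = list(children)
        self.duplication = duplication
        self.parent = None
        for child in self.children:
            child.parent = self


def connecting_internal_nodes(a, b):
    """Internal nodes on the path between two distinct leaves,
    including their lowest common ancestor."""
    ancestors_a = []
    node = a.parent
    while node is not None:
        ancestors_a.append(node)
        node = node.parent
    below_lca_on_b_side = []
    node = b.parent
    while node not in ancestors_a:
        below_lca_on_b_side.append(node)
        node = node.parent
    lca_index = ancestors_a.index(node)
    return ancestors_a[:lca_index + 1] + below_lca_on_b_side


def are_super_orthologs(a, b):
    return all(not n.duplication for n in connecting_internal_nodes(a, b))


# Hypothetical gene tree ((h1,m1)S,(h2,y1)S)D
h1, m1, h2, y1 = Node("h1"), Node("m1"), Node("h2"), Node("y1")
left = Node(children=[h1, m1], duplication=False)
right = Node(children=[h2, y1], duplication=False)
root = Node(children=[left, right], duplication=True)

print(are_super_orthologs(h1, m1))  # path passes only through a speciation
print(are_super_orthologs(h1, h2))  # path passes through the duplication
```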
ultra-paralogs: Given a completely binary and rooted gene tree with duplication or speciation assigned to each of its internal nodes, two sequences are defined as ultra-paralogous to each other if and only if the smallest subtree containing them both contains only internal nodes representing duplications.
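The ultra-paralog definition differs from the super-ortholog definition in that it looks at every internal node of the smallest common subtree, not only at the nodes on the connecting path. A minimal sketch, again assuming a hypothetical `Node` class with parent pointers and a `duplication` flag, and two distinct leaves:

```python
class Node:
    """Gene tree node; duplication is True for a duplication and
    False for a speciation (ignored for leaves)."""

    def __init__(self, name=None, children=(), duplication=False):
        self.name = name
        self.children = list(children)
        self.duplication = duplication
        self.parent = None
        for child in self.children:
            child.parent = self


def lowest_common_ancestor(a, b):
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(id(node))
        node = node.parent
    node = b
    while id(node) not in ancestors:
        node = node.parent
    return node


def internal_nodes(node):
    """Yield every internal node of the subtree rooted at node."""
    if node.children:
        yield node
        for child in node.children:
            yield from internal_nodes(child)


def are_ultra_paralogs(a, b):
    subtree_root = lowest_common_ancestor(a, b)
    return all(n.duplication for n in internal_nodes(subtree_root))


# Hypothetical gene tree (((a1,a2)D,a3)D,b1)S
a1, a2, a3, b1 = Node("a1"), Node("a2"), Node("a3"), Node("b1")
inner = Node(children=[a1, a2], duplication=True)
outer = Node(children=[inner, a3], duplication=True)
root = Node(children=[outer, b1], duplication=False)

print(are_ultra_paralogs(a1, a3))  # subtree contains duplications only
print(are_ultra_paralogs(a1, b1))  # subtree root is a speciation
```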
subtree-neighbors: Given a completely binary and rooted gene tree, the k-subtree-neighbors of a sequence q are defined as all sequences derived from the k-level parent node of q, except q itself (the level of q itself is 0, q's parent is 1, and so forth). The default value of k is 2.
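The k-subtree-neighbors of a sequence can be collected by walking k levels up from the leaf and listing all other leaves below that node. The sketch below uses a hypothetical `Node` class; stopping at the root when k exceeds the depth of q is an assumption made here, since the definition above does not cover that case.

```python
class Node:
    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaves(node):
    """Yield the leaves below node, left to right."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def subtree_neighbors(q, k=2):
    """All leaves derived from the k-level parent of q, except q."""
    node = q
    for _ in range(k):
        if node.parent is None:  # assumption: stop at the root
            break
        node = node.parent
    return [leaf for leaf in leaves(node) if leaf is not q]


# Hypothetical gene tree (((q,a),b),c)
q, a, b, c = Node("q"), Node("a"), Node("b"), Node("c")
level1 = Node(children=[q, a])
level2 = Node(children=[level1, b])
root = Node(children=[level2, c])

for k in (1, 2, 3):
    print(k, [leaf.name for leaf in subtree_neighbors(q, k)])
```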
Inspect the currently used species tree with ATV or download it as an NHX file.
The advanced options include uploading a species tree.
If the species of the query sequence is not present in the species tree used by the RIO server (or if the user prefers to use a different tree), a species tree in NHX format can be uploaded. An example of a species tree in NHX format:
((([&&NHX:S=HUMAN],[&&NHX:S=MOUSE]),[&&NHX:S=YEAST]),[&&NHX:S=ECOLI])
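Before uploading a species tree, it can be useful to check which species codes it contains. The following sketch extracts the `S=` tags from an NHX string with a regular expression. It is not a full NHX parser and assumes that every species is given in an `[&&NHX:...]` comment as in the example above; the function name is hypothetical.

```python
import re

# Matches the value of the S= tag inside an [&&NHX:...] comment,
# allowing other colon-separated tags to precede it.
SPECIES_TAG = re.compile(r"\[&&NHX(?::[^:\]]*)*?:S=([^:\]]+)")


def species_codes(nhx_text):
    """Return the species codes of an NHX tree in order of appearance."""
    return SPECIES_TAG.findall(nhx_text)


example_tree = (
    "((([&&NHX:S=HUMAN],[&&NHX:S=MOUSE]),[&&NHX:S=YEAST]),[&&NHX:S=ECOLI])"
)
codes = species_codes(example_tree)
print(codes)

# The species of the example query, HAEIN, is not in this small tree,
# so a different species tree would be needed for that query.
print("HAEIN" in codes)
```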
RIO: Zmasek C.M. and Eddy S.R. (2002) RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics, 3:14. Software available at http://www.genetics.wustl.edu/eddy/forester/
Speciation Duplication Inference: Zmasek C.M. and Eddy S.R. (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, 17, 821-828.
ATV: Zmasek C.M. and Eddy S.R. (2001) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics, 17, 383-384.
Background: Zmasek C.M. (2002) Functional analyses of proteomes by phylogenetic methods. Dissertation at Washington University.
Email: zmasek@genetics.wustl.edu
WWW: http://www.genetics.wustl.edu/eddy/people/zmasek/
Christian Zmasek
last updated 04/19/02