Multiple sequence alignment of datasets containing many thousands of sequences is a challenging problem with applications in phylogeny estimation, protein structure and function prediction, taxon identification of metagenomic data, and other areas. However, few methods can analyze large datasets, and none has been shown to achieve good accuracy on datasets with more than about 10,000 sequences, especially when the sequences have evolved under high rates of evolution.
In this talk, I will present a new method for obtaining highly accurate estimates of large-scale multiple sequence alignments and phylogenies. The basic idea is to use an ensemble of Hidden Markov Models (HMMs) to represent a "seed alignment", and then align all the remaining sequences to the seed alignment. Our method, UPP, returns very accurate alignments, and trees estimated on these alignments are also very accurate, even on datasets with as many as 1,000,000 sequences or datasets that contain many fragmentary sequences. Furthermore, UPP is both fast and very scalable: the analysis of the 1-million-taxon dataset took only 24 hours using 12 cores and small amounts of memory. Finally, this Ensemble of HMMs technique improves the accuracy of methods for other bioinformatics problems, including phylogenetic placement and taxon identification of metagenomic data.
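As a rough illustration of the ensemble idea only, and not of UPP's actual implementation, the Python sketch below builds one simple per-column profile for each subset of a toy seed alignment and picks the profile that scores a query sequence highest. Everything in it is an assumption made for illustration: the per-column profiles stand in for real profile HMMs (they model no insertions or deletions), the subsets are chosen by hand, the query is ungapped DNA of the same length as the alignment, and the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: DNA sequences without ambiguity codes


def build_profile(rows, pseudocount=1.0):
    """Per-column log-probabilities over ALPHABET for a set of aligned rows.

    A simplified stand-in for a profile HMM: match states only.
    """
    profile = []
    for column in zip(*rows):
        counts = Counter(ch for ch in column if ch in ALPHABET)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append(
            {a: math.log((counts[a] + pseudocount) / total) for a in ALPHABET}
        )
    return profile


def score(profile, query):
    """Log-likelihood of an ungapped query with one character per column."""
    return sum(column[ch] for column, ch in zip(profile, query))


def best_profile(profiles, query):
    """Index of the ensemble member that gives the query the highest score."""
    return max(range(len(profiles)), key=lambda i: score(profiles[i], query))


# Toy seed alignment, split by hand into two subsets (hypothetical data).
seed_subsets = [
    ["ACGTAC", "ACGTAC", "ACGAAC"],
    ["TTGTCC", "TTGTCA", "TTGTCC"],
]
ensemble = [build_profile(rows) for rows in seed_subsets]
choice = best_profile(ensemble, "TTGTCC")
```

In a full pipeline, the selected model would then be used to align the query to the seed alignment; that step is omitted here.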
Tandy Warnow is the Founder Professor of Bioengineering and Computer Science at the University of Illinois at Urbana-Champaign. Her research combines mathematics, computer science, and statistics to develop improved models and algorithms for reconstructing complex and large-scale evolutionary histories in both biology and historical linguistics. Tandy received her PhD in Mathematics from UC Berkeley under the direction of Gene Lawler, and did postdoctoral training with Simon Tavaré and Michael Waterman at USC. She received the National Science Foundation Young Investigator Award in 1994, the David and Lucile Packard Foundation Award in Science and Engineering in 1996, a Radcliffe Institute Fellowship in 2006, and a Guggenheim Foundation Fellowship for 2011. Her current research focuses on phylogeny and alignment estimation for very large datasets (10,000 to 1,000,000 sequences), estimating species trees and phylogenetic networks from collections of gene trees, and metagenomics.