Computational Genomics at the University of Montana
The Wheeler lab designs algorithms and statistical methods for
sequence analysis in computational genomics, and develops
high-quality implementations of those methods for incorporation
into widely used software pipelines and web services. We are
particularly focused on the alignment of biological sequences
and the accompanying problem of searching for similar
sequences within large-scale biological sequence databases.
We fixate on speed (developing algorithmic and engineering
techniques to accelerate sequence alignment), sensitivity
(developing models to improve recognition of similarity between
highly divergent sequences, even in the face of sequencing error),
and reliability (developing approaches that reduce the
propensity of highly sensitive methods to make erroneous claims,
especially in the context of confounding factors).
This work largely relates to
profile hidden Markov models.
We apply our methods to the analysis and understanding of
interesting biological phenomena, particularly
transposable elements, conserved
non-coding regions, and surprising features of genomes such
as dual-coding exons.
The lab has open positions for motivated postdocs and PhD students.
Please drop me a line if you'd like to discuss possibilities.
15 years of PhosphoSitePlus®: integrating post-translationally modified sites, disease variants and isoforms (2018)
Hornbeck, P.V., Kornhauser, J.M., Latham, V., Murray, B., Nandhikonda, V., Nord, A., Skrzypek, E., Wheeler, T., Zhang, B., Gnad, F.
Nucleic Acids Research, gky1159.
ULTRA: A Model Based Tool to Detect Tandem Repeats (2018)
Olson, D. and Wheeler, T.J.
Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 37-46.
Splice-Aware Multiple Sequence Alignment of Protein Isoforms [MIRAGE] (2018)
Nord, A., Hornbeck, P., Carey, K., and Wheeler, T.J.
Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 200-210.
Integration of protein phosphorylation, acetylation, and methylation data sets to outline lung cancer signaling networks (2018)
Grimes, M., Hall, B., Foltz, L., Levy, T., Rikova, K., Gaiser, J., Cook, W.,
Smirnova, E., Wheeler, T., Clark, N.R., Lachmann, A., Zhang, B., Hornbeck, P., Maayan, A., and Comb, M.
Science Signaling, 11:eaaq1087.
Hubley, R., Finn, R.D., Clements, J., Eddy, S.R., Jones, T.A.,
Bao, W., Smit, A.F.A., and Wheeler, T.J.
Nucleic Acids Research, 46:gkv1272.
A call for benchmarking transposable element annotation methods. (2015)
Hoen, D.R., Hickey, G., Bourque, G., Casacuberta, J., Cordaux, R., Feschotte, C.,
Fiston-Lavier, A-S., Hua-Van, A., Hubley, R., Kapusta, A., Lerat, E., Maumus, F.,
Pollock, D.D., Quesneville, H., Smit, A., Wheeler, T.J., Bureau, T.E., Blanchette, M.
Mobile DNA, 6(1):1-9.
Estimating the accuracy of multiple alignments and its use in parameter advising. (2012)
DeBlasio, D., Wheeler, T., and Kececioglu, J.
Proceedings of the 16th Conference on Research in Computational Molecular Biology (RECOMB),
Springer-Verlag Lecture Notes in Bioinformatics, 7262: 45-59.
Complete nucleomorph genome sequence of the non-photosynthetic alga Cryptomonas paramecium
reveals a core nucleomorph gene set. (2010)
Tanifuji, G., Onodera, N.T., Wheeler, T.J., Dlutek, M., Donaher, N., and Archibald, J.M.
Genome Biology and Evolution, 3: 44-54.
Aligning protein sequences with predicted secondary structure. (2010)
Kececioglu, J., Kim, E., and Wheeler, T.
Journal of Computational Biology, 17(3): 561-580.
Proceedings of the 9th Workshop on Algorithms in Bioinformatics (WABI), 375-389.
Learning models for aligning protein sequence with predicted secondary structure. (2009)
Kim, E., Wheeler, T.J., and Kececioglu, J.D.
Proceedings of the 13th Conference on Research in Computational
Molecular Biology (RECOMB), Springer-Verlag Lecture Notes in Bioinformatics, 5541: 586-605.
Multiple alignment by aligning alignments. (2007)
Wheeler, T.J. and Kececioglu, J.D.
Proceedings of the 15th ISCB Conference on Intelligent Systems for Molecular Biology,
Bioinformatics, 23: i559-i568.
Adaptive protein evolution and regulatory divergence in Drosophila. (2006)
Good, J.M., Hayden, C.A., and Wheeler, T.J.
Molecular Biology and Evolution, 23(6): 1101-1103.
Evaluating and improving cDNA sequence quality with cQC. (2005)
A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence.
Hayden, C.A. and Wheeler, T.J.
Dfam: sustainable growth, curation support, and improved quality for mobile element annotation
Most of the vertebrate genome ultimately derives from transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. Currently, the data used for this annotation is dominated by a single database, Repbase, whose restrictive license impedes coalescence on a central standardized resource. We have developed an innovative open-access database (Dfam) that yields substantial gains in TE annotation quality and provides numerous novel features in support of community TE curation. We will grow and improve Dfam to create a sustainable, standardized, and open system for TE family data. To achieve this, we will develop Dfam’s infrastructure to scale to thousands of genomes, develop improved methods for computing and representing the complex relationships between TE instances, and develop a framework for aiding curators in developing and sharing their TE libraries.
Methods for fast bio-sequence comparison with profile hidden Markov models
[P20GM103546 NIH CoBRE]
With the continued explosive increase in genomic and metagenomic sequencing, the community requires effective and increasingly scalable methods to more fully decode, organize, and exploit sequence data. Accurate and complete annotation of a genomic dataset, based on sequence homology, is a critical first step in understanding its content. This annotation often boils down to sequence database search – the act of searching in a large sequence dataset to find sequences that are similar to known elements.
The challenges posed by increased sequencing activity are particularly acute in metagenomic datasets, which are both highly diverse and very large. The high level of diversity means that accurate annotation requires extremely sensitive sequence analysis methods, while the scale necessitates ever-increasing software speed. This proposal describes a plan to develop methods that will substantially improve the speed of sequence comparison with profile hidden Markov models, meeting the need for methods that are fast enough to accommodate large-scale databases, while still powerful enough to detect remote sequence similarity. We will implement these methods in the HMMER codebase, focusing on three complementary target optimizations. Specifically, the aims are:
- Index-based acceleration of the key filter stage of HMMER.
- Sparse completion of the Forward/Backward dynamic programming matrix.
- Acceleration with FPGA configurable hardware.
All of these aims emphasize speed, with the approaches hedging and complementing each other. Overall, we expect a speed gain of 10x or better, with little to no loss in sensitivity, even for very sensitive profile HMM searches. This improvement to annotation speed will fundamentally alter the way in which large-scale sequence datasets are annotated, resulting in greatly improved understanding of genomic datasets.
Improved protein-DNA models for translated sequence search with profile HMMs
Fast and sensitive sequence database search is fundamental to modern molecular biology. This proposal describes a research plan to improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and the ability to better model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.
The taxonomic breadth of sequenced datasets requires methods with the power to detect remote sequence similarity; raw data and sequencing errors demand models that recognize frameshifts and splice sites; and the massive scale of datasets demands that implementations be fast. My group will develop novel models for frameshifts and splice site detection in profile HMM homology search. Direct modeling of these features within search software effectively uses homology to guide ORF/gene prediction, which in turn leads to better homology detection. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Metagenomics Portal.