In May 2022, the lab moved to the Pharmacy Practice & Science Department at the University of Arizona, in Tucson. Awesome mountains, great weather, terrific research environment.
The Wheeler lab designs algorithms and statistical methods for problems motivated by biological data. We are particularly focused on the annotation of biological sequences and the accompanying problem of searching for similar sequences within large-scale biological sequence databases, but our work also addresses infectious disease, soil microbiomes, transposable elements, and neuroscience.
Projects in our group range from statistical modeling of biological sequence families, to text indexing and bounded search algorithms, to low-level software optimization, to FPGAs, to Deep Neural Networks, to Natural Language Processing, to web services; from genomes, to proteins, to multiomics, to drug discovery, to animal tracking and behavior.
We have open positions for motivated postdocs and PhD students. Please drop me a line if you’d like to learn more or discuss possibilities. I’m particularly interested in hearing from people who have ideas about problems that they’d like to tackle.
An open source workflow for virtually screening billions of molecules for binding affinity to protein targets. Venkatraman, V., Colligan, T.H., Lesica, G.T., Olson, D.R., Gaiser, J., Copeland, C., Wheeler, T.J., and Roy, A. 2022
Frameshift Aware Translated Hidden Markov Models for the Annotation of Protein Coding DNA (pronounced “framer”) Krause, G., Shands, W., and Wheeler, T.J. 2022
Pruned Forward-Backward implementation of profile HMM alignment. Rich, D., Lesica, G., and Wheeler, T.J. 2021
An Open Source Library for Visualizing Biological Sequence Annotation Roddy, J., Lesica, G., and Wheeler, T.J. 2021
An AVX2-accelerated FM-index library for hyper-fast string pattern matching in nucleotide and amino acid sequences. Open source C library. Anderson, T. and Wheeler, T.J. 2021
TE Hub is a place where researchers working on Transposable Elements (TEs) can catalog available online resources. It is organized as a collection of wiki pages, enabling community contribution and collaboration. 2020
A network interface for protein-protein interaction networks filtered based on post-translational modifications. Grimes, M., Hal, B., Foltz, L., Levy, T., Rikova, K., Gaiser, J., Cook, W., Smirnova, E., Wheeler, T.J., Clark, N.R. and Lachmann, A. 2018
A tool for locating and labeling tandemly-repetitive sequence. Olson, D. and Wheeler, T.J. 2018
A tool for splice-aware multiple sequence alignment. Nord, A. and Wheeler, T.J. 2018
A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. Wheeler, T.J., Clements, J., Finn, R.D. 2013
Biological sequence analysis using profile hidden Markov models. Eddy, S.R. and Wheeler, T.J. 2013
A DNA-DNA sequence homology search tool based on profile hidden Markov models, in the HMMER3 framework. Wheeler, T.J. and Eddy, S.R. 2012
A Database of Repetitive DNA Based on Profile Hidden Markov Models. Wheeler, T.J., Clements, J., Eddy, S.R., Hubley, R., Jones, T.A., Jurka, J., Smit, A.F.A., Finn, R.D. 2012
A Mesquite package for fast neighbor-joining phylogeny inference. Wheeler, T.J. and Maddison, D.R. 2010
Software for large-scale neighbor-joining phylogeny inference. Wheeler, T.J. 2009
A Mesquite package for multiple sequence alignment. Wheeler, T.J. and Maddison, D.R. 2009
Software for multiple sequence alignment by optimally aligning alignments. Wheeler, T.J. and Kececioglu, J.D. 2006
A Mesquite package for aligning sequence data. Maddison, D.R., Wheeler, T.J., and Maddison, W.P. 2006
Software for optimally aligning alignments. Starrett, D.M., Wheeler, T.J., and Kececioglu, J.D. 2005
A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence. Hayden, C.A. and Wheeler, T.J. 2005
NIH R01HG002939 (2022-2027)
(Multi-PI w/: Arian Smit and Robert Hubley @ Institute for Systems Biology)
Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and their annotation is crucial for genome sequence analysis and our understanding of TEs' unrivaled impact on genome biology and evolution. Their de novo discovery and description has become a bottleneck in the genome analysis of the thousands of new species sequenced every year. In this effort, we will make foundational changes to the way RepeatMasker adjudicates TE alignments and assigns confidence to annotations, develop two paths to improving the generation of new TE libraries through the use of multi-species genome alignments and ancestral reconstructions, and make core algorithmic changes to our RepeatModeler discovery tool.
NSF 1933305
We will develop algorithms for improved identification of peptides from tandem mass spectrometry datasets.
DOE DE-SC0021216 (2020-2023)
(Joint with Jason McDermott @ Pacific Northwest National Laboratory)
Communities of microbes in soil are key contributors to the plant-soil dynamic that supports production of food and fuel crops, for example driving nitrogen fixation, drought resistance, and nutrient cycling. The composition and interactions of these communities are of great importance, but these are often difficult to fully characterize due to challenges with sample acquisition, data processing, and community complexity and diversity. The effort supported by this grant will improve understanding of soil microbial communities through a combination of improved engineering for prototyped sequence annotation software, novel approaches in Deep Learning sequence annotation, and a new Bayesian method for integrating data from multiple high-throughput omics sources (particularly genomics and metabolomics).
NIH 1R01GM132600 (2019-2023)
Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. We will develop Machine Learning approaches to improve both accuracy and speed of highly-sensitive sequence alignment. To improve accuracy, we will develop methods based on both hidden Markov models and Artificial Neural Networks to reduce erroneous annotation caused by (1) the existence of low complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We also address the issue of annotation speed, with development of a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively-slow sequence-alignment step.
NIH 1R15MH117611 (2019-2022)
(PI: Nathan Insel @ University of Montana - Psychology)
The goals of this project relate to social cognition in degus (highly social rodents). The Wheeler lab's role involves developing machine learning methods for tracking multiple animals in video and classifying their behavior in those videos.
NIH 1U24HG010136-01 (2018-2023)
(PI: Arian Smit, co-PI: Robert Hubley @ Institute for Systems Biology)
Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. The goal of the proposed effort is to develop the infrastructure of Dfam to expand to 1000s of genomes, and to establish a self-sustaining TE Data Commons dependent on limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expanding approaches to visualization of this complex data type, and improving the modeling of TE subfamilies.
NIH 1R15HG009570-01 (2017-2020)
Fast and sensitive sequence database search is fundamental to modern molecular biology. This proposal describes a research plan to improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and ability to better model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.
The taxonomic breadth of sequenced datasets requires methods with the power to detect remote sequence similarity; raw data and sequencing errors demand models that recognize frameshifts and splice sites; and the massive scale of datasets demands that implementations be fast. My group will develop novel models for frameshift and splice site detection in profile HMM homology search. Direct modeling of these features within search software effectively uses homology to guide ORF/gene prediction, which in turn leads to better homology detection. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Metagenomics Portal.
P20GM103546 NIH CoBRE (2017-2020)
With the continued explosive increase in genomic and metagenomic sequencing, the community requires effective and increasingly scalable methods to more fully decode, organize, and exploit sequence data. Accurate and complete annotation of a genomic dataset, based on sequence homology, is a critical first step in understanding its content. This annotation often boils down to sequence database search – the act of searching in a large sequence dataset to find sequences that are similar to known elements.
We aim to develop methods that will substantially improve the speed of sequence comparison with profile hidden Markov models, meeting the need for methods that are fast enough to accommodate large-scale databases, while still powerful enough to detect remote sequence similarity. We will implement these methods in the HMMER codebase, focusing on three complementary target optimizations. Specifically, the aims are:
P20GM103546 NIH CoBRE (2016-2017) Pilot grant
Sequence comparison is fundamental to modern molecular biology. Much effort has been expended in the development of methods to make comparison faster and more sensitive. Though the risk of false annotation is understood, the extent and key causes have only been lightly addressed. Two primary sources of false annotation are (1) the overextension of alignments of true homologs into unrelated sequence, and (2) the existence of low complexity sequence, especially when the query and target share similar patterns of repetitive sequence, such as atgatgatgatgatg (‘atg’, repeated). In our experience, these issues together cause >2% of all annotation to be incorrect, even with current strategies for avoiding the resulting errors. Furthermore, these strategies are themselves responsible for some loss in sensitivity to remote homology. This study will lay the groundwork for addressing both sources of false annotation. Specifically: