The Wheeler lab designs algorithms and statistical methods for problems motivated by biological data. We are particularly focused on the annotation of biological sequences and the accompanying problem of searching for similar sequences within large-scale biological sequence databases, but our work also addresses infectious disease, soil microbiomes, transposable elements, and neuroscience.
Projects in our group range from statistical modeling of biological sequence families, to text indexing and bounded search algorithms, to low-level software optimization, to FPGAs, to Deep Neural Networks, to Natural Language Processing, to web services; from genomes, to proteins, to multiomics, to drug discovery, to animal tracking and behavior.
We are housed in the Pharmacy Practice & Science Department at the University of Arizona, in Tucson. Awesome mountains, great weather, terrific research environment.
A tool for protein sequence database search that is both very fast and very sensitive. Roddy, J.R., Rich, D.H., and Wheeler, T.J. 2024
Neural representation for rapid protein search prefiltering. Olson, D.R., Demekas, D., Colligan, T., and Wheeler, T.J. 2024
Better Alignments with Translated HMMER - Frameshift Aware Translated Hidden Markov Models for the Annotation of Protein Coding DNA. Krause, G., Shands, W., and Wheeler, T.J. 2024
Deep learning-based Identity Preserving Labeled-Object Multi-Animal Tracking. Robinson, I., Insel, N., and Wheeler, T.J. 2023
An open source workflow for virtually screening billions of molecules for binding affinity to protein targets. Venkatraman, V., Colligan, T.H., Lesica, G.T., Olson, D.R., Gaiser, J., Copeland, C., Wheeler, T.J., and Roy, A. 2022
DISCO Implements Sound Classification Obediently. Colligan, T., Irish, K., Emlen, D.J., and Wheeler, T.J. 2022
An Open Source Library for Visualizing Biological Sequence Annotation. Roddy, J., Lesica, G., and Wheeler, T.J. 2021
A fast, AVX2-accelerated FM-index library for string pattern matching in nucleotide and amino acid sequences. Open source C library. Anderson, T. and Wheeler, T.J. 2021
A tool for locating and labeling tandemly-repetitive sequence. Olson, D. and Wheeler, T.J. 2018
A tool for splice-aware multiple sequence alignment. Nord, A. and Wheeler, T.J. 2018
Biological sequence analysis using profile hidden Markov models. Eddy, S.R. and Wheeler, T.J. 2013
A DNA-DNA sequence homology search tool based on profile hidden Markov models, in the HMMER3 framework. Wheeler, T.J. and Eddy, S.R. 2012
A Mesquite package for fast neighbor-joining phylogeny inference. Wheeler, T.J. and Maddison, D.R. 2010
Software for large-scale neighbor-joining phylogeny inference. Wheeler, T.J. 2009
A Mesquite package for multiple sequence alignment. Wheeler, T.J. and Maddison, D.R. 2009
Software for multiple sequence alignment by optimally aligning alignments. Wheeler, T.J. and Kececioglu, J.D. 2006
A Mesquite package for aligning sequence data. Maddison, D.R., Wheeler, T.J., and Maddison, W.P. 2006
Software for optimally aligning alignments. Starrett, D.M., Wheeler, T.J., and Kececioglu, J.D. 2005
A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence. Hayden, C.A. and Wheeler, T.J. 2005
An open repository of MD simulations for proteins, with or without ligands, generated by the worldwide community of researchers. Roy, A., Ward, E., …, Wheeler, T.J. 2024-.
A place where researchers working on Transposable Elements (TEs) can catalog available online resources. It is organized as a collection of wiki pages, enabling community contribution and collaboration. The TE Hub Consortium, Elliott T., Heitkam T., Hubley R., Quesneville H., Suh A., Wheeler T.J. 2021-.
A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. Wheeler, T.J., Clements, J., Finn, R.D. 2013-.
A Database of Repetitive DNA Based on Profile Hidden Markov Models. Hubley, R., Smit, A.F.A., …, Wheeler, T.J. 2012-.
Arizona TRIF initiative (2023-2024)
Virtual drug screening will dramatically expand the diversity of explored candidate drugs, while reducing the time and cost of discovery. We will extend development of AI methods to predict good drug candidates for a target protein. Models will explore billions of candidate synthesizable drugs, and we will complete development of a first-in-class repository of drug interaction simulations.
DOE PerCon SFA (2023-2026)
(PI: Robert Egbert @ Pacific Northwest National Laboratory)
Collaborating across highly integrated institutions, PerCon SFA scientists are exploring how environmental niches can be sculpted using the mechanisms of genome reduction and metabolic addiction to drive secure rhizosphere community design for robust biomass cropping in challenging environments. Our group works to develop improved Machine Learning methods to recognize similarities between proteins.
NIH 1U24HG010136 (2018-2028)
(PI: Arian Smit, co-PI: Robert Hubley @ Institute for Systems Biology)
Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. The goal of the proposed effort is to develop the infrastructure of Dfam to expand to thousands of genomes, and to establish a self-sustaining TE Data Commons dependent on limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expand approaches to visualization of this complex data type, and improve the modeling of TE subfamilies.
NIH 1R21HL172036 (2023-2025)
(PI: Jason Karnes @ UArizona)
Genetic variation in immune-related genes, as in the human leukocyte antigen (HLA) locus, plays a pervasive role across organ systems. HLA variants, called HLA alleles, are used to match organ donors, and have been associated with adverse drug reactions (ADRs), cancer, infections, and cardiovascular and neurologic diseases. However, most studies focus on the impact of HLA variation on specific immune-mediated diseases; the broader influence of HLA variation across all human disease has not been investigated in depth. In Aim 1, HLA alleles will be determined using whole genome sequence data, and PheWAS will be deployed in All of Us to explore ancestral differences in HLA/phenotype associations. In Aim 2, we will develop Machine Learning strategies to explore the effect of HLA allele interactions on disease, and explore the potential for recognizing pleiotropic influences of HLA alleles.
NIH R01HG002939 (2022-2027)
(Multi-PI w/: Arian Smit and Robert Hubley @ Institute for Systems Biology)
Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and their annotation is crucial for genome sequence analysis and our understanding of TEs' unrivaled impact on genome biology and evolution. Their de novo discovery and description has become a bottleneck in the genome analysis of the thousands of new species sequenced every year. In this effort, we will make foundational changes to the way RepeatMasker adjudicates TE alignments and assigns confidence to annotations, develop two paths to improving the generation of new TE libraries through the use of multi-species genome alignments and ancestral reconstructions, and make core algorithmic changes to our RepeatModeler discovery tool.
NIH R21HG012283 (2022-2024)
The goal of this study is to catalog the tissue- and development-specific splicing patterns of dual-coding exon variants, and to computationally explore their mechanisms of control and expected functional impact.
NSF 1933305 (2022-2024)
We will develop algorithms for improved identification of peptides from tandem mass spectrometry datasets.
NIH 1R01GM132600 (2019-2023)
Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. We will develop Machine Learning approaches to improve both the accuracy and speed of highly sensitive sequence alignment. To improve accuracy, we will develop methods based on both hidden Markov models and Artificial Neural Networks to reduce erroneous annotation caused by (1) the existence of low complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We will also address the issue of annotation speed by developing a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively slow sequence-alignment step.
DOE DE-SC0021216 (2020-2023)
(Joint with Jason McDermott @ Pacific Northwest National Laboratory)
Communities of microbes in soil are key contributors to the plant-soil dynamic that supports production of food and fuel crops, for example driving nitrogen fixation, drought resistance, and nutrient cycling. The composition and interactions of these communities are of great importance, but these are often difficult to fully characterize due to challenges with sample acquisition, data processing, and community complexity and diversity. The effort supported by this grant will improve understanding of soil microbial communities through a combination of improved engineering for prototyped sequence annotation software, novel approaches in Deep Learning sequence annotation, and a new Bayesian method for integrating data from multiple high-throughput omics sources (particularly genomics and metabolomics).
NIH 1R15MH117611 (2019-2022)
(PI: Nathan Insel @ University of Montana - Psychology)
The goals of this project relate to social cognition in degus (highly social rodents). The Wheeler lab's role involves developing machine learning methods for tracking multiple animals in video and for classifying behavior in those videos.
NIH 1R15HG009570-01 (2017-2020)
Fast and sensitive sequence database search is fundamental to modern molecular biology. This proposal describes a research plan to improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and ability to better model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.
The taxonomic breadth of sequenced datasets requires methods with the power to detect remote sequence similarity; raw data and sequencing errors demand models that recognize frameshifts and splice sites; and the massive scale of datasets demands that implementations be fast. My group will develop novel models for frameshift and splice site detection in profile HMM homology search. Direct modeling of these features within search software effectively uses homology to guide ORF/gene prediction, which in turn leads to better homology detection. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Institute Metagenomics Portal.
P20GM103546 NIH CoBRE (2017-2020)
With the continued explosive increase in genomic and metagenomic sequencing, the community requires effective and increasingly scalable methods to more fully decode, organize, and exploit sequence data. Accurate and complete annotation of a genomic dataset, based on sequence homology, is a critical first step in understanding its content. This annotation often boils down to sequence database search – the act of searching in a large sequence dataset to find sequences that are similar to known elements.
We aim to develop methods that will substantially improve the speed of sequence comparison with profile hidden Markov models, meeting the need for methods that are fast enough to accommodate large-scale databases, while still powerful enough to detect remote sequence similarity. We will implement these methods in the HMMER codebase, focusing on three complementary target optimizations. Specifically, the aims are:
P20GM103546 NIH CoBRE (2016-2017) Pilot grant
Sequence comparison is fundamental to modern molecular biology. Much effort has been expended in the development of methods to make comparison faster and more sensitive. Though the risk of false annotation is understood, the extent and key causes have only been lightly addressed. Two primary sources of false annotation are (1) the overextension of alignments of true homologs into unrelated sequence, and (2) the existence of low complexity sequence, especially when the query and target share similar patterns of repetitive sequence, such as atgatgatgatgatg (‘atg’, repeated). In our experience, these issues together cause >2% of all annotation to be incorrect, even with current strategies for avoiding the resulting errors. Furthermore, these strategies are themselves responsible for some loss in sensitivity to remote homology. This study will lay the groundwork for addressing both sources of false annotation. Specifically: