Home | People | Contact

Wheeler Lab


In May 2022, the lab moved to the Pharmacy Practice & Science Department at the University of Arizona, in Tucson. Awesome mountains, great weather, terrific research environment.

Computational Genomics, Drug Discovery, and more

The Wheeler lab designs algorithms and statistical methods for problems motivated by biological data. We are particularly focused on the annotation of biological sequences and the accompanying problem of searching for similar sequences within large-scale biological sequence databases, but our work also addresses infectious disease, soil microbiomes, transposable elements, and neuroscience.

Projects in our group range from statistical modeling of biological sequence families, to text indexing and bounded search algorithms, to low-level software optimization, to FPGAs, to Deep Neural Networks, to Natural Language Processing, to web services; from genomes, to proteins, to multiomics, to drug discovery, to animal tracking and behavior.

We have open positions for motivated postdocs and PhD students. Please drop me a line if you’d like to learn more or discuss possibilities. I’m particularly interested in hearing from people who have ideas about problems that they’d like to tackle.


google scholar

G. Krause, W. Shands, and T. J. Wheeler, “Sensitive and error-tolerant annotation of protein-coding DNA with BATH,” bioRxiv, 2024, paper(doi): 10.1101/2023.12.31.573773.
J. W. Roddy, D. H. Rich, and T. J. Wheeler, “nail: Software for high-speed, high-sensitivity protein sequence annotation,” bioRxiv, 2024, paper(doi): 10.1101/2024.01.27.577580.
D. R. Olson, D. Demekas, T. Colligan, and T. Wheeler, “NEAR: Neural Embeddings for Amino acid Relationships,” bioRxiv, 2024, paper(doi): 10.1101/2024.01.25.577287.
J. M. Storer, J. A. Walker, J. N. Baker, S. Hossain, C. Roos, T. J. Wheeler, and M. A. Batzer, “Framework of the Alu subfamily evolution in the platyrrhine three-family clade of Cebidae, Callithrichidae, and Aotidae,” Genes, vol. 14, no. 2, p. 249, Jan. 2023, paper(doi): 10.3390/genes14020249.
J. Schimunek, P. Seidl, K. Elez, T. Hempel, T. Le, et al., “A community effort in SARS-CoV-2 drug discovery,” Molecular Informatics, Nov. 2023, paper(doi): 10.1002/minf.202300262.
A. Marbut, K. McKinney-Bock, and T. Wheeler, “Reliable measures of spread in high dimensional latent spaces,” in International conference on machine learning, 2023, pp. 23871–23885, paper(doi): 10.48550/arXiv.2212.08172. (extra: https://proceedings.mlr.press/v202/marbut23a/marbut23a.pdf)
T. Colligan, K. Irish, D. J. Emlen, and T. J. Wheeler, “DISCO: A deep learning ensemble for uncertainty-aware segmentation of acoustic signals,” PLOS ONE, vol. 18, no. 7, pp. 1–20, Jul. 2023, paper(doi): 10.1371/journal.pone.0288172.
C. Groza, X. Chen, T. J. Wheeler, G. Bourque, and C. Goubert, “GraffiTE: A unified framework to analyze transposable element insertion polymorphisms using genome-graphs,” bioRxiv, Sep. 2023, paper(doi): 10.1101/2023.09.11.557209.
T. Anderson and T. Wheeler, “An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models,” bioRxiv, Sep. 2023, paper(doi): 10.1101/2023.09.20.558701.
A. J. Nord and T. J. Wheeler, “Mirage2’s high-quality spliced protein-to-genome mappings produce accurate multiple-sequence alignments of isoforms,” PLOS ONE, vol. 18, no. 5, p. e0285225, 2023, paper(doi): 10.1371/journal.pone.0285225. (extra: https://wheelerlab.org/publications/2023-Mirage2-Nord/2023-Nord-Mirage2-Supplement.tgz)
G. Glidden-Handgis and T. J. Wheeler, “WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences,” bioRxiv, Jun. 2023, paper(doi): 10.1101/2023.06.19.545636.
C. J. Copeland, J. W. Roddy, A. K. Schmidt, P. R. Secor, and T. J. Wheeler, “VIBES: A workflow for annotating and visualizing viral sequences integrated into bacterial genomes,” bioRxiv, Oct. 2023, paper(doi): 10.1101/2023.10.17.562434.
N. Altemose, G. A. Logsdon, A. V. Bzikadze, P. Sidhwani, S. A. Langley, et al., “Complete genomic and epigenetic maps of human centromeres,” Science, vol. 376, no. 6588, Apr. 2022, paper(doi): 10.1126/science.abl4178.
S. J. Hoyt, J. M. Storer, G. A. Hartley, P. G. S. Grady, A. Gershman, et al., “From telomere to telomere: The transcriptional and epigenetic state of human repeat elements,” Science, vol. 376, no. 6588, Apr. 2022, paper(doi): 10.1126/science.abk3112.
V. Venkatraman, T. H. Colligan, G. T. Lesica, D. R. Olson, J. Gaiser, C. J. Copeland, T. J. Wheeler, and A. Roy, “Drugsniffer: An open source workflow for virtually screening billions of molecules for binding affinity to protein targets,” Front. Pharmacol., vol. 13, Apr. 2022, paper(doi): 10.3389/fphar.2022.874746.
V. Venkatraman, A. Roy, J. Gaiser, and T. J. Wheeler, “Molecular fingerprints are not useful in large-scale search for similarly active compounds,” bioRxiv, 2022, paper(doi): 10.1101/2022.09.20.508800.
J. F. Brodie, L. F. Henao-Diaz, B. Pratama, C. Copeland, T. Wheeler, and O. E. Helmy, “Fruit size in Indo-Malayan island plants is more strongly influenced by filtering than by in situ evolution,” The American Naturalist, Nov. 2022, paper(doi): 10.1086/723212.
D. Geller-McGrath, K. M. Konwar, V. P. Edgcomb, M. Pachiadaki, J. W. Roddy, T. J. Wheeler, and J. E. McDermott, “MetaPredict: A machine learning-based tool for predicting metabolic modules in incomplete bacterial genomes,” bioRxiv, Dec. 2022, paper(doi): 10.1101/2022.12.21.521254.
R. Hubley, T. J. Wheeler, and A. F. A. Smit, “Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families,” NAR Genomics and Bioinformatics, vol. 4, no. 2, May 2022, paper(doi): 10.1093/nargab/lqac040.
J. W. Roddy, G. T. Lesica, and T. J. Wheeler, “SODA: A TypeScript/JavaScript library for visualizing biological sequence annotation,” NAR Genomics and Bioinformatics, vol. 4, no. 4, Oct. 2022, paper(doi): 10.1093/nargab/lqac077.
T. Anderson and T. J. Wheeler, “An optimized FM-index library for nucleotide and amino acid search,” Algorithms for Molecular Biology, vol. 16, no. 1, Dec. 2021, paper(doi): 10.1186/s13015-021-00204-6. (extra: http://wheelerlab.org/publications/2021-AWFM-Anderson/Anderson_suppl.tar.gz)
J. Storer, R. Hubley, J. Rosen, T. J. Wheeler, and A. F. Smit, “The Dfam community resource of transposable element families, sequence models, and genome annotations,” Mobile DNA, vol. 12, no. 1, p. 2, 2021, paper(doi): 10.1186/s13100-020-00230-y.
K. M. Carey, G. Patterson, and T. J. Wheeler, “Transposable element subfamily annotation has a reproducibility problem,” Mobile DNA, vol. 12, pp. 1759–8753, 2021, paper(doi): 10.1186/s13100-021-00232-4. (extra: http://wheelerlab.org/publications/2020-discordant-CareyPatterson/CareyPatterson_suppl.tar.gz)
K. M. Carey, R. Hubley, G. T. Lesica, D. Olson, J. W. Roddy, J. Rosen, A. Shingleton, A. F. Smit, and T. J. Wheeler, “PolyA: A tool for adjudicating competing annotations of biological sequences,” bioRxiv, 2021, paper(doi): 10.1101/2021.02.13.430877.
T. A. Elliott, T. Heitkam, R. Hubley, H. Quesneville, A. Suh, and T. J. Wheeler, “TE hub: A community-oriented space for sharing and connecting tools, data, resources, and methods for transposable element annotation,” Mobile DNA, vol. 12, no. 1, p. 16, 2021, paper(doi): 10.1186/s13100-021-00244-0.
P. R. Secor, E. B. Burgener, M. Kinnersley, L. K. Jennings, V. Roman-Cruz, et al., “Pf bacteriophage and their impact on Pseudomonas virulence, mammalian immunity, and chronic infections,” Frontiers in Immunology, vol. 11, 2020, paper(doi): 10.3389/fimmu.2020.00244.
M. Grimes, B. Hall, L. Foltz, T. Levy, K. Rikova, et al., “Integration of protein phosphorylation, acetylation, and methylation data sets to outline lung cancer signaling networks,” Science Signaling, vol. 11, no. 531, p. eaaq1087, 2018, paper(doi): 10.1126/scisignal.aaq1087.
A. Nord, P. Hornbeck, K. Carey, and T. Wheeler, “Splice-aware multiple sequence alignment of protein isoforms,” in Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics - BCB 18, 2018, paper(doi): 10.1145/3233547.3233592. (extra: http://wheelerlab.org/publications/Nord18/Nord18.supplement.tar.gz)
P. V. Hornbeck, J. M. Kornhauser, V. Latham, B. Murray, V. Nandhikonda, et al., “15 years of PhosphoSitePlus: Integrating post-translationally modified sites, disease variants and isoforms,” Nucleic Acids Research, vol. 47, no. D1, pp. D433–D441, 2018, paper(doi): 10.1093/nar/gky1159.
D. Olson and T. Wheeler, “ULTRA: A model based tool to detect tandem repeats,” in Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics - BCB 18, 2018, paper(doi): 10.1145/3233547.3233604. (extra: http://wheelerlab.org/publications/Olson18/Olson18.supplement.tar.gz)
R. Hubley, R. D. Finn, J. Clements, S. R. Eddy, T. A. Jones, W. Bao, A. F. A. Smit, and T. J. Wheeler, “The Dfam database of repetitive DNA families,” Nucleic Acids Research, vol. 44, no. D1, pp. D81–D89, 2015, paper(doi): 10.1093/nar/gkv1272.
R. D. Finn, J. Clements, W. Arndt, B. L. Miller, T. J. Wheeler, F. Schreiber, A. Bateman, and S. R. Eddy, “HMMER web server: 2015 update,” Nucleic Acids Research, vol. 43, no. W1, pp. W30–W38, 2015, paper(doi): 10.1093/nar/gkv397.
D. R. Hoen, G. Hickey, G. Bourque, J. Casacuberta, R. Cordaux, et al., “A call for benchmarking transposable element annotation methods,” Mobile DNA, vol. 6, no. 1, 2015, paper(doi): 10.1186/s13100-015-0044-6.
T. J. Wheeler, J. Clements, and R. D. Finn, “Skylign: A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models,” BMC Bioinformatics, vol. 15, no. 1, 2014, paper(doi): 10.1186/1471-2105-15-7.
T. J. Wheeler, J. Clements, S. R. Eddy, R. Hubley, T. A. Jones, J. Jurka, A. F. A. Smit, and R. D. Finn, “Dfam: A database of repetitive DNA based on profile hidden Markov models,” Nucleic Acids Research, vol. 41, no. D1, pp. D70–D82, 2013, paper(doi): 10.1093/nar/gks1265. (extra: http://wheelerlab.org/publications/Wheeler13/Wheeler13.supplement.tar.gz)
T. J. Wheeler and S. R. Eddy, “nhmmer: DNA homology search with profile HMMs,” Bioinformatics, vol. 29, no. 19, pp. 2487–2489, 2013, paper(doi): 10.1093/bioinformatics/btt403. (extra: http://wheelerlab.org/publications/Wheeler13b/Wheeler13b.supplement.tar.gz)
D. F. DeBlasio, T. J. Wheeler, and J. D. Kececioglu, “Estimating the accuracy of multiple alignments and its use in parameter advising,” in Lecture notes in computer science, Springer Berlin Heidelberg, 2012, pp. 45–59.
J. Kececioglu, E. Kim, and T. Wheeler, “Aligning protein sequences with predicted secondary structure,” Journal of Computational Biology, vol. 17, no. 3, pp. 561–580, 2010, paper(doi): 10.1089/cmb.2009.0222.
G. Tanifuji, N. T. Onodera, T. J. Wheeler, M. Dlutek, N. Donaher, and J. M. Archibald, “Complete nucleomorph genome sequence of the nonphotosynthetic alga Cryptomonas paramecium reveals a core nucleomorph gene set,” Genome Biology and Evolution, vol. 3, pp. 44–54, 2010, paper(doi): 10.1093/gbe/evq082.
E. Kim, T. Wheeler, and J. Kececioglu, “Learning models for aligning protein sequences with predicted secondary structure,” in Lecture notes in computer science, Springer Berlin Heidelberg, 2009, pp. 512–531.
T. J. Wheeler, “Large-scale neighbor-joining with NINJA,” in Lecture notes in computer science, Springer Berlin Heidelberg, 2009, pp. 375–389. (extra: http://wheelerlab.org/files/NINJA2009.pdf)
T. J. Wheeler, “Efficient construction of accurate multiple alignments and large-scale phylogenies,” PhD thesis, The University of Arizona, 2009. (extra: http://wheelerlab.org/files/dissertation_wheeler_final.pdf)
T. J. Wheeler and J. D. Kececioglu, “Multiple alignment by aligning alignments,” Bioinformatics, vol. 23, no. 13, pp. i559–i568, 2007, paper(doi): 10.1093/bioinformatics/btm226.
J. M. Good, C. A. Hayden, and T. J. Wheeler, “Adaptive protein evolution and regulatory divergence in Drosophila,” Molecular Biology and Evolution, vol. 23, no. 6, pp. 1101–1103, 2006, paper(doi): 10.1093/molbev/msk002.
C. A. Hayden, T. J. Wheeler, and R. A. Jorgensen, “Evaluating and improving cDNA sequence quality with cQC,” Bioinformatics, vol. 21, no. 24, pp. 4414–4415, 2005, paper(doi): 10.1093/bioinformatics/bti709.
A. D. Cutter, J. M. Good, C. T. Pappas, M. A. Saunders, D. M. Starrett, and T. J. Wheeler, “Transposable element orientation bias in the Drosophila melanogaster genome,” Journal of Molecular Evolution, vol. 61, no. 6, pp. 733–741, 2005, paper(doi): 10.1007/s00239-004-0243-0.

Software and Databases

—>>>>> github <<<<<—

nail (nail is an alignment inference tool)

A tool for protein sequence database search that is both very fast and very sensitive. Roddy, J.W., Rich, D.H., and Wheeler, T.J. 2024

NEAR (Neural Embeddings for Amino acid Relationships)

Neural representation for rapid protein search prefiltering. Olson, D.R., Demekas, D., Colligan, T., and Wheeler, T.J. 2024

Better Alignments with Translated HMMER - Frameshift Aware Translated Hidden Markov Models for the Annotation of Protein Coding DNA. Krause, G., Shands, W., and Wheeler, T.J. 2024

DIPLOMAT (Tracking multiple animals through video recordings)

Deep learning-based Identity Preserving Labeled-Object Multi-Animal Tracking. Robinson, I., Insel, N., and Wheeler, T.J. 2023

Drugsniffer (Billion-scale virtual drug screening)

An open source workflow for virtually screening billions of molecules for binding affinity to protein targets. Venkatraman, V., Colligan, T.H., Lesica, G.T., Olson, D.R., Gaiser, J., Copeland, C., Wheeler, T.J., and Roy, A. 2022

DISCO (Annotation of sound blocks in audio recordings)

DISCO Implements Sound Classification Obediently. Colligan, T., Irish, K., Emlen, D.J., and Wheeler, T.J. 2022

SODA (A Library for building annotation visualizations)

An Open Source Library for Visualizing Biological Sequence Annotation. Roddy, J., Lesica, G., and Wheeler, T.J. 2021

AWFM-index library

A fast, AVX2-accelerated FM-index library for string pattern matching in nucleotide and amino acid sequences. Open-source C library. Anderson, T. and Wheeler, T.J. 2021

TE Hub

TE Hub is a place where researchers working on Transposable Elements (TEs) can catalog available online resources. It is organized as a collection of wiki pages, enabling community contribution and collaboration. 2020


A network interface for protein-protein interaction networks filtered based on post-translational modifications. Grimes, M., Hall, B., Foltz, L., Levy, T., Rikova, K., Gaiser, J., Cook, W., Smirnova, E., Wheeler, T.J., Clark, N.R., and Lachmann, A. 2018

ULTRA (A tool for labeling tandemly-repetitive DNA)

A tool for locating and labeling tandemly-repetitive sequence. Olson, D. and Wheeler, T.J. 2018


A tool for splice-aware multiple sequence alignment. Nord, A. and Wheeler, T.J. 2018

Skylign Logo server

A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. Wheeler, T.J., Clements, J., Finn, R.D. 2013

HMMER webserver, and HMMER3.1

Biological sequence analysis using profile hidden Markov models. Eddy, S.R. and Wheeler, T.J. 2013

nhmmer (within HMMER3.1)

A DNA-DNA sequence homology search tool based on profile hidden Markov models, in the HMMER3 framework. Wheeler, T.J. and Eddy, S.R. 2012

Dfam (A database of transposable element families)

A Database of Repetitive DNA Based on Profile Hidden Markov Models. Wheeler, T.J., Clements, J., Eddy, S.R., Hubley, R., Jones, T.A., Jurka, J., Smit, A.F.A., and Finn, R.D. 2012


A Mesquite package for fast neighbor-joining phylogeny inference. Wheeler, T.J. and Maddison, D.R. 2010


Software for large-scale neighbor-joining phylogeny inference. Wheeler, T.J. 2009


A Mesquite package for multiple sequence alignment. Wheeler, T.J. and Maddison, D.R. 2009


Software for multiple sequence alignment by optimally aligning alignments. Wheeler, T.J. and Kececioglu, J.D. 2006


A Mesquite package for aligning sequence data. Maddison, D.R., Wheeler, T.J., and Maddison, W.P. 2006


Software for optimally aligning alignments. Starrett, D.M., Wheeler, T.J., and Kececioglu, J.D. 2005

cQC - cDNA Quality Control

A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence. Hayden, C.A. and Wheeler, T.J. 2005


Integrating Deep Learning Methods with Molecular Surface Properties to Improve Drug Screening

Arizona TRIF initiative (2023-2024)

Virtual drug screening will dramatically expand the diversity of explored candidate drugs, while reducing the time and cost of discovery. We will extend development of AI methods to predict good drug candidates for a target protein. Models will explore billions of candidate synthesizable drugs, and we will complete development of a first-in-class repository of drug interaction simulations.

SFA-Secure Biosystems Design: Persistence Control of Engineered Functions in Complex Soil Microbiomes

DOE PerCon SFA (2023-2026)

(PI: Robert Egbert @ Pacific Northwest National Laboratory)

Collaborating across highly integrated institutions, PerCon SFA scientists are exploring how environmental niches can be sculpted using the mechanisms of genome reduction and metabolic addiction to drive secure rhizosphere community design for robust biomass cropping in challenging environments. Our group works to develop improved Machine Learning methods to recognize similarities between proteins.

Dfam: sustainable growth, curation support, and improved quality for mobile element annotation

NIH 1U24HG010136 (2018-2028)

(PI: Arian Smit, co-PI: Robert Hubley @ Institute for Systems Biology)

Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. The goal of the proposed effort is to develop the infrastructure of Dfam to expand to 1000s of genomes, and to establish a self-sustaining TE Data Commons dependent on limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expanding approaches to visualization of this complex data type, and improving the modeling of TE subfamilies.

Discovery of Immunogenomic Associations with Disease and Differential Risk Across Diverse Populations

NIH 1R21HL172036 (2023-2025)

(PI: Jason Karnes @ UArizona)

Genetic variation in immune-related genes, as in the human leukocyte antigen (HLA) locus, plays a pervasive role across organ systems. HLA variation, called HLA alleles, is used to match organ donors, and has been associated with adverse drug reactions (ADRs), cancer, infections, and cardiovascular and neurologic diseases. However, most studies focus on the impact of HLA variation on specific immune-mediated diseases; the broader influence of HLA variation across all human disease has not been investigated in depth. In Aim 1, HLA alleles will be determined using whole genome sequence data, and PheWAS will be deployed in All of Us to explore ancestral differences in HLA/phenotype associations. In Aim 2, we will develop Machine Learning strategies to explore the effect of HLA allele interactions on disease, and explore the potential for recognizing pleiotropic influences of HLA alleles.

Development and Maintenance of RepeatMasker and RepeatModeler

NIH R01HG002939 (2022-2027)

(Multi-PI w/: Arian Smit and Robert Hubley @ Institute for Systems Biology)

Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and their annotation is crucial for genome sequence analysis and our understanding of TEs’ unrivaled impact on genome biology and evolution. Their de novo discovery and description has become a bottleneck in the genome analysis of the thousands of new species sequenced every year. In this effort, we will make foundational changes to the way RepeatMasker adjudicates TE alignments and assigns confidence to annotations, develop two paths to improving the generation of new TE libraries through the use of multi-species genome alignments and ancestral reconstructions, along with core algorithmic changes to our RepeatModeler discovery tool.

Building Knowledge About Alternatively-spliced Dual-Coding Exons

NIH R21HG012283 (2022-2024)

The goal of this study is to catalog the tissue- and development-specific splicing patterns of dual-coding exon variants, and to computationally explore their mechanisms of control and expected functional impact.

Overcoming Combinatoric Complexity Problems in Computational Mass Spectrometry

NSF 1933305 (2022-2024)

We will develop algorithms for improved identification of peptides from tandem mass spectrometry datasets.

Machine learning approaches for integrating multi-omics data to expand microbiome annotation

DOE DE-SC0021216 (2020-2023)

(Joint with Jason McDermott @ Pacific Northwest National Laboratory)

Communities of microbes in soil are key contributors to the plant-soil dynamic that supports production of food and fuel crops, for example driving nitrogen fixation, drought resistance, and nutrient cycling. The composition and interactions of these communities are of great importance, but these are often difficult to fully characterize due to challenges with sample acquisition, data processing, and community complexity and diversity. The effort supported by this grant will improve understanding of soil microbial communities through a combination of improved engineering for prototyped sequence annotation software, novel approaches in Deep Learning sequence annotation, and a new Bayesian method for integrating data from multiple high-throughput omics sources (particularly genomics and metabolomics).

Machine learning approaches for improved accuracy and speed in sequence annotation

NIH 1R01GM132600 (2019-2023)

Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. We will develop Machine Learning approaches to improve both accuracy and speed of highly-sensitive sequence alignment. To improve accuracy, we will develop methods based on both hidden Markov models and Artificial Neural Networks to reduce erroneous annotation caused by (1) the existence of low complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We also address the issue of annotation speed, with development of a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively-slow sequence-alignment step.

Past grants

Learning and Neural Coding of Social Expectations

NIH 1R15MH117611 (2019-2022)

(PI: Nathan Insel @ University of Montana - Psychology)

The goals of this project relate to social cognition in degus (highly social rodents). The Wheeler lab’s role involves development of machine learning methods for tracking multiple animals in video and classifying behavior in those videos.

Improved protein-DNA models for translated sequence search with profile HMMs

NIH 1R15HG009570-01 (2017-2020)

Fast and sensitive sequence database search is fundamental to modern molecular biology. This proposal describes a research plan to improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and ability to better model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.

The taxonomic breadth of sequenced datasets requires methods with the power to detect remote sequence similarity; raw data and sequencing errors demand models that recognize frameshifts and splice sites; and the massive scale of datasets demands that implementations be fast. My group will develop novel models for frameshifts and splice site detection in profile HMM homology search. Direct modeling of these features within search software effectively uses homology to guide ORF/gene prediction, which in turn leads to better homology detection. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Metagenomics Portal.

Methods for fast bio-sequence comparison with profile hidden Markov models

P20GM103546 NIH CoBRE (2017-2020)

With the continued explosive increase in genomic and metagenomic sequencing, the community requires effective and increasingly scalable methods to more fully decode, organize, and exploit sequence data. Accurate and complete annotation of a genomic dataset, based on sequence homology, is a critical first step in understanding its content. This annotation often boils down to sequence database search – the act of searching in a large sequence dataset to find sequences that are similar to known elements.

We aim to develop methods that will substantially improve the speed of sequence comparison with profile hidden Markov models, meeting the need for methods that are fast enough to accommodate large-scale databases, while still powerful enough to detect remote sequence similarity. We will implement these methods in the HMMER codebase, focusing on three complementary target optimizations. Specifically, the aims are:

Reducing false sequence annotation due to alignment overextension and repetitive sequences

P20GM103546 NIH CoBRE (2016-2017) Pilot grant

Sequence comparison is fundamental to modern molecular biology. Much effort has been expended in the development of methods to make comparison faster and more sensitive. Though the risk of false annotation is understood, the extent and key causes have only been lightly addressed. Two primary sources of false annotation are (1) the overextension of alignments of true homologs into unrelated sequence, and (2) the existence of low complexity sequence, especially when the query and target share similar patterns of repetitive sequence, such as atgatgatgatgatg (‘atg’, repeated). In our experience, these issues together cause >2% of all annotation to be incorrect, even with current strategies for avoiding the resulting errors. Furthermore, these strategies are themselves responsible for some loss in sensitivity to remote homology. This study will lay the groundwork for addressing both sources of false annotation. Specifically:

Lab stuff (by invite only)

Institutional memory

