Home | People | Contact

Wheeler Lab


We’re hiring!

We’re actively recruiting a(t least one) new postdoc to join us in the Wheeler Lab, working on projects developing new Machine Learning methods to improve genome annotation and analysis of metagenomic and metabolomic data from microbiomes (e.g. in soil). Submit an application on the UM job portal. For light details, see the Funding section at the bottom of this page. Get in touch if you want to learn more about the projects.

Computational Genomics at the University of Montana

The Wheeler lab designs algorithms and statistical methods for problems motivated by biological data. We are particularly focused on the annotation of biological sequences and the accompanying problem of searching for similar sequences within large-scale biological sequence databases, but our work also addresses infectious disease and neuroscience.

Projects in our group range from statistical modeling of biological sequence families, to text indexing and bounded search algorithms, to low-level software optimization, to FPGAs, to Deep Neural Networks, to Natural Language Processing, to web services; from genomes, to proteins, to metagenomics, to drug discovery, to animal tracking and behavior.

The lab is located in the Computer Science Department at the University of Montana. We have open positions for motivated postdocs and PhD students. Please drop me a line if you’d like to learn more or discuss possibilities.

Resources: GitHub | Institutional memory (by invite only)


People living in a pretty great place

Check out our vibrant group of Computer Scientists and Biologists working together to do great things.

We’re located in beautiful Missoula, Montana. It’s a well-run, forward-thinking, top-10 college town that’s also a top-25 place to live, with an amazing outdoor lifestyle. And, yeah, a river runs through it.



Publications

Google Scholar

[1] J. Storer, R. Hubley, J. Rosen, T. J. Wheeler, and A. F. A. Smit, “The Dfam community resource of transposable element families, sequence models, and genome annotation,” Mobile DNA (in review, preprint), 2020. https://doi.org/10.21203/rs.3.rs-76062/v1

[2] K. Carey, G. Patterson, and T. J. Wheeler, “Transposable element subfamily annotation is unreliable,” Mobile DNA (in review, preprint), Oct. 2020. https://doi.org/10.21203/rs.3.rs-86308/v2

[3] P. R. Secor, E. B. Burgener, M. Kinnersley, L. K. Jennings, V. Roman-Cruz, et al., “Pf bacteriophage and their impact on Pseudomonas virulence, mammalian immunity, and chronic infections,” Frontiers in Immunology, vol. 11, Feb. 2020. https://doi.org/10.3389/fimmu.2020.00244

[4] M. Grimes, B. Hall, L. Foltz, T. Levy, K. Rikova, et al., “Integration of protein phosphorylation, acetylation, and methylation data sets to outline lung cancer signaling networks,” Science Signaling, vol. 11, no. 531, p. eaaq1087, May 2018. https://doi.org/10.1126/scisignal.aaq1087

[5] A. Nord, P. Hornbeck, K. Carey, and T. Wheeler, “Splice-aware multiple sequence alignment of protein isoforms,” in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’18), 2018. https://doi.org/10.1145/3233547.3233592

[6] P. V. Hornbeck, J. M. Kornhauser, V. Latham, B. Murray, V. Nandhikonda, et al., “15 years of PhosphoSitePlus: Integrating post-translationally modified sites, disease variants and isoforms,” Nucleic Acids Research, vol. 47, no. D1, pp. D433–D441, Nov. 2018. https://doi.org/10.1093/nar/gky1159

[7] D. Olson and T. Wheeler, “ULTRA: A model based tool to detect tandem repeats,” in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’18), 2018. https://doi.org/10.1145/3233547.3233604

[8] R. Hubley, R. D. Finn, J. Clements, S. R. Eddy, T. A. Jones, W. Bao, A. F. A. Smit, and T. J. Wheeler, “The Dfam database of repetitive DNA families,” Nucleic Acids Research, vol. 44, no. D1, pp. D81–D89, Nov. 2015. https://doi.org/10.1093/nar/gkv1272

[9] R. D. Finn, J. Clements, W. Arndt, B. L. Miller, T. J. Wheeler, F. Schreiber, A. Bateman, and S. R. Eddy, “HMMER web server: 2015 update,” Nucleic Acids Research, vol. 43, no. W1, pp. W30–W38, May 2015. https://doi.org/10.1093/nar/gkv397

[10] D. R. Hoen, G. Hickey, G. Bourque, J. Casacuberta, R. Cordaux, et al., “A call for benchmarking transposable element annotation methods,” Mobile DNA, vol. 6, no. 1, Aug. 2015. https://doi.org/10.1186/s13100-015-0044-6

[11] T. J. Wheeler, J. Clements, and R. D. Finn, “Skylign: A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models,” BMC Bioinformatics, vol. 15, no. 1, Jan. 2014. https://doi.org/10.1186/1471-2105-15-7

[12] T. J. Wheeler and S. R. Eddy, “nhmmer: DNA homology search with profile HMMs,” Bioinformatics, vol. 29, no. 19, pp. 2487–2489, Jul. 2013. https://doi.org/10.1093/bioinformatics/btt403

[13] T. J. Wheeler, J. Clements, S. R. Eddy, R. Hubley, T. A. Jones, J. Jurka, A. F. A. Smit, and R. D. Finn, “Dfam: A database of repetitive DNA based on profile hidden Markov models,” Nucleic Acids Research, vol. 41, no. D1, pp. D70–D82, Nov. 2012. https://doi.org/10.1093/nar/gks1265

[14] D. F. DeBlasio, T. J. Wheeler, and J. D. Kececioglu, “Estimating the accuracy of multiple alignments and its use in parameter advising,” in Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2012, pp. 45–59. https://doi.org/10.1007/978-3-642-29627-7_5

[15] G. Tanifuji, N. T. Onodera, T. J. Wheeler, M. Dlutek, N. Donaher, and J. M. Archibald, “Complete nucleomorph genome sequence of the nonphotosynthetic alga Cryptomonas paramecium reveals a core nucleomorph gene set,” Genome Biology and Evolution, vol. 3, pp. 44–54, Dec. 2010. https://doi.org/10.1093/gbe/evq082

[16] J. Kececioglu, E. Kim, and T. Wheeler, “Aligning protein sequences with predicted secondary structure,” Journal of Computational Biology, vol. 17, no. 3, pp. 561–580, Mar. 2010. https://doi.org/10.1089/cmb.2009.0222

[17] T. J. Wheeler, “Efficient construction of accurate multiple alignments and large-scale phylogenies,” PhD thesis, The University of Arizona, 2009. files/dissertation_wheeler_final.pdf

[18] E. Kim, T. Wheeler, and J. Kececioglu, “Learning models for aligning protein sequences with predicted secondary structure,” in Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2009, pp. 512–531. https://doi.org/10.1007/978-3-642-02008-7_36

[19] T. J. Wheeler, “Large-scale neighbor-joining with NINJA,” in Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2009, pp. 375–389. files/NINJA2009.pdf

[20] T. J. Wheeler and J. D. Kececioglu, “Multiple alignment by aligning alignments,” Bioinformatics, vol. 23, no. 13, pp. i559–i568, Jul. 2007. https://doi.org/10.1093/bioinformatics/btm226

[21] J. M. Good, C. A. Hayden, and T. J. Wheeler, “Adaptive protein evolution and regulatory divergence in Drosophila,” Molecular Biology and Evolution, vol. 23, no. 6, pp. 1101–1103, Mar. 2006. https://doi.org/10.1093/molbev/msk002

[22] C. A. Hayden, T. J. Wheeler, and R. A. Jorgensen, “Evaluating and improving cDNA sequence quality with cQC,” Bioinformatics, vol. 21, no. 24, pp. 4414–4415, Oct. 2005. https://doi.org/10.1093/bioinformatics/bti709

[23] A. D. Cutter, J. M. Good, C. T. Pappas, M. A. Saunders, D. M. Starrett, and T. J. Wheeler, “Transposable element orientation bias in the Drosophila melanogaster genome,” Journal of Molecular Evolution, vol. 61, no. 6, pp. 733–741, Nov. 2005. https://doi.org/10.1007/s00239-004-0243-0

Software and Databases

ULTRA

A tool for locating and labeling tandemly repetitive sequence. Olson, D. and Wheeler, T.J. 2018

MIRAGE

A tool for splice-aware multiple sequence alignment. Nord, A. and Wheeler, T.J. 2018

Skylign Logo server

A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. Wheeler, T.J., Clements, J., Finn, R.D. 2013

HMMER webserver, and HMMER3.1

Biological sequence analysis using profile hidden Markov models. Eddy, S.R. and Wheeler, T.J. 2013

nhmmer (within HMMER3.1)

A DNA-DNA sequence homology search tool based on profile hidden Markov models, in the HMMER3 framework. Wheeler, T.J. and Eddy, S.R. 2012

Dfam

A database of repetitive DNA based on profile hidden Markov models. Wheeler, T.J., Clements, J., Eddy, S.R., Hubley, R., Jones, T.A., Jurka, J., Smit, A.F.A., Finn, R.D. 2012

Ninja

A Mesquite package for fast neighbor-joining phylogeny inference. Wheeler, T.J. and Maddison, D.R. 2010

NINJA

Software for large-scale neighbor-joining phylogeny inference. Wheeler, T.J. 2009

Opalescent

A Mesquite package for multiple sequence alignment. Wheeler, T.J. and Maddison, D.R. 2009

Opal

Software for multiple sequence alignment by optimally aligning alignments. Wheeler, T.J. and Kececioglu, J.D. 2006

Align

A Mesquite package for aligning sequence data. Maddison, D.R., Wheeler, T.J., and Maddison, W.P. 2006

AlignAlign

Software for optimally aligning alignments. Starrett, D.M., Wheeler, T.J., and Kececioglu, J.D. 2005

cQC - cDNA Quality Control

A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence. Hayden, C.A. and Wheeler, T.J. 2005

Funding

Machine learning approaches for integrating multi-omics data to expand microbiome annotation

DOE DE-SC0021216 (2020-2023)

(Joint with Jason McDermott @ Pacific Northwest National Laboratory)

Communities of microbes in soil are key contributors to the plant-soil dynamic that supports production of food and fuel crops, for example driving nitrogen fixation, drought resistance, and nutrient cycling. The composition and interactions of these communities are of great importance, but these are often difficult to fully characterize due to challenges with sample acquisition, data processing, and community complexity and diversity. The effort supported by this grant will improve understanding of soil microbial communities through a combination of improved engineering for prototyped sequence annotation software, novel approaches in Deep Learning sequence annotation, and a new Bayesian method for integrating data from multiple high-throughput omics sources (particularly genomics and metabolomics).

Machine learning approaches for improved accuracy and speed in sequence annotation

NIH 1R01GM132600 (2019-2023)

Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. We will develop Machine Learning approaches to improve both the accuracy and the speed of highly sensitive sequence alignment. To improve accuracy, we will develop methods based on both hidden Markov models and Artificial Neural Networks to reduce erroneous annotation caused by (1) the existence of low-complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. To improve speed, we will develop a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively slow sequence-alignment step.

Learning and Neural Coding of Social Expectations

NIH 1R15MH117611 (2019-2022)

(PI: Nathan Insel @ University of Montana - Psychology)

The goals of this project relate to social cognition in degus (highly social rodents). The Wheeler lab's role involves developing machine learning methods for tracking multiple animals in video and for classifying behavior in those videos.

Dfam: sustainable growth, curation support, and improved quality for mobile element annotation

NIH 1U24HG010136-01 (2018-2023)

(PI: Arian Smit, co-PI: Robert Hubley @ Institute for Systems Biology)

Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. The goal of the proposed effort is to develop the infrastructure of Dfam to expand to 1000s of genomes, and to establish a self-sustaining TE Data Commons dependent on limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expanding approaches to visualization of this complex data type, and improving the modeling of TE subfamilies.

Improved protein-DNA models for translated sequence search with profile HMMs

NIH 1R15HG009570-01 (2017-2020)

Fast and sensitive sequence database search is fundamental to modern molecular biology. This proposal describes a research plan to improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and ability to better model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.

The taxonomic breadth of sequenced datasets requires methods with the power to detect remote sequence similarity; raw data and sequencing errors demand models that recognize frameshifts and splice sites; and the massive scale of datasets demands that implementations be fast. My group will develop novel models for frameshifts and splice site detection in profile HMM homology search. Direct modeling of these features within search software effectively uses homology to guide ORF/gene prediction, which in turn leads to better homology detection. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Metagenomics Portal.

Past grants

Methods for fast bio-sequence comparison with profile hidden Markov models

P20GM103546 NIH CoBRE (2017-2020)

With the continued explosive increase in genomic and metagenomic sequencing, the community requires effective and increasingly scalable methods to more fully decode, organize, and exploit sequence data. Accurate and complete annotation of a genomic dataset, based on sequence homology, is a critical first step in understanding its content. This annotation often boils down to sequence database search – the act of searching in a large sequence dataset to find sequences that are similar to known elements.

We aim to develop methods that will substantially improve the speed of sequence comparison with profile hidden Markov models, meeting the need for methods that are fast enough to accommodate large-scale databases, while still powerful enough to detect remote sequence similarity. We will implement these methods in the HMMER codebase, focusing on three complementary target optimizations. Specifically, the aims are:

Reducing false sequence annotation due to alignment overextension and repetitive sequences

P20GM103546 NIH CoBRE (2016-2017) Pilot grant

Sequence comparison is fundamental to modern molecular biology. Much effort has been expended in the development of methods to make comparison faster and more sensitive. Though the risk of false annotation is understood, the extent and key causes have only been lightly addressed. Two primary sources of false annotation are (1) the overextension of alignments of true homologs into unrelated sequence, and (2) the existence of low complexity sequence, especially when the query and target share similar patterns of repetitive sequence, such as atgatgatgatgatg (‘atg’, repeated). In our experience, these issues together cause >2% of all annotation to be incorrect, even with current strategies for avoiding the resulting errors. Furthermore, these strategies are themselves responsible for some loss in sensitivity to remote homology. This study will lay the groundwork for addressing both sources of false annotation. Specifically:

