Wheeler Lab

We're hiring!

We're actively recruiting a new PhD student to join us in the Wheeler Lab, working on a funded project to develop Machine Learning techniques to improve tracking of multiple animals in video recordings. We're also happy to chat with potential postdocs. Get in touch, and join us in stunning Missoula!

Computational Genomics at the University of Montana

The Wheeler lab designs algorithms and statistical methods for sequence analysis in computational genomics, and develops high-quality implementations of those methods for incorporation into widely used software pipelines and web services. We are particularly focused on the annotation of biological sequences and the accompanying problem of searching for similar sequences within large-scale biological sequence databases.

Projects in our group range from statistical modeling of biological sequence families, to text indexing and bounded search algorithms, to low-level software optimization, to FPGAs, to Deep Neural Networks, to Natural Language Processing, to web services; from genomes, to proteins, to infectious disease, to animal tracking and behavior in videos.

The lab is located in the Computer Science Department at the University of Montana. We have open positions for motivated postdocs and PhD students. Please drop me a line if you'd like to discuss possibilities.

People living in a pretty great place

Check out our vibrant group of Computer Scientists and Biologists working together to do great things.

We're located in beautiful Missoula, Montana

Courses

CSCI 232 (Data Structures and Algorithms)
Fall 2017, 2018
CSCI 332 (Design and Analysis of Algorithms)
Spring 2016, 2017
CSCI 480 / CSCI 580 (Applied Parallel Computing)
Spring 2016, 2017, 2019
CSCI 451 / CSCI 558 (Computational Biology / Bioinformatics)
Fall 2015, 2016, 2017
(Contact me if you wish to see material from past classes.)

Papers

15 years of PhosphoSitePlus®: integrating post-translationally modified sites, disease variants and isoforms (2018) [link] [web service]
Hornbeck, P.V., Kornhauser, J.M., Latham, V., Murray, B., Nandhikonda, V., Nord, A., Skrzypek, E., Wheeler, T., Zhang, B., Gnad, F. Nucleic Acids Research, gky1159.
(https://doi.org/10.1093/nar/gky1159)
ULTRA: A Model Based Tool to Detect Tandem Repeats (2018) [link] [software]
Olson, D. and Wheeler, T.J. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 37-46.
(doi:10.1145/3233547.3233604)
Supplementary material
Splice-Aware Multiple Sequence Alignment of Protein Isoforms [MIRAGE] (2018) [link] [software]
Nord, A., Hornbeck, P., Carey, K., and Wheeler, T.J. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 200-210.
(doi:10.1145/3233547.3233592)
Supplementary material
Integration of protein phosphorylation, acetylation, and methylation data sets to outline lung cancer signaling networks (2018) [link] [web service]
Grimes, M., Hall, B., Foltz, L., Levy, T., Rikova, K., Gaiser, J., Cook, W., Smirnova, E., Wheeler, T., Clark, N.R., Lachmann, A., Zhang, B., Hornbeck, P., Maayan, A., and Comb, M. Science Signaling, 11:eaaq1087.
(doi:10.1126/scisignal.aaq1087)
The Dfam Database of Repetitive DNA Families (2016) [link] [web service]
Hubley, R., Finn, R.D., Clements, J., Eddy, S.R., Jones, T.A., Bao, W., Smit, A.F.A., and Wheeler, T.J. Nucleic Acids Research, 44(D1):D81-D89.
(doi:10.1093/nar/gkv1272)
A call for benchmarking transposable element annotation methods. (2015) [link]
Hoen, D.R., Hickey, G., Bourque, G., Casacuberta, J., Cordaux, R., Feschotte, C., Fiston-Lavier, A-S., Hua-Van, A., Hubley, R., Kapusta, A., Lerat, E., Maumus, F., Pollock, D.D., Quesneville, H., Smit, A., Wheeler, T.J., Bureau, T.E., Blanchette, M. Mobile DNA, 6(1):1-9.
(doi:10.1186/s13100-015-0044-6)
HMMER web server: 2015 update. (2015) [link] [web service]
Finn, R.D., Clements, J., Arndt, W., Miller, B.L., Wheeler, T.J., Schreiber, F., Bateman, A., and Eddy, S.R. Nucleic Acids Research 43(W1):W30-W38.
(doi:10.1093/nar/gkv397)
Skylign: a tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. (2014) [link] [web service]
Wheeler, T.J., Clements, J., and Finn, R.D. BMC Bioinformatics, 15:7.
(doi:10.1186/1471-2105-15-7)
Distinguished as a “Highly Accessed Article”.
nhmmer: DNA homology search with profile HMMs. (2013) [link] [software]
Wheeler, T.J., Eddy, S.R. Bioinformatics, 29(19):2487–2489.
(doi: 10.1093/bioinformatics/btt403)
In the top 20 most-read Bioinformatics articles during September 2013.
Supplementary material
Dfam: a database of repetitive DNA based on profile hidden Markov models. (2013) [link] [web service]
Wheeler, T.J., Clements, J., Eddy, S.R., Hubley, R., Jones, T.A., Jurka, J., Smit, A.F.A., and Finn, R.D. Nucleic Acids Research 41:D70–D82.
(doi: 10.1093/nar/gks1265)
Selected as an NAR Featured Article.
Supplementary material
Estimating the accuracy of multiple alignments and its use in parameter advising. (2012) [link]
DeBlasio, D., Wheeler, T., and Kececioglu, J. Proceedings of the 16th Conference on Research in Computational Molecular Biology (RECOMB), Springer-Verlag Lecture Notes in Bioinformatics, 7262: 45-59.
(doi: 10.1007/978-3-642-29627-7_5)
Complete nucleomorph genome sequence of the non-photosynthetic alga Cryptomonas paramecium reveals a core nucleomorph gene set. (2010) [link]
Tanifuji, G., Onodera, N.T., Wheeler, T.J., Dlutek, M., Donaher, N., and Archibald, J.M. Genome Biology and Evolution, 3: 44-54.
(doi: 10.1093/gbe/evq082)
Aligning protein sequences with predicted secondary structure. (2010) [link]
Kececioglu, J., Kim, E., and Wheeler, T. Journal of Computational Biology, 17(3): 561-580.
(doi:10.1089/cmb.2009.0222)
Selected as a "recommended read" for Faculty of 1000 Biology.
Efficient construction of accurate multiple alignments and large-scale phylogenies. (2009) [link]
Wheeler, T.J.
Ph.D. dissertation
Department of Computer Science, University of Arizona, Tucson, Arizona.
Large-scale neighbor-joining with NINJA. (2009) [link, preprint] [software]
Wheeler, T.J. Proceedings of the 9th Workshop on Algorithms in Bioinformatics (WABI), 375-389.
(doi: 10.1007/978-3-642-04241-6_31)
Learning models for aligning protein sequence with predicted secondary structure. (2009) [link]
Kim, E., Wheeler, T.J., and Kececioglu, J.D. Proceedings of the 13th Conference on Research in Computational Molecular Biology (RECOMB), Springer-Verlag Lecture Notes in Bioinformatics, 5541: 586-605.
(doi: 10.1007/978-3-642-02008-7_36)
Multiple alignment by aligning alignments. (2007) [link] [software]
Wheeler, T.J. and Kececioglu, J.D. Proceedings of the 15th ISCB Conference on Intelligent Systems for Molecular Biology, Bioinformatics, 23: i559-i568.
(doi: 10.1093/bioinformatics/btm226)
Adaptive protein evolution and regulatory divergence in Drosophila. (2006) [link]
Good, J.M., Hayden, C.A., and Wheeler, T.J. Molecular Biology and Evolution, 23(6): 1101-1103.
(doi: 10.1093/molbev/msk002)
Evaluating and improving cDNA sequence quality with cQC. (2005) [link]
Hayden, C.A., Wheeler, T.J., and Jorgensen, R.A. Bioinformatics, 21(24): 4414-4415.
(doi: 10.1093/bioinformatics/bti709)
Transposable element orientation bias in the Drosophila melanogaster genome. (2005) [link]
Cutter, A.D., Good, J.M., Pappas, C.T., Saunders, M.A., Starrett, D.M., and Wheeler, T.J. Journal of Molecular Evolution, 61(6): 733-741.
(doi: 10.1007/s00239-004-0243-0)

Software and databases

[github repo]
ULTRA [link]
A tool for locating and labeling tandemly-repetitive sequence.
Olson, D. and Wheeler, T.J.
2018
MIRAGE [link]
A tool for splice-aware multiple sequence alignment.
Nord, A. and Wheeler, T.J.
2018
Skylign Logo server [link]
A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models.
Wheeler, T.J., Clements, J., Finn, R.D.
2013
HMMER webserver, and HMMER3.1 [link]
Biological sequence analysis using profile hidden Markov models.
Eddy, S.R. and Wheeler, T.J.
2013
nhmmer [link] (within HMMER3.1)
A DNA-DNA sequence homology search tool based on profile hidden Markov models, in the HMMER3 framework.
Wheeler, T.J. and Eddy, S.R.
2012
Dfam [link]
A Database of Repetitive DNA Based on Profile Hidden Markov Models.
Wheeler, T.J., Clements, J., Eddy, S.R., Hubley, R., Jones, T.A., Jurka, J., Smit, A.F.A., Finn, R.D.
2012
Ninja [link]
A Mesquite package for fast neighbor-joining phylogeny inference.
Wheeler, T.J. and Maddison, D.R.
2010
NINJA [link]
Software for large-scale neighbor-joining phylogeny inference.
Wheeler, T.J.
2009
Opalescent [link]
A Mesquite package for multiple sequence alignment.
Wheeler, T.J. and Maddison, D.R.
2009
Opal [link]
Software for multiple sequence alignment by optimally aligning alignments.
Wheeler, T.J. and Kececioglu, J.D.
2006
Align [link]
A Mesquite package for aligning sequence data.
Maddison, D.R., Wheeler, T.J., and Maddison, W.P.
2006
AlignAlign [link]
Software for optimally aligning alignments.
Starrett, D.M., Wheeler, T.J., and Kececioglu, J.D.
2005
cQC - cDNA Quality Control [link]
A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence.
Hayden, C.A. and Wheeler, T.J.
2005

Funding

Machine learning approaches for improved accuracy and speed in sequence annotation [NIH 1R01GM132600]
2019-2023
Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. We will develop Machine Learning approaches to improve both accuracy and speed of highly sensitive sequence alignment. To improve accuracy, we will develop methods based on both hidden Markov models and Artificial Neural Networks to reduce erroneous annotation caused by (1) the existence of low complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We also address the issue of annotation speed, with development of a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively slow sequence-alignment step.
Learning and Neural Coding of Social Expectations (PI: Nathan Insel) [NIH 1R15MH117611]
2019-2022
Social handicaps associated with autism spectrum disorder and other mental health conditions could result from impairments in learning what to expect from other individuals. The goals of this research are to test two predictions of the hypothesis that social cognition depends on the medial prefrontal cortex (mPFC) due to its ability to learn expectation sets and/or associate these with a strategy for interacting with specific individuals. The first prediction is that rodent dyads develop specific patterns of interactions as the animals gain experience with one another; the second is that neuron population activity in the mPFC discriminates more strongly between familiar compared with novel conspecifics.
Dfam: sustainable growth, curation support, and improved quality for mobile element annotation [NIH 1U24HG010136-01]
2018-2023
Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. The goal of the proposed effort is to develop the infrastructure of Dfam to expand to 1000s of genomes, and to establish a self-sustaining TE Data Commons dependent on limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expanding approaches to visualization of this complex data type, and improving the modeling of TE subfamilies.
Methods for fast bio-sequence comparison with profile hidden Markov models [P20GM103546 NIH CoBRE]
2018-2021
With the continued explosive increase in genomic and metagenomic sequencing, the community requires effective and increasingly scalable methods to more fully decode, organize, and exploit sequence data. Accurate and complete annotation of a genomic dataset, based on sequence homology, is a critical first step in understanding its content. This annotation often boils down to sequence database search – the act of searching in a large sequence dataset to find sequences that are similar to known elements.

The challenges posed by increased sequencing activity are particularly prevalent in metagenomics datasets, which are both highly diverse and very large. The high level of diversity means that accurate annotation requires extremely sensitive sequence analysis methods, while the scale necessitates ever increasing software speed. This proposal describes a plan to develop methods that will substantially improve the speed of sequence comparison with profile hidden Markov models, meeting the need for methods that are fast enough to accommodate large-scale databases, while still powerful enough to detect remote sequence similarity. We will implement these methods in the HMMER codebase, focusing on three complementary target optimizations. Specifically, the aims are:

- Index-based acceleration of the key filter stage of HMMER.
- Sparse completion of Forward/Backward Dynamic Programming matrix.
- Acceleration with FPGA configurable hardware.

All of these aims emphasize speed, with the approaches hedging and complementing each other. Overall, we expect 10x or better speed gain, with little-to-no loss in sensitivity even for very sensitive profile HMM searches. This improvement to annotation speed will fundamentally alter the way in which large-scale sequence datasets are annotated, resulting in greatly improved understanding of genomic datasets.
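As a rough illustration of what an index-based filter does in general, here is a minimal Python sketch. It is not the HMMER implementation or the method proposed in this grant: it swaps in plain exact-match k-mer seeding, which is about the simplest index-based filter there is. The function names, the toy database, and the `min_seeds` threshold are all hypothetical, chosen only for this example.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map each length-k substring to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def candidate_targets(query, index, k, min_seeds=2):
    """Return ids of database sequences with at least min_seeds k-mer hits to the query."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        # .get avoids adding empty entries to the index for unseen k-mers
        for seq_id, _offset in index.get(query[pos:pos + k], ()):
            hits[seq_id] += 1
    return {seq_id for seq_id, count in hits.items() if count >= min_seeds}

# Toy database (hypothetical sequences, for illustration only).
database = {
    "seqA": "ACGTACGTGGA",
    "seqB": "TTTTTTTTTTT",
    "seqC": "GGACGTACCCA",
}
index = build_kmer_index(database, k=4)
survivors = candidate_targets("ACGTAC", index, k=4)
print(sorted(survivors))
```

Only the sequences that survive the filter would be passed on to a slower, more sensitive alignment stage; the index is built once and reused across queries, which is where the speed gain in this kind of scheme comes from.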
Improved protein-DNA models for translated sequence search with profile HMMs [NIH 1R15HG009570-01]
2017-2020
Fast and sensitive sequence database search is fundamental to modern molecular biology. This proposal describes a research plan to improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and ability to better model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.

The taxonomic breadth of sequenced datasets requires methods with the power to detect remote sequence similarity; raw data and sequencing errors demand models that recognize frameshifts and splice sites; and the massive scale of datasets demands that implementations be fast. My group will develop novel models for frameshifts and splice site detection in profile HMM homology search. Direct modeling of these features within search software effectively uses homology to guide ORF/gene prediction, which in turn leads to better homology detection. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Metagenomics Portal.

The heights by great men reached and kept
were not attained by sudden flight.
But they, while their companions slept,
were toiling upward in the night.
-- Henry Wadsworth Longfellow