Wheeler Lab

thanks to Robyn Berg for this image of the 4th of July from the M

Computational Genomics at the University of Montana

The Wheeler lab designs algorithms and statistical methods for sequence analysis in computational genomics, and develops high-quality implementations of those methods for incorporation into widely used software pipelines and web services. We are particularly focused on the alignment of biological sequences and the accompanying problem of searching for similar sequences within large-scale biological sequence databases.

We fixate on speed (developing algorithmic and engineering techniques to accelerate sequence alignment), sensitivity (developing models to improve recognition of similarity between highly divergent sequences, even in the face of sequencing error), and reliability (developing approaches that reduce the propensity of highly sensitive methods to make erroneous claims, especially in the context of confounding factors). This work largely relates to profile hidden Markov models. We apply our methods to the analysis and understanding of interesting biological phenomena, particularly transposable elements, conserved non-coding regions, and surprising features of genomes such as dual-coding exons.

The lab has open positions for motivated postdocs and PhD students. Please drop me a line if you'd like to discuss possibilities.

People living in a pretty great place

Check out our vibrant group of Computer Scientists and Biologists working together to do great things.

We're located in beautiful Missoula, Montana

Courses

CSCI 232 (Data Structures and Algorithms)
Fall 2017, 2018
CSCI 332 (Design and Analysis of Algorithms)
Spring 2016, 2017
CSCI 480 / CSCI 580 (Applied Parallel Computing)
Spring 2016, 2017, 2019
CSCI 451 / CSCI 558 (Computational Biology / Bioinformatics)
Fall 2015, 2016, 2017
(Contact me if you wish to see material from past classes.)

Papers

ULTRA: A Model Based Tool to Detect Tandem Repeats (2018) [link] [software]
Olson, D. and Wheeler, T.J. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 37-46.
(doi:10.1145/3233547.3233604)
Supplementary material
Splice-Aware Multiple Sequence Alignment of Protein Isoforms [MIRAGE] (2018) [link] [software]
Nord, A., Hornbeck, P., Carey, K., and Wheeler, T.J. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 200-210.
(doi:10.1145/3233547.3233592)
Supplementary material
Integration of protein phosphorylation, acetylation, and methylation data sets to outline lung cancer signaling networks (2018) [link] [web service]
Grimes, M., Hall, B., Foltz, L., Levy, T., Rikova, K., Gaiser, J., Cook, W., Smirnova, E., Wheeler, T., Clark, N.R., Lachmann, A., Zhang, B., Hornbeck, P., Maayan, A., and Comb, M. Science Signaling, 11:eaaq1087.
(doi:10.1126/scisignal.aaq1087)
The Dfam Database of Repetitive DNA Families (2016) [link] [web service]
Hubley, R., Finn, R.D., Clements, J., Eddy, S.R., Jones, T.A., Bao, W., Smit, A.F.A., and Wheeler, T.J. Nucleic Acids Research, 44(D1):D81-D89.
(doi:10.1093/nar/gkv1272)
A call for benchmarking transposable element annotation methods. (2015) [link]
Hoen, D.R., Hickey, G., Bourque, G., Casacuberta, J., Cordaux, R., Feschotte, C., Fiston-Lavier, A-S., Hua-Van, A., Hubley, R., Kapusta, A., Lerat, E., Maumus, F., Pollock, D.D., Quesneville, H., Smit, A., Wheeler, T.J., Bureau, T.E., Blanchette, M. Mobile DNA, 6(1):1-9.
(doi:10.1186/s13100-015-0044-6)
HMMER web server: 2015 update. (2015) [link] [web service]
Finn, R.D., Clements, J., Arndt, W., Miller, B.L., Wheeler, T.J., Schreiber, F., Bateman, A., and Eddy, S.R. Nucleic Acids Research 43(W1):W30-W38.
(doi:10.1093/nar/gkv397)
Skylign: a tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. (2014) [link] [web service]
Wheeler, T.J., Clements, J., and Finn, R.D. BMC Bioinformatics, 15:7.
(doi:10.1186/1471-2105-15-7)
Distinguished as a “Highly Accessed Article”.
nhmmer: DNA homology search with profile HMMs. (2013) [link] [software]
Wheeler, T.J. and Eddy, S.R. Bioinformatics, 29(19):2487–2489.
(doi: 10.1093/bioinformatics/btt403)
In the top 20 most-read Bioinformatics articles during September 2013.
Supplementary material
Dfam: a database of repetitive DNA based on profile hidden Markov models. (2013) [link] [web service]
Wheeler, T.J., Clements, J., Eddy, S.R., Hubley, R., Jones, T.A., Jurka, J., Smit, A.F.A., and Finn, R.D. Nucleic Acids Research 41:D70–D82.
(doi: 10.1093/nar/gks1265)
Selected as an NAR Featured Article.
Supplementary material
Estimating the accuracy of multiple alignments and its use in parameter advising. (2012) [link]
DeBlasio, D., Wheeler, T., and Kececioglu, J. Proceedings of the 16th Conference on Research in Computational Molecular Biology (RECOMB), Springer-Verlag Lecture Notes in Bioinformatics, 7262: 45-59.
(doi: 10.1007/978-3-642-29627-7_5)
Complete nucleomorph genome sequence of the non-photosynthetic alga Cryptomonas paramecium reveals a core nucleomorph gene set. (2010) [link]
Tanifuji, G., Onodera, N.T., Wheeler, T.J., Dlutek, M., Donaher, N., and Archibald, J.M. Genome Biology and Evolution, 3: 44-54.
(doi: 10.1093/gbe/evq082)
Aligning protein sequences with predicted secondary structure. (2010) [link]
Kececioglu, J., Kim, E., and Wheeler, T. Journal of Computational Biology, 17(3): 561-580.
(doi:10.1089/cmb.2009.0222)
Selected as a "recommended read" for Faculty of 1000 Biology.
Efficient construction of accurate multiple alignments and large-scale phylogenies. (2009) [link]
Wheeler, T.J.
Ph.D. dissertation
Department of Computer Science, University of Arizona, Tucson, Arizona.
Large-scale neighbor-joining with NINJA. (2009) [link, preprint] [software]
Wheeler, T.J. Proceedings of the 9th Workshop on Algorithms in Bioinformatics (WABI), 375-389.
(doi: 10.1007/978-3-642-04241-6_31)
Learning models for aligning protein sequence with predicted secondary structure. (2009) [link]
Kim, E., Wheeler, T.J., and Kececioglu, J.D. Proceedings of the 13th Conference on Research in Computational Molecular Biology (RECOMB), Springer-Verlag Lecture Notes in Bioinformatics, 5541: 586-605.
(doi: 10.1007/978-3-642-02008-7_36)
Multiple alignment by aligning alignments. (2007) [link] [software]
Wheeler, T.J. and Kececioglu, J.D. Proceedings of the 15th ISCB Conference on Intelligent Systems for Molecular Biology, Bioinformatics, 23: i559-i568.
(doi: 10.1093/bioinformatics/btm226)
Adaptive protein evolution and regulatory divergence in Drosophila. (2006) [link]
Good, J.M., Hayden, C.A., and Wheeler, T.J. Molecular Biology and Evolution, 23(6): 1101-1103.
(doi: 10.1093/molbev/msk002)
Evaluating and improving cDNA sequence quality with cQC. (2005) [link]
Hayden, C.A., Wheeler, T.J., and Jorgensen, R.A. Bioinformatics, 21(24): 4414-4415.
(doi: 10.1093/bioinformatics/bti709)
Transposable element orientation bias in the Drosophila melanogaster genome. (2005) [link]
Cutter, A.D., Good, J.M., Pappas, C.T., Saunders, M.A., Starrett, D.M., and Wheeler, T.J. Journal of Molecular Evolution, 61(6): 733-741.
(doi: 10.1007/s00239-004-0243-0)

Software and databases

ULTRA [link]
A tool for locating and labeling tandemly repetitive sequence.
Olson, D. and Wheeler, T.J.
2018
MIRAGE [link]
A tool for splice-aware multiple sequence alignment.
Nord, A. and Wheeler, T.J.
2018
Skylign Logo server [link]
A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models.
Wheeler, T.J., Clements, J., Finn, R.D.
2013
HMMER webserver, and HMMER3.1 [link]
Biological sequence analysis using profile hidden Markov models.
Eddy, S.R. and Wheeler, T.J.
2013
nhmmer [link] (within HMMER3.1)
A DNA-DNA sequence homology search tool based on profile hidden Markov models, in the HMMER3 framework.
Wheeler, T.J. and Eddy, S.R.
2012
Dfam [link]
A Database of Repetitive DNA Based on Profile Hidden Markov Models.
Wheeler, T.J., Clements, J., Eddy, S.R., Hubley, R., Jones, T.A., Jurka, J., Smit, A.F.A., Finn, R.D.
2012
Ninja [link]
A Mesquite package for fast neighbor-joining phylogeny inference.
Wheeler, T.J. and Maddison, D.R.
2010
NINJA [link]
Software for large-scale neighbor-joining phylogeny inference.
Wheeler, T.J.
2009
Opalescent [link]
A Mesquite package for multiple sequence alignment.
Wheeler, T.J. and Maddison, D.R.
2009
Opal [link]
Software for multiple sequence alignment by optimally aligning alignments.
Wheeler, T.J. and Kececioglu, J.D.
2006
Align [link]
A Mesquite package for aligning sequence data.
Maddison, D.R., Wheeler, T.J., and Maddison, W.P.
2006
AlignAlign [link]
Software for optimally aligning alignments.
Starrett, D.M., Wheeler, T.J., and Kececioglu, J.D.
2005
cQC - cDNA Quality Control [link]
A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence.
Hayden, C.A. and Wheeler, T.J.
2005

Funding

Dfam: sustainable growth, curation support, and improved quality for mobile element annotation [NIH 1U24HG010136-01]
2018-2023
Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. Currently, the data used for this annotation is dominated by a single database, Repbase, whose restrictive license impedes coalescence on a central standardized resource. We have developed an innovative open access database (Dfam) that yields substantial gains in TE annotation quality and provides numerous novel features in support of community TE curation. We will grow and improve Dfam to create a sustainable, standardized, and open system for TE family data. To achieve this, we will develop Dfam’s infrastructure to scale to thousands of genomes, develop improved methods for computing and representing the complex relationships between TE instances, and develop a framework for aiding curators in developing and sharing their TE libraries.
Methods for fast bio-sequence comparison with profile hidden Markov models [P20GM103546 NIH CoBRE]
2018-2021
With the continued explosive increase in genomic and metagenomic sequencing, the community requires effective and increasingly scalable methods to more fully decode, organize, and exploit sequence data. Accurate and complete annotation of a genomic dataset, based on sequence homology, is a critical first step in understanding its content. This annotation often boils down to sequence database search – the act of searching in a large sequence dataset to find sequences that are similar to known elements.

The challenges posed by increased sequencing activity are particularly prevalent in metagenomics datasets, which are both highly diverse and very large. The high level of diversity means that accurate annotation requires extremely sensitive sequence analysis methods, while the scale necessitates ever increasing software speed. This proposal describes a plan to develop methods that will substantially improve the speed of sequence comparison with profile hidden Markov models, meeting the need for methods that are fast enough to accommodate large-scale databases, while still powerful enough to detect remote sequence similarity. We will implement these methods in the HMMER codebase, focusing on three complementary target optimizations. Specifically, the aims are:

- Index-based acceleration of the key filter stage of HMMER.
- Sparse completion of the Forward/Backward dynamic programming matrix.
- Acceleration with FPGA configurable hardware.

All of these aims emphasize speed, with the approaches hedging and complementing each other. Overall, we expect a speed gain of 10x or better, with little to no loss in sensitivity, even for very sensitive profile HMM searches. This improvement to annotation speed will fundamentally alter the way in which large-scale sequence datasets are annotated, resulting in greatly improved understanding of genomic datasets.
Improved protein-DNA models for translated sequence search with profile HMMs [NIH 1R15HG009570-01]
2017-2020
Fast and sensitive sequence database search is fundamental to modern molecular biology. This proposal describes a research plan to improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and ability to better model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.

The taxonomic breadth of sequenced datasets requires methods with the power to detect remote sequence similarity; raw data and sequencing errors demand models that recognize frameshifts and splice sites; and the massive scale of datasets demands that implementations be fast. My group will develop novel models for frameshifts and splice site detection in profile HMM homology search. Direct modeling of these features within search software effectively uses homology to guide ORF/gene prediction, which in turn leads to better homology detection. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Metagenomics Portal.

Lab meetings

view