We're actively recruiting a new PhD student to join us in the
Wheeler Lab, working on a funded project to
develop Machine Learning techniques to improve tracking of multiple animals in video recordings.
We're also happy to chat with potential postdocs. Get in touch, and join us in
Computational Genomics at the University of Montana.
The Wheeler lab designs algorithms and statistical methods for
sequence analysis in computational genomics, and develops
high-quality implementations of those methods for incorporation
into widely used software pipelines and web services. We are
particularly focused on the annotation of biological sequences
and the accompanying problem of searching for similar
sequences within large-scale biological sequence databases.
Projects in our group range from statistical modeling of
biological sequence families, to text indexing and bounded search
algorithms, to low-level software optimization, to FPGAs, to
Deep Neural Networks, to Natural Language Processing, to web
services; from genomes, to proteins, to infectious disease, to
animal tracking and behavior in videos.
The lab is located in the Computer
Science Department at the University
of Montana. We have open positions for motivated postdocs and PhD students.
Please drop me a line if you'd like to discuss possibilities.
15 years of PhosphoSitePlus®: integrating post-translationally modified sites, disease variants and isoforms (2018)
Hornbeck, P.V., Kornhauser, J.M., Latham, V., Murray, B., Nandhikonda, V., Nord, A., Skrzypek, E., Wheeler, T., Zhang, B., Gnad, F.
Nucleic Acids Research, gky1159.
ULTRA: A Model Based Tool to Detect Tandem Repeats (2018)
Olson, D. and Wheeler, T.J.
Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 37-46.
Splice-Aware Multiple Sequence Alignment of Protein Isoforms [MIRAGE] (2018)
Nord, A., Hornbeck, P., Carey, K., and Wheeler, T.J.
Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 200-210.
Integration of protein phosphorylation, acetylation, and methylation data sets to outline lung cancer signaling networks (2018)
Grimes, M., Hall, B., Foltz, L., Levy, T., Rikova, K., Gaiser, J., Cook, W.,
Smirnova, E., Wheeler, T., Clark, N.R., Lachmann, A., Zhang, B., Hornbeck, P., Maayan, A., and Comb, M.
Science Signaling, 11:eaaq1087.
Hubley, R., Finn, R.D., Clements, J., Eddy, S.R., Jones, T.A.,
Bao, W., Smit, A.F.A., and Wheeler, T.J.
Nucleic Acids Research, 46:gkv1272.
A call for benchmarking transposable element annotation methods. (2015)
Hoen, D.R., Hickey, G., Bourque, G., Casacuberta, J., Cordaux, R., Feschotte, C.,
Fiston-Lavier, A-S., Hua-Van, A., Hubley, R., Kapusta, A., Lerat, E., Maumus, F.,
Pollock, D.D., Quesneville, H., Smit, A., Wheeler, T.J., Bureau, T.E., Blanchette, M.
Mobile DNA, 6(1):1-9.
Estimating the accuracy of multiple alignments and its use in parameter advising. (2012)
DeBlasio, D., Wheeler, T., and Kececioglu, J.
Proceedings of the 16th Conference on Research in Computational Molecular Biology (RECOMB),
Springer-Verlag Lecture Notes in Bioinformatics, 7262: 45-59.
Complete nucleomorph genome sequence of the non-photosynthetic alga Cryptomonas paramecium
reveals a core nucleomorph gene set. (2010)
Tanifuji, G., Onodera, N.T., Wheeler, T.J., Dlutek, M., Donaher, N., and Archibald, J.M.
Genome Biology and Evolution, 3: 44-54.
Aligning protein sequences with predicted secondary structure. (2010)
Kececioglu, J., Kim, E., and Wheeler, T.
Journal of Computational Biology, 17(3): 561-580.
Proceedings of the 9th Workshop on Algorithms in Bioinformatics (WABI), 375-389.
Learning models for aligning protein sequence with predicted secondary structure. (2009)
Kim, E., Wheeler, T.J., and Kececioglu, J.D.
Proceedings of the 13th Conference on Research in Computational
Molecular Biology (RECOMB), Springer-Verlag Lecture Notes in Bioinformatics, 5541: 586-605.
Multiple alignment by aligning alignments. (2007)
Wheeler, T.J. and Kececioglu, J.D.
Proceedings of the 15th ISCB Conference on Intelligent Systems for Molecular Biology,
Bioinformatics, 23: i559-i568.
Adaptive protein evolution and regulatory divergence in Drosophila. (2006)
Good, J.M., Hayden, C.A., and Wheeler, T.J.
Molecular Biology and Evolution, 23(6): 1101-1103.
Evaluating and improving cDNA sequence quality with cQC. (2005)
A tool for resolving putative sequencing errors in single-pass cDNA, based on genomic sequence.
Hayden, C.A. and Wheeler, T.J.
Machine learning approaches for improved accuracy and speed in sequence annotation (NIH 1R01GM132600)
Alignment of biological sequences is a key step in understanding their evolution, function,
and patterns of activity. We will develop Machine Learning approaches to improve both
accuracy and speed of highly sensitive sequence alignment. To improve accuracy, we will
develop methods based on both hidden Markov models and Artificial Neural Networks to reduce
erroneous annotation caused by (1) the existence of low complexity and repetitive sequence
and (2) the overextension of alignments of true homologs into unrelated sequence. We also
address the issue of annotation speed, with development of a custom Deep Learning architecture
designed to very quickly filter away large portions of candidate sequence comparisons prior
to the relatively slow sequence-alignment step.
Learning and Neural Coding of Social Expectations (PI: Nathan Insel)
Social handicaps associated with autism spectrum disorder and other mental health conditions could result from impairments in learning what to expect from other individuals. The goal of this research is to test two predictions of the hypothesis that social cognition depends on the medial prefrontal cortex (mPFC) due to its ability to learn expectation sets and/or associate these with a strategy for interacting with specific individuals. The first prediction is that rodent dyads develop specific patterns of interaction as the animals gain experience with one another; the second is that neuron population activity in the mPFC discriminates more strongly between familiar conspecifics than between novel ones.
Dfam: sustainable growth, curation support, and improved quality for mobile element annotation
Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. The goal of the proposed effort is to develop the infrastructure of Dfam to expand to 1000s of genomes, and to establish a self-sustaining TE Data Commons dependent on limited centralized curation. We will also improve the quality of repeat annotation by developing methods for more reliable alignment adjudication, expanding approaches to visualization of this complex data type, and improving the modeling of TE subfamilies.
Methods for fast bio-sequence comparison with profile hidden Markov models
[P20GM103546 NIH CoBRE]
With the continued explosive increase in genomic and metagenomic sequencing, the community requires effective and increasingly scalable methods to more fully decode, organize, and exploit sequence data. Accurate and complete annotation of a genomic dataset, based on sequence homology, is a critical first step in understanding its content. This annotation often boils down to sequence database search – the act of searching in a large sequence dataset to find sequences that are similar to known elements.
The challenges posed by increased sequencing activity are particularly acute in metagenomic datasets, which are both highly diverse and very large. The high level of diversity means that accurate annotation requires extremely sensitive sequence analysis methods, while the scale necessitates ever-increasing software speed. This proposal describes a plan to develop methods that will substantially improve the speed of sequence comparison with profile hidden Markov models, meeting the need for methods that are fast enough to accommodate large-scale databases, while still powerful enough to detect remote sequence similarity. We will implement these methods in the HMMER codebase, focusing on three complementary target optimizations. Specifically, the aims are:
- Index-based acceleration of the key filter stage of HMMER.
- Sparse completion of Forward/Backward Dynamic Programming matrix.
- Acceleration with FPGA configurable hardware.
All of these aims emphasize speed, with the approaches hedging and complementing each other. Overall, we expect a speed gain of 10x or better, with little-to-no loss in sensitivity, even for very sensitive profile HMM searches. This improvement to annotation speed will fundamentally alter the way in which large-scale sequence datasets are annotated, resulting in greatly improved understanding of genomic datasets.
Improved protein-DNA models for translated sequence search with profile HMMs
Fast and sensitive sequence database search is fundamental to modern molecular biology. This proposal describes a research plan to improve the accuracy of annotation of protein-coding content in sequenced genomes and metagenomic datasets. The research builds on established sequence database search software that employs probabilistic models to increase sensitivity through greater statistical power and ability to better model family complexity. The probabilistic models are called profile hidden Markov models (profile HMMs), and the software is HMMER.
The taxonomic breadth of sequenced datasets requires methods with the power to detect remote sequence similarity; raw data and sequencing errors demand models that recognize frameshifts and splice sites; and the massive scale of datasets demands that implementations be fast. My group will develop novel models for frameshift and splice site detection in profile HMM homology search. Direct modeling of these features within search software effectively uses homology to guide ORF/gene prediction, which in turn leads to better homology detection. Through a combination of new algorithms and application of existing approaches, these models will be fast enough to use for large-scale annotation, such as in the EMBL European Bioinformatics Metagenomics Portal.
The heights by great men reached and kept
were not attained by sudden flight.
But they, while their companions slept,
were toiling upward in the night.
-- Henry Wadsworth Longfellow