Home |
People |
Contact

Computational
Genomics, Drug Discovery, and more
The Wheeler lab designs algorithms and statistical methods for
problems motivated by biological data. Projects in our group range from
statistical modeling, to optimized data structures and algorithms, to
low-level software optimization, to AI (Deep Learning, etc.), to web
services.
We are particularly focused on the annotation of biological sequences
and the accompanying problem of searching for similar sequences within
large-scale biological sequence databases (in the context of infectious
disease, soil microbiomes, and transposable elements). Our work also
touches heavily on drug discovery (AI methods and development of
supporting data resources) and neuroscience (animal tracking and
behavior).
We are located in the College
of Pharmacy at the University of
Arizona, in Tucson, which was recently named one of the best
places in the world to travel (one of only 3 places in the US!).
Awesome mountains, great weather, terrific research environment.
Publications
[1]
C.
R. Goubert, A. J. Nord, K. Sawyer, K. Youens-Clark, G. Lesica, and T. J.
Wheeler,
“Alternatively spliced dual-coding regions contribute to
the human gene regulatory program,” bioRxiv, Feb. 2026,
paper(doi):
10.64898/2026.02.24.707805.
[2]
M.
Kokot, A. Roy, T. J. Wheeler, and S. Deorowicz,
“MDCompress:
Better, faster compression of molecular dynamics simulation
trajectories,” Bioinformatics, vol. 42, no. 4, Apr.
2026, paper(doi):
10.1093/bioinformatics/btag176.
[3]
J.
W. Roddy, D. H. Rich, and T. J. Wheeler,
“nail: Software for high-speed, high-sensitivity
protein sequence annotation,” bioRxiv, 2026, paper(doi):
10.1101/2024.01.27.577580.
[4]
D.
Pettinga, C. Fonseca-García, G. Krause, H. Ploemacher, T. Wheeler, C. S.
Clendinen, P. Handakumbura, R. Egbert, and D. Coleman-Derr,
“Rational reduction of a sorghum SynCom that preserves growth
promotion reveals flavonoid-mediated plant microbe interactions,”
bioRxiv, Mar. 2026, paper(doi):
10.64898/2026.03.20.709941.
[5]
A.
Roy, E. Ward, I. Choi, M. Cosi, T. Edgin,
et al.,
“MDRepo
– an open data warehouse for community-contributed molecular dynamics
simulations of proteins,” Nucleic Acids Research, vol.
53, no. D1, pp. D477–D486, 2025, paper(doi):
10.1093/nar/gkae1109.
[6]
E.
Khajouei, V. Ghisays, I. S. Piras, K. L. Martinez, A. T. Vicenti,
et
al.,
“Phenome-wide association of APOE alleles in the all of
us research program,” eBioMedicine, vol. 117, p. 105768,
2025, paper(doi):
10.1016/j.ebiom.2025.105768.
[7]
X.
Chen, Z. Zhang, Y. Yan, C. Goubert, G. Bourque, and F. Inoue,
“A
phylogenetic approach uncovers cryptic endogenous retrovirus subfamilies
in the primate lineage,” Science Advances, vol. 11, no.
29, Jul. 2025, paper(doi):
10.1126/sciadv.ads9164.
[8]
I.
Robinson, G. Glidden-Handgis, N. Panchal, N. Insel, and T. Wheeler,
“DIPLOMAT: Multi-animal tracking with efficient manual
editing,” bioRxiv, 2025, paper(doi):
10.1101/2025.08.11.669786.
[9]
R.
E. Amaro, J. Åqvist, I. Bahar, F. Battistini, A. Bellaiche,
et
al.,
“The need to implement FAIR principles in biomolecular
simulations,” Nature Methods, vol. 22, no. 4, pp.
641–645, Apr. 2025, paper(doi):
10.1038/s41592-025-02635-0.
[10]
D.
Olson, T. Colligan, D. Demekas, J. W. Roddy, K. Youens-Clark, and T. J.
Wheeler,
“NEAR: Neural embeddings for amino acid
relationships,” Bioinformatics, vol. 41, pp. i449–i457,
2025, paper(doi):
10.1093/bioinformatics/btaf198.
[11]
J.
E. McDermott, W. C. Nelson, A. E. Zimmerman, W. Anthony, D.
Coleman-Derr,
et al.,
“Describing the persistence
landscape for introducing microbes into complex communities,”
arXiv, 2025, paper(doi):
10.48550/arXiv.2503.22133.
[12]
A.
N. Clements, A. L. Casillas, C. E. Flores, H. Liou, R. K. Toth,
et
al.,
“Inhibition of PIM kinase in tumor-associated
macrophages suppresses inflammasome activation and sensitizes prostate
cancer to immunotherapy,” Cancer Immunology Research,
2025, paper(doi):
10.1158/2326-6066.cir-24-0591.
[13]
P.
Ghosh, K. Fagnan, R. Connor, R. Pannu, T. J. Wheeler,
et al.,
“Contributions of the petabyte scale sequence search codeathon
toward efforts to scale sequence-based searches on SRA,”
arXiv preprint, 2025, paper(doi):
10.48550/arXiv.2505.06395.
[14]
J.
Gaiser and T. J. Wheeler,
“Simpatico: Accurate and ultra-fast
virtual drug screening with atomic embeddings,” bioRxiv,
2025, paper(doi):
10.1101/2025.06.08.658499.
[15]
T.
A. Nitka, J. Jacobson, C. H. Chang, G. R. Krause, T. J. Wheeler, R. G.
Egbert, W. C. Nelson, and J. E. McDermott,
“Snekmer learn/apply: A
kmer-based vector similarity approach to protein classification suitable
for metagenomic datasets,” bioRxiv, 2025, paper(doi):
10.1101/2025.05.16.654600.
[16]
G.
R. Krause, W. Shands, and T. J. Wheeler,
“Sensitive and error-tolerant annotation of protein-coding
DNA with BATH,” Bioinformatics Advances, Jun.
2024, paper(doi):
10.1093/bioadv/vbae088.
[17]
A.
J. Nord and T. J. Wheeler,
“Diviner uncovers hundreds of novel
human (and other) exons though comparative analysis of proteins,”
bioRxiv, 2024, paper(doi):
10.1101/2024.05.05.592595.
[18]
V.
Venkatraman, J. Gaiser, D. Demekas, A. Roy, R. Xiong, and T. J. Wheeler,
“Do molecular fingerprints identify diverse active drugs in
large-scale virtual screening? (no),” Pharmaceuticals,
vol. 17, no. 8, p. 992, Jul. 2024, paper(doi):
10.3390/ph17080992.
[19]
C.
Groza, X. Chen, T. J. Wheeler, G. Bourque, and C. Goubert,
“A
unified framework to analyze transposable element insertion
polymorphisms using graph genomes,” Nature
Communications, vol. 15, no. 1, Oct. 2024, paper(doi):
10.1038/s41467-024-53294-2.
[20]
T.
Anderson and T. Wheeler,
“An FPGA-based hardware accelerator
supporting sensitive sequence homology filtering with profile hidden
markov models,” BMC Bioinformatics, Jul. 2024,
paper(doi):
10.1186/s12859-024-05879-3.
[21]
D.
Geller-McGrath, K. M. Konwar, V. P. Edgcomb, M. Pachiadaki, J. W. Roddy,
T. J. Wheeler, and J. E. McDermott,
“Predicting metabolic modules
in incomplete bacterial genomes with MetaPathPredict,”
eLife, vol. 13, May 2024, paper(doi):
10.7554/elife.85749.
[22]
A.
C. Marbut, J. W. Chandler, and T. J. Wheeler,
“Exploring the
impact of a transformer’s latent space geometry on downstream task
performance,” arXiv, 2024, paper(doi):
10.48550/arXiv.2406.12159.
[23]
G.
Glidden-Handgis and T. J. Wheeler,
“WAS IT a MATch i SAW?
Approximate palindromes lead to overstated false match rates in
benchmarks using reversed sequences,” Bioinformatics
Advances, Apr. 2024, paper(doi):
10.1093/bioadv/vbae052.
[24]
D.
R. Olson and T. J. Wheeler,
“ULTRA-effective labeling of tandem
repeats in genomic sequence,” Bioinformatics Advances,
2024, paper(doi):
10.1093/bioadv/vbae149.
[25]
C.
J. Copeland, J. W. Roddy, A. K. Schmidt, P. R. Secor, and T. J. Wheeler,
“VIBES: A workflow for annotating and visualizing viral sequences
integrated into bacterial genomes [EDITOR’S
CHOICE],” NAR Genomics and Bioinformatics, vol.
6, no. 2, Apr. 2024, paper(doi):
10.1093/nargab/lqae030.
[26]
J.
M. Storer, J. A. Walker, J. N. Baker, S. Hossain, C. Roos, T. J.
Wheeler, and M. A. Batzer,
“Framework of the alu subfamily
evolution in the platyrrhine three-family clade of cebidae,
callithrichidae, and aotidae,” Genes, vol. 14, no. 2, p.
249, Jan. 2023, paper(doi):
10.3390/genes14020249.
[27]
J.
Schimunek, P. Seidl, K. Elez, T. Hempel, T. Le,
et al.,
“A community effort in SARS-CoV-2 drug
discovery,” Molecular Informatics, Nov. 2023,
paper(doi):
10.1002/minf.202300262.
[29]
T.
Colligan, K. Irish, D. J. Emlen, and T. J. Wheeler,
“DISCO: A deep learning ensemble for
uncertainty-aware segmentation of acoustic signals,” PLOS
ONE, vol. 18, no. 7, pp. 1–20, Jul. 2023, paper(doi):
10.1371/journal.pone.0288172.
[31]
V.
Venkatraman, T. H. Colligan, G. T. Lesica, D. R. Olson, J. Gaiser, C. J.
Copeland, T. J. Wheeler, and A. Roy,
“Drugsniffer: An open source
workflow for virtually screening billions of molecules for binding
affinity to protein targets,” Front. Pharmacol., vol.
13, Apr. 2022, paper(doi):
10.3389/fphar.2022.874746.
[32]
J.
F. Brodie, L. F. Henao-Diaz, B. Pratama, C. Copeland, T. Wheeler, and O.
E. Helmy,
“Fruit size in indo-malayan island plants is more
strongly influenced by filtering than by in situ evolution,”
The American Naturalist, Nov. 2022, paper(doi):
10.1086/723212.
[33]
R.
Hubley, T. J. Wheeler, and A. F. A. Smit,
“Accuracy of multiple sequence alignment methods in the
reconstruction of transposable element families [EDITOR’S
CHOICE],” NAR Genomics and Bioinformatics,
vol. 4, no. 2, May 2022, paper(doi):
10.1093/nargab/lqac040.
[34]
J.
W. Roddy, G. T. Lesica, and T. J. Wheeler,
“SODA: A
TypeScript/JavaScript library for visualizing
biological sequence annotation,” NAR Genomics
and Bioinformatics, vol. 4, no. 4, Oct. 2022, paper(doi):
10.1093/nargab/lqac077.
[35]
N.
Altemose, G. A. Logsdon, A. V. Bzikadze, P. Sidhwani, S. A. Langley,
et al.,
“Complete genomic and epigenetic maps of human
centromeres,” Science, vol. 376, no. 6588, Apr. 2022,
paper(doi):
10.1126/science.abl4178.
[36]
S.
J. Hoyt, J. M. Storer, G. A. Hartley, P. G. S. Grady, A. Gershman,
et al.,
“From telomere to telomere: The transcriptional
and epigenetic state of human repeat elements,” Science,
vol. 376, no. 6588, Apr. 2022, paper(doi):
10.1126/science.abk3112.
[38]
J.
Storer, R. Hubley, J. Rosen, T. J. Wheeler, and A. F. Smit,
“The Dfam community resource of transposable element
families, sequence models, and genome annotations,”
Mobile DNA, vol. 12, no. 1, p. 2, 2021, paper(doi):
10.1186/s13100-020-00230-y.
[40]
K.
M. Carey, R. Hubley, G. T. Lesica, D. Olson, J. W. Roddy, J. Rosen, A.
Shingleton, A. F. Smit, and T. J. Wheeler,
“PolyA: A
tool for adjudicating competing annotations of biological
sequences,” bioRxiv, 2021, paper(doi):
10.1101/2021.02.13.430877.
[41]
T.
A. Elliott, T. Heitkam, R. Hubley, H. Quesneville, A. Suh, and T. J.
Wheeler,
“TE hub: A community-oriented space for sharing and
connecting tools, data, resources, and methods for transposable element
annotation,” Mobile DNA, vol. 12, no. 1, p. 16, 2021,
paper(doi):
10.1186/s13100-021-00244-0.
[42]
P.
R. Secor, E. B. Burgener, M. Kinnersley, L. K. Jennings, V. Roman-Cruz,
et al.,
“Pf bacteriophage and their impact on pseudomonas
virulence, mammalian immunity, and chronic infections,”
Frontiers in Immunology, vol. 11, 2020, paper(doi):
10.3389/fimmu.2020.00244.
[43]
M.
Grimes, B. Hall, L. Foltz, T. Levy, K. Rikova,
et al.,
“Integration of protein phosphorylation, acetylation, and
methylation data sets to outline lung cancer signaling networks,”
Science Signaling, vol. 11, no. 531, p. eaaq1087, 2018,
paper(doi):
10.1126/scisignal.aaq1087.
[45]
P.
V. Hornbeck, J. M. Kornhauser, V. Latham, B. Murray, V. Nandhikonda,
et al.,
“15 years of
PhosphoSitePlus: Integrating
post-translationally modified sites, disease variants and
isoforms,” Nucleic Acids Research, vol. 47, no. D1, pp.
D433–D441, 2018, paper(doi):
10.1093/nar/gky1159.
[47]
R.
Hubley, R. D. Finn, J. Clements, S. R. Eddy, T. A. Jones, W. Bao, A. F.
A. Smit, and T. J. Wheeler,
“The Dfam database of
repetitive DNA families,” Nucleic Acids
Research, vol. 44, no. D1, pp. D81–D89, 2015, paper(doi):
10.1093/nar/gkv1272.
[48]
R.
D. Finn, J. Clements, W. Arndt, B. L. Miller, T. J. Wheeler, F.
Schreiber, A. Bateman, and S. R. Eddy,
“HMMER web
server: 2015 update,” Nucleic Acids Research, vol. 43,
no. W1, pp. W30–W38, 2015, paper(doi):
10.1093/nar/gkv397.
[49]
D.
R. Hoen, G. Hickey, G. Bourque, J. Casacuberta, R. Cordaux,
et
al.,
“A call for benchmarking transposable element annotation
methods,” Mobile DNA, vol. 6, no. 1, 2015,
paper(doi):
10.1186/s13100-015-0044-6.
[50]
T.
J. Wheeler, J. Clements, and R. D. Finn,
“Skylign: A tool for
creating informative, interactive logos representing sequence alignments
and profile hidden markov models,” BMC
Bioinformatics, vol. 15, no. 1, 2014, paper(doi):
10.1186/1471-2105-15-7.
[51]
T.
J. Wheeler, J. Clements, S. R. Eddy, R. Hubley, T. A. Jones, J. Jurka,
A. F. A. Smit, and R. D. Finn,
“Dfam: A database of
repetitive DNA based on profile hidden markov
models,” Nucleic Acids Research, vol. 41, no. D1, pp.
D70–D82, 2013, paper(doi):
10.1093/nar/gks1265.
(extra:
http://wheelerlab.org/publications/Wheeler13/Wheeler13.supplement.tar.gz)
[54]
J.
Kececioglu, E. Kim, and T. Wheeler,
“Aligning protein sequences
with predicted secondary structure,” Journal of Computational
Biology, vol. 17, no. 3, pp. 561–580, 2010, paper(doi):
10.1089/cmb.2009.0222.
[55]
G.
Tanifuji, N. T. Onodera, T. J. Wheeler, M. Dlutek, N. Donaher, and J. M.
Archibald,
“Complete nucleomorph genome sequence of the
nonphotosynthetic alga cryptomonas paramecium reveals a core nucleomorph
gene set,” Genome Biology and Evolution, vol. 3, pp.
44–54, 2010, paper(doi):
10.1093/gbe/evq082.
[59]
T.
J. Wheeler and J. D. Kececioglu,
“Multiple alignment by aligning
alignments,” Bioinformatics, vol. 23, no. 13, pp.
i559–i568, 2007, paper(doi):
10.1093/bioinformatics/btm226.
[60]
J.
M. Good, C. A. Hayden, and T. J. Wheeler,
“Adaptive protein
evolution and regulatory divergence in drosophila,” Molecular
Biology and Evolution, vol. 23, no. 6, pp. 1101–1103, 2006,
paper(doi):
10.1093/molbev/msk002.
[61]
C.
A. Hayden, T. J. Wheeler, and R. A. Jorgensen,
“Evaluating and
improving cDNA sequence quality with cQC,” Bioinformatics, vol. 21, no.
24, pp. 4414–4415, 2005, paper(doi):
10.1093/bioinformatics/bti709.
[62]
A.
D. Cutter, J. M. Good, C. T. Pappas, M. A. Saunders, D. M. Starrett, and
T. J. Wheeler,
“Transposable element orientation bias in the
drosophila melanogaster genome,” Journal of Molecular
Evolution, vol. 61, no. 6, pp. 733–741, 2005, paper(doi):
10.1007/s00239-004-0243-0.
Software
GitHub
Simpatico
(simple atomic interaction-prediction with
contrastive learning)
Graph Neural Network for atomistic-level embeddings that enable
accurate and hyper-fast search for ligands compatible with a
target protein. Gaiser, J., and Wheeler, T.J. 2025
nail is a tool for protein sequence database search that is both very
fast and very sensitive. Roddy, J.W., Rich, D.H., and Wheeler, T.J.
2024
NEAR (Neural
Embeddings for Amino acid Relationships)
Neural representation for very fast and highly sensitive
alignment-free protein sequence search. Olson, D.R., Colligan, T.,
Demekas, D., and Wheeler, T.J. 2024
BATH (Protein-DNA
sequence search)
Better Alignments with Translated HMMER - Frameshift-Aware Translated
Hidden Markov Models for the Annotation of Protein-Coding DNA. Krause,
G., Shands, W., and Wheeler, T.J. 2024
sufr (A suffix array
implementation in Rust) - (and the awry-FMindex library)
sufr is a tool and library for very fast, low memory construction of
a suffix array on protein or nucleotide sequence. It also implements
fast (and optionally low-memory) suffix array search. Youens-Clark, K.,
Roddy, J.W., and Wheeler, T.J. 2025
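For readers unfamiliar with the data structure, here is a minimal Python sketch of what a suffix array is and how it supports search. It is only an illustration under simple assumptions: the function names `build_suffix_array` and `find_occurrences` are hypothetical, the construction is the naive sort-all-suffixes approach, and none of this reflects how sufr itself is implemented.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, in sorted order.

    Naive construction (sorts full suffixes); fine for short examples,
    far too slow and memory-hungry for genome-scale input.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return every start position of `pattern` in `text`, using binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    sequence = "ACGTACGA"
    sa = build_suffix_array(sequence)
    print(find_occurrences(sequence, sa, "ACG"))
```

Because the suffixes are sorted, all matches to a pattern occupy one contiguous block of the array, which is why two binary searches are enough to locate them.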
A tool for locating and labeling tandemly-repetitive sequence. Olson,
D. and Wheeler, T.J. 2024
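As a rough illustration of what labeling tandemly repetitive sequence involves, the following Python sketch finds exact tandem repeats such as 'atg' repeated several times in a row. It is not the method used by the tool described above: the function name `find_tandem_repeats` and the thresholds `max_period` and `min_copies` are assumptions made for this example, and practical tools must also tolerate substitutions and indels within repeats.

```python
def find_tandem_repeats(seq: str, max_period: int = 6, min_copies: int = 3):
    """Return (start, end, unit) tuples for exact tandem repeats in `seq`.

    `end` is exclusive. A region counts as a repeat when a unit of length
    `period` recurs for at least `min_copies` full copies.
    """
    repeats = []
    n = len(seq)
    i = 0
    while i < n:
        best = None  # (length, period) of the longest repeat starting at i
        for period in range(1, max_period + 1):
            if i + period >= n:
                break
            # Extend while each base matches the base one period earlier.
            j = i + period
            while j < n and seq[j] == seq[j - period]:
                j += 1
            length = j - i
            if length >= period * min_copies:
                if best is None or length > best[0]:
                    best = (length, period)
        if best is not None:
            length, period = best
            repeats.append((i, i + length, seq[i:i + period]))
            i += length  # skip past the labeled region
        else:
            i += 1
    return repeats


if __name__ == "__main__":
    print(find_tandem_repeats("ccatgatgatgatgatgcc"))
```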
DIPLOMAT (Tracking multiple
animals through video recordings)
Deep learning-based Identity Preserving Labeled-Object Multi-Animal
Tracking. Robinson, I., Insel, N., and Wheeler, T.J. 2023
Drugsniffer (Billion-scale virtual
drug screening)
An open source workflow for virtually screening billions of molecules
for binding affinity to protein targets. Venkatraman, V., Colligan,
T.H., Lesica, G.T., Olson, D.R., Gaiser, J., Copeland, C., Wheeler,
T.J., and Roy, A. 2022.
DISCO (Annotation
of sound blocks in audio recordings)
DISCO Implements Sound Classification Obediently. Colligan, T.,
Irish, K., Emlen, D.J., and Wheeler, T.J. 2022
SODA (A Library for building annotation
visualizations)
An Open Source Library for Visualizing Biological Sequence Annotation.
Roddy, J., Lesica, G., and Wheeler, T.J. 2021
A fast, AVX2-accelerated FM-index library for hyper-fast string
pattern matching in nucleotide and amino acid sequences. Open source, C
library. Anderson, T. and Wheeler, T.J. 2021
A tool for splice-aware multiple sequence alignment. Nord, A. and
Wheeler, T.J. 2018
Biological sequence analysis using profile hidden Markov models.
Eddy, S.R. and Wheeler, T.J. 2013
A DNA-DNA sequence homology search tool based on profile hidden
Markov models, in the HMMER3 framework. Wheeler, T.J. and Eddy, S.R.
2012
A Mesquite package for fast neighbor-joining phylogeny inference.
Wheeler, T.J. and Maddison, D.R. 2010
Software for large-scale neighbor-joining phylogeny inference.
Wheeler, T.J. 2009
A Mesquite package for multiple sequence alignment. Wheeler, T.J. and
Maddison, D.R. 2009
Software for multiple sequence alignment by optimally aligning
alignments. Wheeler, T.J. and Kececioglu, J.D. 2006
A Mesquite package for aligning sequence data. Maddison, D.R.,
Wheeler, T.J., and Maddison, W.P. 2006
Software for optimally aligning alignments. Starrett, D.M., Wheeler,
T.J., and Kececioglu, J.D. 2005
A tool for resolving putative sequencing errors in single-pass cDNA,
based on genomic sequence. Hayden, C.A. and Wheeler, T.J. 2005
Web Services
and Databases
MDRepo (A database of molecular dynamics
simulations)
An open repository of MD simulations for proteins, with or without
ligands, generated by the worldwide community of researchers. Roy, A.,
Ward, E., …, Wheeler, T.J. 2024-.
TE Hub (A place for TE researchers to
connect with the community)
A place where researchers working on Transposable Elements (TEs) can
catalog available online resources. It is organized as a collection of
wiki pages, enabling community contribution and collaboration. The TE
Hub Consortium, Elliott T., Heitkam T., Hubley R., Quesneville H., Suh
A., Wheeler T.J. 2021-.
Dfam (A database of transposable element
families)
A Database of Repetitive DNA Based on Profile Hidden Markov Models.
Hubley, R., Smit, A.F.A., …, Wheeler, T.J. 2012-.
A tool for creating informative, interactive logos representing
sequence alignments and profile hidden Markov models. Wheeler, T.J.,
Clements, J., Finn, R.D. 2013-.
Funding
Generative
AI to Predict and Drug Pathological Mitochondrial Fission
Skaggs Scholars Program
(PI: Blake
Hill @ University of
Colorado Anschutz, School of Pharmacy)
Excessive mitochondrial fission mediated by the FIS1-DRP1 interaction
drives dysfunction in type 2 diabetes, neurodegeneration, and ischemic
injury. No experimental structure of the FIS1-DRP1 complex exists,
hindering rational drug design. The focus of this project is to develop
a structural ensemble of the complex that will aid in drug development,
using a combination of AI and molecular dynamics simulation
approaches.
NIH
1U01DE034176 (2024-2028)
(PI: Jason
McDermott @ Pacific Northwest
National Laboratory)
This project will help understand the impact of bacteriophage,
viruses infecting bacteria, on human health by developing new
computational tools to understand the function of these viruses. The
results of this project could illuminate causes of diseases that are
linked to the microbiome and help provide therapies for treatment of the
microbiome to enhance health.
SFA-Secure
Biosystems Design: Persistence Control of Engineered Functions in
Complex Soil Microbiomes
DOE PerCon
SFA (2023-2026)
(PI: Robert
Egbert @ Pacific Northwest National
Laboratory)
Collaborating across highly integrated institutions, PerCon SFA
scientists are exploring how environmental niches can be sculpted using
the mechanisms of genome reduction and metabolic addiction to drive
secure rhizosphere community design for robust biomass cropping in
challenging environments. Our group works to develop improved Machine
Learning methods to recognize similarities between proteins.
Dfam:
sustainable growth, curation support, and improved quality for mobile
element annotation
NIH
1U24HG010136 (2018-2028)
(PI: Arian
Smit, co-PI: Robert Hubley @ Institute for Systems Biology)
Most of the vertebrate genome finds its ultimate origin in
transposable elements (TEs), and the thorough annotation of TEs is a
critical aspect of genome annotation pipelines. The goal of the proposed
effort is to develop the infrastructure of Dfam to expand to 1000s of
genomes, and to establish a self-sustaining TE Data Commons dependent on
limited centralized curation. We will also improve the quality of repeat
annotation by developing methods for more reliable alignment
adjudication, expanding approaches to visualization of this complex data
type, and improving the modeling of TE subfamilies.
Development
and Maintenance of RepeatMasker and RepeatModeler
NIH
R01HG002939 (2022-2027)
(Multi-PI w/: Arian Smit and Robert Hubley @ Institute for Systems Biology)
Most of the vertebrate genome finds its ultimate origin in
transposable elements (TEs), and their annotation is crucial for genome
sequence analysis and for our understanding of TEs' unrivaled impact on
genome biology and evolution. Their de novo discovery and description
have become a bottleneck in the genome analysis of the thousands of new
species sequenced every year. In this effort, we will make foundational
changes to the way RepeatMasker adjudicates TE alignments and assigns
confidence to annotations, develop two paths to improving the generation
of new TE libraries (through the use of multi-species genome alignments
and ancestral reconstructions), and make core algorithmic changes to
our RepeatModeler discovery tool.
Past grants
Overcoming
Combinatoric Complexity Problems in Computational Mass Spectrometry
NSF 1933305
(2022-2024)
We will develop algorithms for improved identification of peptides
from tandem mass spectrometry datasets.
Discovery
of Immunogenomic Associations with Disease and Differential Risk Across
Diverse Populations
NIH
1R21HL172036 (2023-2025)
(PI: Jason
Karnes @ UArizona)
Genetic variation in immune-related genes, as in the human leukocyte
antigen (HLA) locus, plays a pervasive role across organ systems. HLA
variation, called HLA alleles, is used to match organ donors, and has
been associated with adverse drug reactions (ADRs), cancer, infections,
and cardiovascular and neurologic diseases. However, most studies focus
on the impact of HLA variation on specific immune-mediated diseases; the
broader influence of HLA variation across all human disease has not been
investigated in depth. In Aim 1, HLA alleles will be determined using
whole genome sequence data, and PheWAS will be deployed in All of Us to
explore ancestral differences in HLA/phenotype associations. In Aim 2, we
will develop Machine Learning strategies to explore the effect of HLA
allele interactions on disease, and explore the potential for
recognizing pleiotropic influences of HLA alleles.
Building
Knowledge About Alternatively-spliced Dual-Coding Exons
NIH
R21HG012283 (2022-2024)
The goal of this study is to catalog the tissue- and
development-specific splicing patterns of dual-coding exon variants, and
to computationally explore their mechanisms of control and expected
functional impact.
Machine
learning approaches for improved accuracy and speed in sequence
annotation
NIH
1R01GM132600 (2019-2024)
Alignment of biological sequences is a key step in understanding
their evolution, function, and patterns of activity. We will develop
Machine Learning approaches to improve both accuracy and speed of
highly-sensitive sequence alignment. To improve accuracy, we will
develop methods based on both hidden Markov models and Artificial Neural
Networks to reduce erroneous annotation caused by (1) the existence of
low complexity and repetitive sequence and (2) the overextension of
alignments of true homologs into unrelated sequence. We also address the
issue of annotation speed, with development of a custom Deep Learning
architecture designed to very quickly filter away large portions of
candidate sequence comparisons prior to the relatively-slow
sequence-alignment step.
Integrating
Deep Learning Methods with Molecular Surface Properties to Improve Drug
Screening
Arizona
TRIF initiative (2023-2024)
Virtual drug screening will dramatically expand the diversity of
explored candidate drugs, while reducing time and cost of discovery. We
will extend development of AI methods to predict good drug candidates
for a target protein. Models will explore billions of candidate
synthesizable drugs, and will complete development of a first-in-class
repository of drug interaction simulations.
Machine
learning approaches for integrating multi-omics data to expand
microbiome annotation
DOE DE-SC0021216
(2020-2023)
(Joint with Jason
McDermott @ Pacific Northwest
National Laboratory)
Communities of microbes in soil are key contributors to the
plant-soil dynamic that supports production of food and fuel crops, for
example driving nitrogen fixation, drought resistance, and nutrient
cycling. The composition and interactions of these communities are of
great importance, but these are often difficult to fully characterize
due to challenges with sample acquisition, data processing, and
community complexity and diversity. The effort supported by this grant
will improve understanding of soil microbial communities through a
combination of improved engineering for prototyped sequence annotation
software, novel approaches in Deep Learning sequence annotation, and a
new Bayesian method for integrating data from multiple high-throughput
omics sources (particularly genomics and metabolomics).
Learning and
Neural Coding of Social Expectations
NIH
1R15MH117611 (2019-2022)
(PI: Nathan
Insel @ University of
Montana - Psychology)
The goals of this project relate to social cognition in Degus (highly
social rodents). The Wheeler lab role involves development of machine
learning methods for tracking of multiple animals in video and behavior
classification in those videos.
Improved
protein-DNA models for translated sequence search with profile HMMs
NIH
1R15HG009570-01 (2017-2020)
Fast and sensitive sequence database search is fundamental to modern
molecular biology. This proposal describes a research plan to improve
the accuracy of annotation of protein-coding content in sequenced
genomes and metagenomic datasets. The research builds on established
sequence database search software that employs probabilistic models to
increase sensitivity through greater statistical power and ability to
better model family complexity. The probabilistic models are called
profile hidden Markov models (profile HMMs), and the software is
HMMER.
The taxonomic breadth of sequenced datasets requires methods with the
power to detect remote sequence similarity; raw data and sequencing
errors demand models that recognize frameshifts and splice sites; and
the massive scale of datasets demands that implementations be fast. My
group will develop novel models for frameshifts and splice site
detection in profile HMM homology search. Direct modeling of these
features within search software effectively uses homology to guide
ORF/gene prediction, which in turn leads to better homology detection.
Through a combination of new algorithms and application of existing
approaches, these models will be fast enough to use for large-scale
annotation, such as in the EMBL European Bioinformatics Metagenomics
Portal.
Methods
for fast bio-sequence comparison with profile hidden Markov models
P20GM103546 NIH CoBRE
(2017-2020)
With the continued explosive increase in genomic and metagenomic
sequencing, the community requires effective and increasingly scalable
methods to more fully decode, organize, and exploit sequence data.
Accurate and complete annotation of a genomic dataset, based on sequence
homology, is a critical first step in understanding its content. This
annotation often boils down to sequence database search – the act of
searching in a large sequence dataset to find sequences that are similar
to known elements.
We aim to develop methods that will substantially improve the speed
of sequence comparison with profile hidden Markov models, meeting the
need for methods that are fast enough to accommodate large-scale
databases, while still powerful enough to detect remote sequence
similarity. We will implement these methods in the HMMER codebase,
focusing on three complementary target optimizations. Specifically, the
aims are:
- Index-based acceleration of the key filter stage of HMMER
- Sparse completion of Forward/Backward Dynamic Programming
matrix
- Acceleration with FPGA configurable hardware
Reducing
false sequence annotation due to alignment overextension and repetitive
sequences
P20GM103546 NIH CoBRE
(2016-2017) Pilot grant
Sequence comparison is fundamental to modern molecular biology. Much
effort has been expended in the development of methods to make
comparison faster and more sensitive. Though the risk of false
annotation is understood, the extent and key causes have only been
lightly addressed. Two primary sources of false annotation are (1) the
overextension of alignments of true homologs into unrelated sequence,
and (2) the existence of low complexity sequence, especially when the
query and target share similar patterns of repetitive sequence, such as
atgatgatgatgatg (‘atg’, repeated). In our experience, these issues
together cause >2% of all annotation to be incorrect, even with
current strategies for avoiding the resulting errors. Furthermore, these
strategies are themselves responsible for some loss in sensitivity to
remote homology. This study will lay the groundwork for addressing both
sources of false annotation. Specifically:
- (Alignment overextension) we will perform a survey of existing
methods for mitigating overextension, and prototype two novel methods
for limiting overextension.
- (Repetitive sequence) We will develop a probabilistic hidden Markov
model that represents random genomic and protein sequenc
Lab stuff (by invite only)
Institutional
memory
Conferences