Neil Thomas
I work at the intersection of AI and biology. I am currently at Biohub (fka EvolutionaryScale) building AI tools for biological design.
Previously, I was a Research Scientist at Google X. I completed my PhD
in Computer Science at UC Berkeley in 2022, advised by
Professor Yun S. Song. Prior to that, I was a
Software Engineer at 23andMe. I received my BS in
Engineering Mathematics and Statistics from UC Berkeley.
When I'm not being humbled by biology, I like to be humbled by a variety of hobbies. I like cooking
recipes from Alison Roman, climbing rocks, skiing, playing ultimate frisbee, cycling, playing piano,
watching comedy, and watering my plants.
email / twitter / bsky / github / scholar / linkedin
Research Highlights
My research focuses on learning meaningful representations of proteins, with the aim of enabling
applications in protein design, functional annotation, and structure prediction. Check out my thesis
talk "Browsing in the Library of Babel" for an
accessible introduction.
Language Modeling Materializes a World Model of Protein Biology
Salvatore Candido*, Thomas Hayes*, Alexander Derry*, Roshan Rao*, Zeming Lin*, Robert Verkuil, Bryan Z. Wu, Jin Sub Lee, Elise S. Bruguera, Jehan A. Keval, Mykhailo Kopylov, John E. Pak, Wesley Wu, Neil Thomas, Samson Mataraso, Alvin Hsu, Ashton C. Trotman-Grant, Kilian Fatras, Allan dos Santos Costa, ..., Tom Sercu, Alexander Rives
bioRxiv, 2026
paper / code / tweetorial / talk
Introduces ESM Cambrian (ESMC), a 6B-parameter protein language model trained on over 2B sequences. Introduces ESMFold2, a state-of-the-art single-sequence folding model that uses an ESMC backbone. Validates inversion of ESMFold2 for the design of miniprotein and scFv binders with nanomolar affinities. Introduces the ESM Atlas, a map of 6.8B sequences and 1.1B predicted structures connected through interpretable features derived from ESMC’s latent space.
Engineering highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening
Neil Thomas*, David Belanger*, Chenling Xu, Hanson Lee, Kathleen Hirano, Kosuke Iwai, Vanja Polic, Kendra Nyberg, Kevin Hoff, Lucas Frenz, Charlie Emrich, Jun W Kim, Mariya Chavarha, Abi Ramanan, Jeremy Agresti, Lucy J Colwell
Cell Systems, 2025
paper / code / tweetorial / talk
Introduces TeleProt, a framework for guiding protein library design with machine learning that leverages evolutionary and experimental data. Validated in an enzyme engineering campaign to optimize the endonuclease NucB. Across four rounds, ML-designed libraries show improved hit rate and diversity compared to directed evolution, in the first head-to-head comparison of its kind.
Simulating 500 million years of evolution with a language model
Thomas Hayes*, Roshan Rao*, Halil Akin*, Nicholas James Sofroniew*, Deniz Oktay*, Zeming Lin*, Robert Verkuil*, Vincent Quy Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul Santiago Molina, Neil Thomas, Yousuf Khan, Chetan Mishra, Carolyn Kim, Liam J. Bartie, Patrick D. Hsu, Tom Sercu, Salvatore Candido, Alexander Rives
Science, 2025
paper / code / tweetorial / talk / blog
A multi-modal generative protein model that reasons flexibly across protein sequence, structure, and function.
Tuned Fitness Landscapes for Benchmarking Model-Guided Protein Design
Neil Thomas*, Atish Agarwala*, David Belanger, Yun S. Song, Lucy J. Colwell
bioRxiv, 2022
paper / code / tweetorial
Tunable, realistic, synthetic fitness landscapes for benchmarking protein design.
Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention
Nicholas Bhattacharya*, Neil Thomas*, Roshan Rao, Justas Dauparas, Peter K. Koo, David Baker, Yun S. Song, Sergey Ovchinnikov
Pacific Symposium on Biocomputing, 2022
paper / code / tweetorial / talk
Introduces “factored attention,” a simplified attention layer that we use to compare and contrast Potts models and Transformers.
Evaluating Protein Transfer Learning with TAPE
Roshan Rao*, Nicholas Bhattacharya*, Neil Thomas*, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song
Advances in Neural Information Processing Systems (Spotlight), 2019
paper / code / tweetorial / talk / podcast / blog
A suite of benchmarking tasks for protein language models.
Whole-genome sequencing reveals a complex African population demographic history and signatures of local adaptation
Shaohua Fan, Jeffrey P. Spence, Yuanqing Feng, Matthew E.B. Hansen, Jonathan Terhorst, Marcia H. Beltrame, Alessia Ranciaro, Jibril Hirbo, William Beggs, Neil Thomas, Thomas Nyambo, Sununguko Wata Mpoloka, Gaonyadiwe George Mokone, Alfred K. Njamnshi, Charles Fokunang, Dawit Wolde Meskel, Gurja Belay, Yun S. Song, Sarah A. Tishkoff
Cell, 2023
paper
End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman
Samantha Petti, Nicholas Bhattacharya, Roshan Rao, Justas Dauparas, Neil Thomas, Juannan Zhou, Alexander M. Rush, Peter K. Koo, Sergey Ovchinnikov
Bioinformatics, 2022
paper
Functional genomics of OCTN2 variants informs protein-specific variant effect predictor for Carnitine Transporter Deficiency
Megan L. Koleske, Gregory McInnes, Julia E. H. Brown, Neil Thomas, Keino Hutchinson, Marcus Y. Chin, Antoine Koehl, Michelle R. Arkin, Avner Schlessinger, Renata C. Gallagher, Yun S. Song, Russ B. Altman, Kathleen M. Giacomini
PNAS, 2022
paper
Minding the gaps: The importance of navigating holes in protein fitness landscapes
Neil Thomas, Lucy Colwell
Cell Systems (Preview), 2021
paper
Teaching
During my graduate studies at Berkeley, I had the privilege of teaching:
- Summer 2022 CS 188: Introduction to Artificial Intelligence
- Fall 2020 Stat 135: Concepts of Statistics
My teaching statement.