This release was made possible by a strong cross-team effort, with key contributions from @erika-alden.bsky.social, Anjali Chadha, Dave Ross’s team at the National Institute of Standards and Technology (NIST), and the DAMP Lab at @bostonu.bsky.social.
🔗 Access the dataset: data.alignbio.org
Posts by The Align Foundation
TEV protease is a cornerstone tool in biotechnology, known for its high substrate specificity, and this dataset provides a rich resource for enzyme engineering and ML-driven protein design.
To our knowledge, this is the largest mutational dataset on TEV protease to date. Notably, no comprehensive deep mutational scanning study across the full protein has been reported in over three decades, leaving key aspects of its functional landscape unexplored.
📢 Data Release Tuesday: Align TEV Protease Dataset
📊 We’re expanding The Align Foundation data ecosystem again with ~30,000 high-quality GROQ-seq data points capturing TEV protease sequence–function relationships at scale.
To our knowledge, this is the largest mutational dataset on T7 RNA polymerase to date!
🔗 Access the dataset on the Align Data Portal: hubs.la/Q04802KK0
#OpenScience #SyntheticBiology #ProteinEngineering #BioAI #MachineLearning #AlignData #GROQSEQ #RNApolymerase
📢 Public Data Release: Align T7 RNA Polymerase Dataset.
📊 The data keeps coming at Align! We’re excited to release our T7 RNA polymerase dataset, adding ~35,000 unique GROQ-seq data points to the growing Align data ecosystem, capturing sequence–function relationships across variants at scale.
🔗 Access the dataset on the Align Data Portal: hubs.la/Q0470NBn0
👏 Huge thanks to our collaborators and everyone involved in making this dataset possible.
This is the result of a fantastic collaboration with David Ross's Lab at NIST, @doelsnitz.bsky.social at @harvard.edu and the DAMP Lab. We’re thrilled to make the data publicly accessible to support the broader research community.
This dataset was generated using the GROQ-seq platform, enabling large-scale quantitative mapping of TF binding landscapes. It shows strong cross-site reproducibility, with highly consistent measurements generated independently at NIST and the DAMP Lab at @bostonu.bsky.social.
📊 100,000+ data points measuring transcription factor–DNA interactions across three regulators (LacI, RamR, VanR).
🚀 Public Data Release: GROQ-seq Transcription Factor Dataset
We’re excited to announce the public release of The Align Foundation's first GROQ-seq dataset, one of the largest high-resolution transcription factor datasets of its kind.
See our press release here: www.businesswire.com/news/home/20...
Align and Google DeepMind are partnering to build AI-ready datasets & evaluations for the future of predictive #AMR biology. Researchers worldwide can submit concepts through March 31 with roadmapping workshops coming to North America + APAC this spring. 🔗 docs.google.com/forms/d/e/1F...
Designed to be scalable and to support closed-loop experimentation, Tesseract will power AI to map sequence → function, predict gene transfer success, and assign function to unknown genes from rich phenotypic signatures.
Big thanks to all the collaborators and reviewers who helped shape this proposal!
In partnership with @pioneerlabs.bsky.social, we’re proposing Tesseract: a large-scale, open microbial phenomics dataset to functionally annotate microbial genomes at scale. 🧬🤖
✅ 5M diverse genes × 50 host strains × 100 conditions
🔗 Read the proposal: https://zenodo.org/records/17990299
5/5
We understand this may cause inconvenience, and we appreciate your patience and continued support as we work to deliver a high-quality, impactful Tournament. For any questions or concerns, please contact us at tournament@alignbio.org
4/5
The new timeline is as follows:
-Registration deadline: Nov. 14th 2025
Predictive phase:
-Zero-shot track: Dec. 1st 2025 to Jan. 16th 2026
-Supervised track: Jan. 26th to Feb. 20th 2026
Generative phase:
-Round-1: Mar. 9th to Apr. 3rd 2026
-Round-2: Jul. 20th to Aug. 14th 2026
3/5
The Tournament will establish the first-of-its-kind benchmark for predictive and generative AI/ML models in engineering PETases. We remain committed to enabling real-world, transparent, and accessible benchmarking competitions, and this adjustment is to uphold that commitment.
2/5
This additional time will allow us to:
•Complete the generation of a high-quality training dataset 🧬🧪
•Secure the best sponsorships to expand access to the Tournament 🖥️💪
•Register more participants 👥🚀
1/5
To ensure the highest impact and participation, we are adjusting the timeline of the Protein Engineering Tournament. The Tournament will now begin on December 1st, 2025.
🧫 Meet the keynote speakers for the 2025 scverse conference!
Erika Alden DeBenedictis, Co-founder of The Align Foundation & CEO of Pioneer Labs
@erika-alden.bsky.social
@alignbio.bsky.social
@pioneerlabs.bsky.social
#scverse #scverse2025 #StructuralBiology #Biotech #Stanford
Register before Oct. 17 👉 alignbio.org/get-involved...
More about EvolutionaryScale’s API (Forge) 👉 forge.evolutionaryscale.ai
📣 EvolutionaryScale is supporting the 2025 Protein Engineering Tournament!
Teams get sponsored ESM inference through @evolutionaryscale.bsky.social's API (Forge), bringing frontier protein AI models into the competition.
Level playing field, real experiments, open results. Register now 👇
Thrilled to welcome Amanda Reider Apel to The Align Foundation as Director of Protein Projects! 🚀 A biotech leader who builds high-impact science and thriving teams, she’ll drive protein engineering strategy while fostering a culture of innovation and collaboration.
Why PETase for our tournament? In 2024, the world made about 30 million tonnes of PET plastic, most from fossil fuels.
PETase can degrade PET, but isn’t ready for industrial-scale waste. The challenge: design an improved variant that can change that.
Register by Oct 17 alignbio.org/protein-engi...
4/4
What’s coming:
→ More protein functions
→ Sharper ML strategies
→ Public data access portal
→ And yes—we’re hiring.
🔗 alignbio.org/careers
#openbiology #machinelearning #proteinfunction #bioengineering
3/4
GROQ-seq is designed for reproducibility.
Standardized assays + distributed data collection = a foundation for robust, ML-ready protein function data.
Next: scale up function types, refine ML pipelines, expand access.
2/4
🧪 Year One in brief:
Started onboarding 6 protein function types
Launched multi-site runs at NIST + DAMP Lab
Achieved 100K–500K variant throughput per run
Standardized, modular, reproducible, open-source
📘 Full report: doi.org/10.5281/zeno...
1/4 We just published the 1st-year update on GROQ-seq: an open access, high-throughput experimental platform capable of testing hundreds of thousands of protein variants per run, running @ NIST and the DAMP Lab. Here’s what we built, learned, and what’s next:
#GROQseq #syntheticbiology
🔗 Sign up by Oct 17 → bit.ly/Tournament25
🔗 Want to read more? Last year’s amylase engineering competition was recently published in PROTEINS: onlinelibrary.wiley.com/doi/10.1002/...
#AIforBiology #ProteinDesign #ClimateTech