To advance the family-based modelling approach, we are releasing the entire framework open source:
ProFam Atlas: A curated, large-scale training corpus containing nearly 40 million protein families.
Code & Weights: github.com/alex-hh/prof...
Data: zenodo.org/records/1771...
Posts by CATH-Gene3D
For design, ProFam-1 excels at homology-guided generation. It produces diverse sequences with low sequence identity to natural proteins while preserving predicted structural similarity and conservation patterns of the natural family, even when conditioning on just a single example sequence.
By conditioning on homologous sequences, ProFam-1 is competitive with state-of-the-art zero-shot fitness prediction on ProteinGym, outcompeting much larger PLMs such as ESM.
Built by CATH, TUM and NVIDIA, ProFam-1 is our new open-source protein family language model (pfLM) designed to generate functional protein variants and predict fitness using in-context example sequences.
It was lovely to speak at the CATH 30 symposium, celebrating 30 years of the @cathgene3d.bsky.social protein structure classification database. I was presenting recent work on our new generative protein-family language model: preprint coming soon.
Rob Finn on MGnify, everything bacteria and functions in different environments
Now Maria Martín from UniProt is telling us how AI-based tools are shaping the future of one of the key resources for protein sequences and function.
From structures to sequences, now Alex Bateman and the quest to annotate and classify all proteins!
Starting our afternoon session with a talk by Sameer Velankar, of PDBe and AFDB fame among other endeavours!
And now @gonzaparra.bsky.social on his first talk on protein frustration as a PI! Well done!
David Jones, on novel folds in AFDB and CATH’s founding being celebrated at a now-closed pizza place in Euston Station
From CATH to Computational Enzymology, Dame Janet Thornton on the birth of CATH and beyond!
First Keynote by Burkhard Rost, on the impact of protein language models on the field of structural biology
Kickstarting our symposium “Protein Annotations in the age of AI” at UCL!
Congratulations @judewells.bsky.social!
If you'd like to showcase your research with a poster, details are included in the registration page.
We hope to see you there!
We have a stellar lineup of speakers!
Christine Orengo
Burkhard Rost
Janet Thornton
David Jones
Gonzalo Parra @gonzaparra.bsky.social
Sameer Velankar
Alex Bateman
Maria Martin
Rob Finn
Gerardo Tauriello
Alexey Murzin
There will be talks from world leaders in structural bioinformatics on various themes, including pioneering protein language models and key international resources such as PDBe, InterPro, UniProt, MGnify, SWISS-MODEL, FrustraEvo and CATH.
CATH turns 30 years old this year!
We are organising a 1-day symposium on September 16th at UCL, highlighting recent AI-based developments to enhance protein family classifications, annotations and analyses.
www.eventbrite.co.uk/e/protein-an...
Another CATH outing at Greenwich Park after a lovely cruise along the Thames and a pub lunch!
Our latest preprint is out on bioRxiv!
In a collaboration between the groups of @martinsteinegger.bsky.social, David Jones and Christine Orengo, we clustered AlphaFold Database and ESMatlas, a whopping 821 million proteins!
We reveal biome-specific groups & over 11k novel domain combinations.
TED is a collaborative project between the structural bioinformatics groups of Professor David Jones & Professor Christine Orengo @cathgene3d.bsky.social at @ucl.ac.uk.
The TED integration is set to enhance the interpretability and usability of #AlphaFold predictions. Is this useful in your work?
🚀 #AlphaFold Database update
AlphaFold DB now integrates The Encyclopedia of Domains (TED) – a resource designed to systematically identify & classify structural domains within AlphaFold-predicted protein structures.
www.ebi.ac.uk/about/news/u...
@pdbeurope.bsky.social
CATHmas lunch 2024!
We have updated sequences in our Functional Families by scanning FunFam-HMMs against UniProt release 2024_02, giving a 276% increase in FunFams coverage. The mapping of TED structural domains has resulted in a 4-fold increase in FunFams with structural information.
New PDB and TED data increases the number of superfamilies from 5841 to 6573, folds from 1349 to 2078 and architectures from 41 to 77.
CATH v4.4 represents an expansion of ∼64 844 experimentally determined domain structures from PDB. We also present a mapping of ∼90 million predicted domains from TED to CATH superfamilies.
We report a significant expansion of structural information (180-fold) for CATH superfamilies through classification of PDB domains and predicted domain structures from the Encyclopedia of Domains (TED) resource.
For those without access to the Science article, we added a full access link on the TED website (ted.cathdb.info) landing page!
6/6