My time in @martinsteinegger.bsky.social's group is ending, but I’m staying in Korea to build a lab at Sungkyunkwan University School of Medicine. If you or someone you know is interested in molecular machine learning and open-source bioinformatics, please reach out. I am hiring!
mirdita.org
Posts by Amy Lu
Just coincidentally found GenBank Release 84.0 from 1994 in the neighboring lab. Anyone out there with an even older version?
In case you missed our ML for proteins seminar on CHEAP compression for protein embeddings back in October, here it is -- thanks @megthescientist.bsky.social for doing so much for the MLxProteins community 🫶
•introduced “zero shot prediction” as a question of guessing a bioassay’s outcome by likelihoods of pLMs
•commented on biases in evolutionary signals from Tree of life used to train pLMs (a favorite paper I read in 2024: shorturl.at/fbC7g)
Thanks @workshopmlsb.bsky.social for letting us share our work!
🔗📄 bit.ly/plaid-proteins
Another straightforward application is generation, either by next-token sampling or MaskGIT style denoising. We made the tokenized version of CHEAP to do generation, and decided to go with diffusion on continuous embeddings instead — but I think either would’ve worked
We trained a model to co-generate protein sequence and structure by working in the ESMFold latent space, which encodes both. PLAID only requires sequences for training but generates all-atom structures!
Really proud of @amyxlu.bsky.social 's effort leading this project end-to-end!
immensely grateful for awesome collaborators on this work: Wilson Yan, Sarah Robinson, @kevinkaichuang.bsky.social, Vladimir Gligorijevic, @kyunghyuncho.bsky.social, Rich Bonneau, Pieter Abbeel, @ncfrey.bsky.social 🫶
6/ We'll get to share PLAID as an oral presentation at MLSB next week 🥳 In the meantime, checkout:
📄Preprint: biorxiv.org/content/10.1...
👩💻Code: github.com/amyxlu/plaid
🏋️Weights: huggingface.co/amyxlu/plaid...
🌐Website: amyxlu.github.io/plaid/
🍦Server: coming soon!
conditioning on organism and function shows that PLAID has learned active site residues and sidechain positions!
5/🚀 ...and when prompted by function, PLAID learns sequence motifs at active sites & directly outputs sidechain positions, which backbone-only methods such as RFDiffusion can't do out-of-the-box.
The residues aren't directly adjacent, suggesting that the model isn't simply memorizing training data:
unconditional generations from PLAID
4/ On unconditional generation, PLAID generates high quality and diverse structures, especially at longer sequence lengths where previous methods underperform...
noising by a diffusion schedule in the latent space doesn't always correspond to the same corruption in the sequence and structure space...
3/ I was pretty stuck until building out the CHEAP (bit.ly/cheap-proteins) autoencoders that compressed & smoothed out the latent space: interestingly, gradual noise added to the ESMFold latent space doesn't actually corrupt the sequence and structure until the final forward diffusion timesteps 🤔
how does the PLAID approach work?
2/💡Co-generating sequence and structure is hard. A key insight is that to get embeddings of the ESMFold latent space during training, we only need sequence inputs.
For inference, we can sample latent embeddings & use frozen sequence/structure decoders to get all-atom structure:
overview of results for PLAID!
1/🧬 Excited to share PLAID, our new approach for co-generating sequence and all-atom protein structures by sampling from the latent space of ESMFold. This requires only sequences during training, which unlocks more data and annotations:
bit.ly/plaid-proteins
🧵