This is just a trajectory being played forward in time! No reaction coordinate. We're using Langevin dynamics (essentially running iterative noising and denoising steps with a small time increment) to simulate coarse-grained dynamics instead of drawing i.i.d. samples.
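For anyone curious what one of those steps looks like, here's a minimal overdamped-Langevin sketch in plain NumPy. The quadratic toy energy, step size, and temperature are placeholders for illustration, not our actual model or settings:

```python
import numpy as np

def langevin_step(x, grad_energy, dt=1e-2, kT=1.0, rng=None):
    """One overdamped Langevin update: drift down the energy
    gradient, plus Gaussian noise scaled by temperature and dt."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(x.shape)
    return x - grad_energy(x) * dt + np.sqrt(2.0 * kT * dt) * noise

# Toy example: quadratic energy E(x) = 0.5 * |x|^2, so grad E(x) = x.
# Iterating the update samples (approximately) from exp(-E(x)/kT).
x = np.ones(3)
for _ in range(1000):
    x = langevin_step(x, lambda y: y)
```

With a learned EBM you'd swap the toy gradient for the model's energy gradient; the update rule itself is the same.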
Posts by jproney
If you think this seems cool and useful, check out our updated preprint and GitHub:
www.biorxiv.org/content/10.6...
github.com/jproney/Prot...
Additionally, we conducted an initial study on protein folding pathways! We ran direct Langevin simulations on Protein G, NuG2, and Protein L. We found that Protein G folds through the C terminus while NuG2 and Protein L are shifted toward the N terminus, which is consistent with experiment!
This difference is especially pronounced for proteins with few homologues, such as de novo proteins.
ProteinEBM-x gives a huge boost in performance, both in ranking and stability prediction. In particular, ProteinEBM-x achieves state-of-the-art results at zero-shot stability ranking in ProteinGym, outperforming PLMs with over 15x the parameters.
Diffusion models actually learn a series of time-indexed energy landscapes, which are corrupted with different amounts of noise. The ranking ability of ProteinEBM peaked slightly above t=0. Inspired by this finding we trained an "expert" model only on low time levels, which we call ProteinEBM-x.
To recap our original method, we used denoising score matching to train an energy-based model that approximates the free energies of protein conformations. We found that this worked well for ranking protein structures and predicting the effects of mutations on stability.
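Rough sketch of what a denoising score matching objective looks like. The `score_fn` interface and σ-weighting here are illustrative, not our exact training code:

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng=None):
    """Denoising score matching: corrupt x with Gaussian noise and
    regress the model's score at the noisy point onto the score of
    the corruption kernel, -(x_noisy - x) / sigma**2. For an EBM,
    score_fn would be the negative energy gradient, -grad E(x)."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(x.shape)
    x_noisy = x + sigma * noise
    target = -(x_noisy - x) / sigma**2   # = -noise / sigma
    diff = score_fn(x_noisy) - target
    # sigma**2 weighting keeps the loss scale comparable across noise levels.
    return 0.5 * (sigma**2) * np.mean(np.sum(diff**2, axis=-1))
```

Minimizing this over data and noise levels drives the model's score (and hence its energy) toward the smoothed data distribution.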
I'm excited to announce some major updates to our ProteinEBM paper with Chenxi Ou @sokrypton.org!
They are kidnapping young children and murdering citizens in the street in broad daylight and lying to us when we can see what they are doing. This will be in textbooks and generations from now will ask us what we stood for. 🖕🧊
Cool!
But for the CA representation used by our EBM, it seems like the entropy due to local backbone and side chain fluctuations might be much more tractable?
I might be wrong, but it seems like the difficulty of learning chain entropy would be very related to the level of coarse graining. If you're looking at a reaction coordinate like fraction of native contacts, each macrostate contains a massive number of chain conformations.
Thanks for the insight! I'll definitely keep this in mind for future analyses. For what it's worth, I might caution against interpreting the decoy ranking plot as a "folding landscape," since all those decoys are basically folded, just in wrong conformations.
Also let me know if any of that sounds wrong! Feedback from someone with your level of MD/StatMech knowledge is very much appreciated 🙂
Good question! Ideally it should be a free energy because it learns the PMF over coarse-grained states (at the temperature of the MD data), integrating out fine-grained DOFs. To test if this is true in practice maybe we could examine systems where entropy over the fine DOFs plays a big role?
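To spell out what I mean by the PMF (notation here is mine, just for illustration): given an all-atom potential $U$ and a coarse-graining map $M$, the coarse-grained free energy at state $X$ is

```latex
F(X) = -k_B T \,\ln \int e^{-U(x)/k_B T}\,\delta\!\big(M(x) - X\big)\,dx
```

so the fine-grained degrees of freedom are integrated out at the temperature of the MD data, and their entropy shows up inside $F$.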
As a bonus, here's a video of ProteinEBM folding up the fast-folder NTL9, rendered in stunning 2D by py2Dmol from @sokrypton.org! We hope models like ProteinEBM can serve as a step toward solving the "real" protein folding problem.
If this sounds interesting to you, check out the preprint on bioRxiv: www.biorxiv.org/content/10.6....
All of our code and model weights are available at github.com/jproney/Prot.... Thanks for listening!
This research builds on my undergraduate work using AlphaFold2 for structure scoring. Compared to AF2Rank, ProteinEBM is more efficient and versatile, and has a firmer theoretical foundation. We see ProteinEBM as an important step toward developing physically-grounded ML models for protein science.
And finally, we combined large-scale ProteinEBM sampling with AF2Rank to create an ab initio structure prediction protocol that beats massive sampling from both AlphaFold2 and AlphaFold3 in the MSA-free regime.
When used for sampling fast-folding proteins, ProteinEBM produces energy funnels with minima very close to the native structures.
ProteinEBM can also rank the effects of mutations on stability, with accuracy comparable to sequence-supervised models like ProteinMPNN, despite not being trained to predict sequences.
ProteinEBM performs very well at ranking the correctness of candidate protein structures, and compares favorably to Rosetta in terms of ranking correlations.
Introducing ProteinEBM: a fast, transferable Energy-Based Model for protein conformations. ProteinEBM is trained using energy-based score matching. After training on sequence-structure pairs and MD data, the model energy should match the log data density (i.e., the free energy landscape)!
A very general approach to protein modeling is the development of energy functions that describe protein conformational landscapes. With a good enough energy, you can use optimization to predict structures, simulate dynamics, estimate conformational preferences, predict stabilities, and more.
I'm super excited to announce the first preprint of my PhD, together with Chenxi Ou and @sokrypton.org!
ML has revolutionized protein modeling, but crucial challenges remain. For example, we can't reliably predict complicated protein structures without MSAs, which limits what we can design.
Trump’s war on science is an attack against anyone who has ever loved someone with cancer.
The American people do not want us to slash cancer research in order to give more tax breaks for billionaires.
Genuinely proud to be a Harvard alum today
And... So begins the death of science, technology, and innovation in America...
www.wbur.org/news/2025/03...
Stand up for science flyer, happening on March 7th in DC & nationwide with more information at www.standupforscience2025.org
WHERE WILL YOU BE TOMORROW?
‘“id love to do some posts on here” -dril’- me.