🎉 Excited to share that the last paper of my PhD is now published in PRX Life!
We introduce RAG-ESM, a retrieval-augmented framework that makes pretrained protein language models (like ESM2) homology-aware with minimal training cost.
📄 Paper: journals.aps.org/prxlife/abst...
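For intuition, here is a minimal sketch of the idea of conditioning a frozen protein language model on retrieved homologs; the module, dimensions, and retrieval step are illustrative assumptions, not the released RAG-ESM code.

import torch
import torch.nn as nn

class HomologyCrossAttention(nn.Module):
    # Hypothetical block: lets a query protein's ESM2 token embeddings attend
    # to the embeddings of a retrieved homolog (residual, homology-aware update).
    def __init__(self, dim=640, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_repr, homolog_repr):
        # query_repr:   (batch, L_query, dim) embeddings of the query protein
        # homolog_repr: (batch, L_hom, dim)   embeddings of a retrieved homolog
        attended, _ = self.attn(query_repr, homolog_repr, homolog_repr)
        return self.norm(query_repr + attended)

Training only a small added module of this kind while keeping the pretrained model frozen is one way the extra training cost can stay minimal.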
Protein-protein interactions studied by @cyrilmalbranke.bsky.social #PragueBioML @elixircz.bsky.social
[8/8] 💻 Resources:
• Training dataset
• 4 pre-trained models (XS → L)
• Code & interactive notebooks
🔗 huggingface.co/collections/...
🔗 github.com/Bitbol-Lab/P...
[7/8] 📊 In conclusion, ProteomeLM shows strong performance across species and benchmarks for both PPI and gene-essentiality prediction. It makes proteome-wide analysis practical, easing large-scale studies, including in complex eukaryotic proteomes.
Gene essentiality results, showing that ProteomeLM outperforms ESM-C and that predictions are good on E. coli, S. cerevisiae, and minimal cells
[6/8] 🎯 Beyond PPIs: ProteomeLM predicts gene essentiality across diverse taxa (e.g. E. coli, yeast, minimal cells), highlighting its potential for broad downstream applications.
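As a loose illustration only (not necessarily the approach used in the preprint), a simple supervised probe on per-protein embeddings is one way such a downstream prediction could be set up; the data layout below is an assumption.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def essentiality_probe(embeddings, essential):
    # embeddings: (n_genes, d) per-protein embeddings from a proteome-scale model
    # essential:  (n_genes,)   binary labels, 1 = gene known to be essential
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, embeddings, essential, cv=5, scoring="roc_auc")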
Bar plot showing the speed improvement over classical DCA methods
Number of predictions as a function of recall, showing the performance leap from classical DCA methods to ProteomeLM on the human interactome (0.73 -> 0.826 AUROC)
Performance on the D-SCRIPT dataset for supervised PPI prediction in four organisms
[5/8] ⚡ This allows unsupervised and supervised PPI prediction at proteome scale in minutes, several orders of magnitude faster than coevolution-based methods such as DCA.
Try it here: github.com/Bitbol-Lab/P...
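To sketch what unsupervised screening could look like once a pairwise attention map over a proteome is available (the symmetrization and ranking below are assumptions, not the exact released procedure):

import numpy as np

def rank_pairs_from_attention(attn, top_k=100):
    # attn: (n_proteins, n_proteins) attention map from a single forward pass over a proteome
    n = attn.shape[0]
    sym = (attn + attn.T) / 2                 # symmetrize the pairwise scores
    rows, cols = np.triu_indices(n, k=1)      # unique unordered pairs (i < j)
    order = np.argsort(sym[rows, cols])[::-1][:top_k]
    return list(zip(rows[order], cols[order], sym[rows, cols][order]))

One forward pass per proteome, rather than one coevolution model per candidate pair, is the kind of difference that would explain such a speedup.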
Heatmap showing that ProteomeLM attention heads can distinguish interacting vs. non-interacting pairs in E. coli, S. cerevisiae, and H. sapiens
[4/8] 🎯 Key finding: Attention heads spontaneously encode protein–protein interaction networks. Some heads can reach an AUC of 0.92 in discriminating interacting vs non-interacting pairs.
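For intuition, one head's discrimination power could be measured roughly like this, assuming its attention score for every evaluated protein pair has been extracted (names are illustrative):

from sklearn.metrics import roc_auc_score

def head_ppi_auc(attn_scores, pairs, labels):
    # attn_scores: (n_proteins, n_proteins) attention map of one head
    # pairs:       list of (i, j) protein index pairs to evaluate
    # labels:      1 for known interacting pairs, 0 for non-interacting pairs
    scores = [attn_scores[i, j] for i, j in pairs]
    return roc_auc_score(labels, scores)  # some heads reach an AUC of 0.92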
[3/8] 🧬 Encoding strategy: Instead of positional encoding, ProteomeLM introduces a functional encoding based on orthologous groups. The model can thus leverage functional context from the other proteins in the proteome rather than gene order. This is especially important in eukaryotes, where gene order is less conserved.
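As a rough illustration (an assumed implementation, not the paper's code), each protein could simply receive a learned embedding of its orthologous-group ID in place of a positional encoding:

import torch
import torch.nn as nn

class OrthologousGroupEncoding(nn.Module):
    # Hypothetical functional encoding: one learned vector per orthologous group,
    # added to each protein's representation; gene order plays no role.
    def __init__(self, n_groups, dim):
        super().__init__()
        self.embed = nn.Embedding(n_groups, dim)

    def forward(self, protein_repr, og_ids):
        # protein_repr: (n_proteins, dim) per-protein input representations
        # og_ids:       (n_proteins,)     orthologous-group index of each protein
        return protein_repr + self.embed(og_ids)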
Figure 1. ProteomeLM Architecture
[2/8] 🧬 Training objective: ProteomeLM uses a custom masked language modeling task, predicting masked ESM-C representations of proteins within the proteome.
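A minimal sketch of that kind of objective, under the assumption that each protein acts as one token represented by a precomputed ESM-C vector and that masked proteins are regressed with an MSE loss (names and details are illustrative, not the paper's code):

import torch
import torch.nn.functional as F

def masked_proteome_loss(model, proteome_repr, mask_frac=0.15):
    # proteome_repr: (n_proteins, d) ESM-C embedding of every protein in one proteome
    n_proteins = proteome_repr.shape[0]
    mask = torch.rand(n_proteins) < mask_frac        # pick proteins to mask
    inputs = proteome_repr.clone()
    inputs[mask] = 0.0                               # hide the masked proteins
    preds = model(inputs.unsqueeze(0)).squeeze(0)    # transformer over the whole proteome
    return F.mse_loss(preds[mask], proteome_repr[mask])  # reconstruct masked ESM-C vectors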
[1/8] 📄 New preprint! With Gionata Paolo Zalaffi & Anne-Florence Bitbol, we introduce ProteomeLM, a transformer that processes entire proteomes (prokaryotes and eukaryotes), enabling ultra-fast protein–protein interaction (PPI) prediction across the tree of life.
🔗 www.biorxiv.org/content/10.1...