New paper showing that much of the apparent success of protein language models in predicting mutational effects is a mirage: These models mostly memorize sites. 1/
www.biorxiv.org/content/10.6...
Posts by Dmitry Penzar
The key ingredient of our solution was MPRA-LegNet, but we also incorporated a large number of new ideas to master the challenge.
It’s inspiring that the second-place team also used LegNet as the basis for their solution.
More details to come
Our team achieved first place in the CAGI7 lentiMPRA challenge on predicting the effects of single-nucleotide mutations in regulatory elements, surpassing the nearest competitors by a significant margin.
(13/13) In turn, the wider set of data for Final TFs remains suitable for offline benchmarking with the open-source bibis framework (github.com/autosome-ru/...). The whole story can be found on bioRxiv: doi.org/10.1101/2025....
(12/13) The Leaderboard benchmarking platform, including the preprocessed data, benchmarking protocols, and rich documentation, remains fully functional and accessible online (ibis.autosome.org) to facilitate the development of future TFBS models.
(11/13) However, those changes did not translate into better prediction of SNP effects. Additionally, pre-initialization of the first convolutional layers with the best available PWMs for the corresponding TFs didn't yield any notable performance gain.
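A rough illustration of what PWM pre-initialization of a first convolutional layer means, written in plain NumPy rather than the actual LegNet code. The matrix `toy_pwm`, the helper names, and the forward-strand-only scan are all assumptions made for this sketch. Copying a log-odds PWM into a convolution kernel makes the untrained filter reproduce the classic PWM scan on one-hot encoded DNA:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (4, length) one-hot matrix, rows ordered A, C, G, T."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx].T

def pwm_to_conv_kernel(pwm):
    """Turn a (width, 4) log-odds PWM into a (4, width) kernel for one-hot input."""
    return np.asarray(pwm, dtype=float).T

def conv1d_valid(x, kernel):
    """Single-filter 1D cross-correlation without padding (a 'valid' convolution)."""
    width = kernel.shape[1]
    n_windows = x.shape[1] - width + 1
    return np.array([np.sum(x[:, i:i + width] * kernel) for i in range(n_windows)])

# Hypothetical 3-bp log-odds matrix that prefers the word "TGA" (assumption, not real TF data).
toy_pwm = [
    [-1.0, -1.0, -1.0, 1.0],   # position 1 prefers T
    [-1.0, -1.0, 1.0, -1.0],   # position 2 prefers G
    [1.0, -1.0, -1.0, -1.0],   # position 3 prefers A
]

# Before any training, the filter activations equal the PWM scores of each window.
activations = conv1d_valid(one_hot("CCTGACC"), pwm_to_conv_kernel(toy_pwm))
```

In a real network the kernel would then be trained further, so this only shows the starting point of such an initialization.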
(10/13) We conducted ablation studies on LegNet. Minor modifications, such as replacing global average pooling with global max pooling in the SE block, led to substantial performance gains, making the resulting model the best in the post-challenge assessment.
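For readers unfamiliar with the pooling swap, here is a minimal NumPy sketch of a generic squeeze-and-excitation (SE) block with a switchable global pooling step. It is not the LegNet implementation: the layer sizes, the weights `w1` and `w2`, and the function name are assumptions chosen for illustration only.

```python
import numpy as np

def se_block(x, w1, w2, pool="avg"):
    """Generic SE block for a (channels, length) feature map.

    w1 has shape (hidden, channels) and w2 has shape (channels, hidden).
    """
    if pool == "avg":
        squeezed = x.mean(axis=1)   # global average pooling over positions
    elif pool == "max":
        squeezed = x.max(axis=1)    # global max pooling over positions
    else:
        raise ValueError("pool must be 'avg' or 'max'")
    hidden = np.maximum(w1 @ squeezed, 0.0)          # bottleneck layer with ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gate per channel
    return x * gates[:, None]                        # rescale every channel

# Toy feature map: channel 0 has one sharp peak, channel 1 is silent.
x = np.zeros((2, 4))
x[0, 1] = 4.0
w1 = np.eye(2)  # hypothetical weights, identity for readability
w2 = np.eye(2)

out_avg = se_block(x, w1, w2, pool="avg")
out_max = se_block(x, w1, w2, pool="max")
```

With average pooling a single strong motif hit is diluted across the sequence length, while max pooling passes the peak straight to the gate. This is one plausible intuition for the difference, not a result stated in the thread.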
(9/13) Post-challenge analysis added extra DL models: top models from the DREAM challenge and popular architectures unused in IBIS, including Malinois and DNA language models. Fine-tuned DNA LMs performed far worse than fully supervised approaches.
(8/13) TF-binding models can be used to predict the effect of single-nucleotide variants. In A2G, PWMs performed unexpectedly well, e.g. MEX secured 2nd place. In G2A, the original top triple-A models dominated, followed by MEX and RSAT — the strongest PWM-based approach.
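A minimal sketch of how a PWM can score a single-nucleotide variant, assuming a log-odds matrix and a best-hit, forward-strand-only scoring rule. The matrix `toy_pwm` and both function names are hypothetical, and the scoring protocol actually used in IBIS may differ.

```python
import math

BASES = "ACGT"

def best_pwm_score(seq, pwm):
    """Highest log-odds score of the PWM over all windows of seq (forward strand only)."""
    width = len(pwm)
    best = -math.inf
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][BASES.index(seq[start + i])] for i in range(width))
        best = max(best, score)
    return best

def snv_effect(ref_seq, pos, alt_base, pwm):
    """Predicted effect of a substitution: best score with alt allele minus best score with ref."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return best_pwm_score(alt_seq, pwm) - best_pwm_score(ref_seq, pwm)

# Hypothetical 3-bp log-odds matrix (rows: motif positions, columns: A, C, G, T) preferring "TGA".
toy_pwm = [
    [-1.0, -1.0, -1.0, 1.0],
    [-1.0, -1.0, 1.0, -1.0],
    [1.0, -1.0, -1.0, -1.0],
]

# G>C at 0-based position 3 disrupts the "TGA" site in the reference sequence.
delta = snv_effect("CCTGACC", 3, "C", toy_pwm)
```

A negative `delta` means the variant weakens the best predicted binding site; here the best score drops from 3 to 1.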
(7/13) Yet, several deep learning (DL) approaches failed substantially in cross-experiment validation, in some cases performing far worse than PWMs. Unlocking the full potential of DL clearly requires careful architectural and training design.
(6/13) Performance of the solutions varied substantially across TFs and experimental platforms. The top-scoring ML models outperformed PWM-based IBIS solutions from the competition and our PWM baseline from Codebook MEX (x.com/VorontsovIE/...).
(5/13) Once again, we congratulate the runner-up teams (Medici, Salimov & Frolov lab, callitmagic), and the winners (Bench Pressers, mj, and Biology Impostor) (x.com/halfacrocodi...)
(4/13) Participants employed a wide range of methods from classic motif discovery with position-specific weight matrices (PWMs) to arbitrary advanced approaches (triple-As), including CNNs, RNNs, gradient boosting, and even more exotic approaches.
(3/13) For the first time, the IBIS Challenge assessed in depth the transferability of DNA motif models from artificial to genomic sequences (A2G), and vice versa (G2A), with rigorous test-train splits, multiple performance metrics, and a transparent ranking system.
(2/13) TFs orchestrate transcriptional programs by recognizing short DNA motifs. The long-standing goal is to develop reliable models of TFs' DNA binding specificities and avoid biases of particular experimental assays (x.com/halfacrocodi...).
(1/13) Excited to share the outcome of the IBIS Challenge! The IBIS challenge united dozens of teams across the world in tackling the problem of modeling transcription factor (TF) binding specificity using a diverse collection of experimental datasets for understudied human TFs.
Excited / nervous to share the “magnum opus” of my postdoc in Andreas Wagner’s lab!
"De-novo promoters emerge more readily from random DNA than from genomic DNA"
This project is the culmination of 4 years of work, and lays the foundation for my future group. In short, we… (1/4)
Out in Cell @cp-cell.bsky.social: Design principles of cell-state-specific enhancers in hematopoiesis
🧬🩸 screen of fully synthetic enhancers in blood progenitors
🤖 AI that creates new cell state specific enhancers
🔍 negative synergies between TFs lead to specificity!
www.cell.com/cell/fulltex...
🧵
Finally published! We developed an epigenomics-to-therapeutics screening approach that identifies naturally occurring elements that can titrate expression of transgenes at various levels, including single elements stronger than the β-globin LCR. www.nature.com/articles/s41...
Our preprint on designing and editing cis-regulatory elements using Ledidi is out! Ledidi turns *any* ML model (or set of models) into a designer of edits to DNA sequences that induce desired characteristics.
Preprint: www.biorxiv.org/content/10.1...
GitHub: github.com/jmschrei/led...
We share a lot of our ideas, code, datasets (that we spend years sanitizing) early. Often way before we release preprints. We do this so that others can use, build on, improve & even "beat" our approaches. But I want to say a few things about some simple expectations 1/
We wrote a review article on modelling and design of transcriptional enhancers using sequence-to-function models.
From conventional machine learning methods to CNNs and using models as oracles/generative AI for synthetic enhancer design!
@natrevbioeng.bsky.social
www.nature.com/articles/s44...
Super excited to announce our latest work. On a personal note, it's not an exaggeration to say that blood, sweat, and tears got us to the finish line on this: working w/ an outstanding global team of scientists in Germany, Japan, Russia, and USA, responding to >100 pages of complex reviewer comments.
Finally out! We present EXTRA-seq, a new EXTended Reporter Assay to quantify endogenous enhancer-promoter communication at kb scale!
www.biorxiv.org/content/10.1...
A 🧵about what it can do:
#SynBio #DeepLearning #GeneRegulation
Wonderful.
Just two weeks ago I was explaining to a junior colleague the problem of exaggerated claims in science. This paragraph is exactly what should be printed in place of a user agreement when anybody submits a paper.
Join us for our next Kipoi Seminar with Dmitry Penzar,
@pensarata.bsky.social @ autosome.org!
👉LegNet: parameter-efficient modeling of gene regulatory regions using modern convolutional neural network
📅Wed Dec 4, 5:30pm CET
🧬 kipoi.org/seminar/
(1/6) 🐦🔥 In IBIS #ibischallenge, we challenged teams from all over the world to decipher the DNA recognition code of human transcription factors. The IBIS Final Conference took place on November 27, 2024. Recordings and slides: disk.yandex.ru/d/82FEnwPn15...