David Kelley (@drkbio) Bsky

Will this recipe work for other organisms? We think it depends on genome size and the proportion of nucleotides under selection, which together drive the value of the self-supervised stage and the scale of training data. An exciting question for future work!

6 months ago

This was a massive effort, driven by the incredible work of Calico intern Kuan-Hao Chao (@kuanhaochao.bsky.social). Huge thanks to him, Majed Mohamed Magzoub, and Johannes Linder!

6 months ago

My take: While MPRAs are powerful, they lose vital genomic context like local chromatin and post-transcriptional regulation. For modeling complex gene regulation in vivo, models trained on endogenous sequences are essential.

6 months ago

Each wins on its “home field”:
* MPRA-trained models excel at predicting MPRA data, including variant sequences.
* Shorkie excels at predicting expression from promoters in their natural genomic context and eQTLs.

6 months ago

How does Shorkie compare to models trained on massively parallel reporter assays (MPRAs)?

6 months ago

This translates to variant effect prediction, where Shorkie accurately predicts the impact of cis-eQTLs and outperforms alternative models at classifying influential regulatory variants.

6 months ago

Shorkie also captures dynamic regulatory changes. Using new time-course RNA-seq data from TF inductions, we showed Shorkie can track how the importance of specific TF motifs changes over time.

6 months ago

This pre-training strategy makes a huge difference. Shorkie substantially outperforms the same model trained from scratch, boosting gene-level expression prediction from a Pearson's R of 0.74 to 0.88.
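
For readers less familiar with the metric, here is a minimal sketch of how a gene-level Pearson's R between predicted and measured expression could be computed. The function name and the toy numbers are hypothetical, and this makes no claim about how the paper transforms or aggregates expression values before scoring.

```python
import math


def pearson_r(predicted, measured):
    """Pearson correlation between two equal-length lists of per-gene values."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_m)


# Toy example with made-up values: one entry per gene.
toy_predicted = [1.0, 2.0, 3.0, 4.0]
toy_measured = [2.0, 4.0, 6.0, 8.0]
toy_r = pearson_r(toy_predicted, toy_measured)
```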

6 months ago

But which genomes work best? We trained on different phylogenetic levels, from close S. cerevisiae strains to the fungal kingdom. The Saccharomycetales order was the sweet spot, providing the right balance of diversity and conserved regulatory grammar for the model to learn from.

6 months ago

Our hypothesis: jumpstart supervised learning with self-supervision. Before predicting chromatin and expression, we first asked our model to predict masked-out nucleotides across many related genomes, so that it learns conserved elements like genes and their promoters.
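
A rough sketch of what a masked-nucleotide objective looks like, for illustration only. The mask symbol, the 15% mask rate, and the function names are assumptions for this example and are not taken from the paper; a real model would predict the hidden bases with a neural network rather than the stand-in dictionary used here.

```python
import random

NUCLEOTIDES = "ACGT"
MASK = "N"  # hypothetical mask symbol, chosen for this sketch


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of bases.

    Returns (masked_seq, targets), where targets maps each masked
    position to the original base the model should recover.
    """
    rng = rng or random.Random(0)
    masked = list(seq)
    targets = {}
    for i, base in enumerate(seq):
        if base in NUCLEOTIDES and rng.random() < mask_rate:
            targets[i] = base
            masked[i] = MASK
    return "".join(masked), targets


def masked_accuracy(predictions, targets):
    """Fraction of masked positions where the predicted base is correct."""
    if not targets:
        return 0.0
    correct = sum(1 for i, base in targets.items() if predictions.get(i) == base)
    return correct / len(targets)


# Usage: mask a toy sequence, then score a (perfect) stand-in predictor.
toy_seq = "ACGT" * 25
toy_masked, toy_targets = mask_sequence(toy_seq, mask_rate=0.2, rng=random.Random(1))
toy_score = masked_accuracy(dict(toy_targets), toy_targets)
```

The training signal comes only from the masked positions, which is why no labels beyond the genome sequences themselves are needed at this stage.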

6 months ago

However, yeast's small genome provides limited data, making it tough for deep learning models to learn complex regulatory rules from scratch.

6 months ago

At Calico, we've been studying S. cerevisiae for years to understand replicative aging. Along the way, we've generated rich datasets to probe its regulatory networks, which helped make this work possible.

6 months ago

Predicting dynamic expression patterns in budding yeast with a fungal DNA language model
Predicting gene expression from DNA sequence remains challenging due to complex regulatory codes. We introduce a masked DNA language model pretrained on 165 fungal genomes closely related to budding y...

Excited to share our new paper on predicting gene expression in yeast! We introduce "Shorkie," a supervised ML model that builds off a self-supervised foundation to interpret regulatory DNA.
Preprint: www.biorxiv.org/content/10.1...

6 months ago

AI in Molecular Biology | Keystone Symposia
Join us at the Keystone Symposia on AI in Molecular Biology, September 2025, in Santa Fe, with field leaders!

The poster abstract deadline for the @keystonesymposia.bsky.social AI in Molecular Biology meeting in Santa Fe is coming up on August 21st, so get your submissions in!

www.keystonesymposia.org/conferences/...

8 months ago

borzoi-paper/extensions/prime at main · calico/borzoi-paper
Analyses related to the Borzoi paper.

We’ve done some experiments, but the metrics aren’t conclusive, so choose your own adventure! We’ve released these models as open source and open weight for all to use. github.com/calico/borzo...

8 months ago

We hypothesized that training with cell-type-specific and 3' data might make these models particularly effective for transfer to datasets with similar properties.

8 months ago

Parameter-Efficient Fine-Tuning of a Supervised Regulatory Sequence Model
DNA sequence deep learning models accurately predict epigenetic and transcriptional profiles, enabling analysis of gene regulation and genetic variant effects. While large-scale training models like E...

Transfer learning has emerged as a key application for multitask sequence models like these. For more, check out another recent paper from Han Yuan, whose analysis explores various transfer strategies and shows how powerful this approach can be. www.biorxiv.org/content/10.1...

8 months ago

Hence the name Borzoi Prime, to emphasize their 3’ expertise!

8 months ago

Indeed, he discovered the new models better predict alternative polyadenylation and QTL variants that affect where transcripts get cleaved and polyadenylated. This key regulatory layer influences cell type-specific protein production.

8 months ago

Drawing on his expertise and interest in isoform regulation, Johannes hypothesized that single-cell RNA-seq’s 3’ sequencing protocols might reveal additional capabilities in these models.

8 months ago

Using single-cell eQTL studies, he evaluated the cell-type-specific variant effect predictions and found good concordance.

8 months ago

As cell-type-specific applications emerged, Johannes Linder took a fresh look.

8 months ago

We trained these models in early 2023 (which is why they’re algorithmically similar to the originals), but initial metrics were underwhelming, so we shelved them.

8 months ago

Side note—want your amazing data included in future training runs of open source, open weight models? Make and release BigWig tracks!

8 months ago

We curated several cell atlas collections to produce pseudobulk coverage tracks. Thank you to the CZI Tabula projects and the BICCN Brain Cell Atlas for making this possible!
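
As a rough illustration of the pseudobulk idea: per-cell coverage is summed across all cells that share a cell-type label, giving one track per cell type. This is a simplified sketch with hypothetical names and toy data; it is not the pipeline used for these models, which works on real alignments and BigWig files.

```python
def pseudobulk_coverage(cell_coverage, cell_types):
    """Sum per-cell coverage vectors within each cell type.

    cell_coverage: dict of cell ID -> list of per-position read counts
    cell_types:    dict of cell ID -> cell-type label
    Returns a dict of cell-type label -> summed coverage list.
    """
    tracks = {}
    for cell_id, coverage in cell_coverage.items():
        label = cell_types[cell_id]
        if label not in tracks:
            tracks[label] = [0] * len(coverage)
        track = tracks[label]
        if len(track) != len(coverage):
            raise ValueError("all coverage vectors must span the same region")
        for i, count in enumerate(coverage):
            track[i] += count
    return tracks


# Toy data: three cells, two cell types, a 4-position region.
toy_coverage = {
    "cell1": [1, 0, 2, 0],
    "cell2": [0, 1, 1, 0],
    "cell3": [3, 0, 0, 1],
}
toy_labels = {"cell1": "neuron", "cell2": "neuron", "cell3": "astrocyte"}
toy_tracks = pseudobulk_coverage(toy_coverage, toy_labels)
```

Pooling this way trades single-cell resolution for much denser coverage per track, which is what a sequence-to-coverage model needs as a training target.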

8 months ago

A limitation of the first Borzoi training run was the absence of cell-type-specific RNA-seq tracks; most of its tracks come from heterogeneous bulk samples.

8 months ago

Predicting cell type-specific coverage profiles from DNA sequence
Predicting expression profiles from RNA-seq experiments provides a powerful approach for universal sequence-based variant effect prediction, enabling researchers to score variants that affect total ge...

We’re excited to share a follow-up Borzoi training run and an analysis of the capabilities that emerged. www.biorxiv.org/content/10.1...

8 months ago

Alongside the manuscript and analysis, we released Borzoi predictions for 19.5 million common and low-frequency UK Biobank variants. Code for scoring additional variants with Borzoi is available here: github.com/calico/baske...

9 months ago

Moving forward, we suspect there are further improvements available. The Borzoi predictions cover most body tissues, but they aren’t yet zoomed into specific cell types. Alternative nonlinear heritability models may usurp S-LDSC for fitting variant priors.

9 months ago

Generally, we found that Borzoi predictions improve fine-mapping clarity and gene prioritization. We’re using Sniff to better analyze aging-related trait GWAS at Calico.

9 months ago
