Excited to share our new paper @nature.com! We developed PerturbFate, a scalable single-cell platform to discover how diverse genetic perturbations converge on a shared drug-resistant cell state, and key programs driving it, led by our incredible Zihan Xu from @rockefeller.edu
rdcu.be/fdC14
Posts by cxqiu
CREsted is finally out! You can find the article, together with a summarizing Research Briefing, in thread.
Latest from Shendure & Qiu labs (@cxqiu.bsky.social)! We combined a new 4M-cell mouse whole-embryo scATAC-seq atlas (E10-P0), millions of 'evolutionarily coherent' orthologs from 241 mammalian genomes (Zoonomia), and the CREsted CNN framework (@steinaerts.bsky.social).
We thank Q-T-π (Canis familiaris), Tater and Tot (Felis catus) for inspiration. Nothing in biology makes sense except in the light of evolution, apparently including AI.
Huge team effort led by CX Qiu, Riza Daza, and Ian Welsh. Jay Shendure supervised the project. Key contributions from Rupali Patwardhan, @niklaskemp.bsky.social & @steinaerts.bsky.social (CREsted), built on our mouse timelapse with @coletrapnell.bsky.social. Grateful to the whole team and Zoonomia.
Everything is open: interactive preprint, count matrices, models, all 7,712 prediction tracks, code & reproducible figures: doi.org/10.62329/hxkk6249. Raw data: GEO GSE325776. Code: github.com/ChengxiangQiu/jax-atac-code
Limitations we're upfront about: promoter suppression is still heuristic, some species bias remains, and all labels derive from mouse. Matched atlases in a few more species would help a lot. This is v1; substantial headroom remains.
Model organisms aren't just for cataloging biology; they're training substrates for AI models of human biology. Mouse experimental depth + mammalian sequence diversity = virtual access to human regulatory landscapes we can't profile directly.
We applied STEAM to all 241 Zoonomia genomes: 32 × 241 = 7,712 genome-wide enhancer tracks. HumMus for human + mouse. BabaGanoush for the full spread!
Some favorites (human enhancers with no mouse ortholog, validated by fetal accessibility):
> FECH intron 1 (erythroid, heme biosynthesis)
> upstream of TFRC (erythroid, iron uptake)
> upstream of APOB (hepatocyte, LDL cholesterol)
> upstream of CYP2C19 (hepatocyte, drug metabolism)
Even for human enhancers with NO mouse ortholog at all, STEAM predicts the right cell class. Hepatocyte-predicted elements are more accessible in human hepatoblasts; erythroid-predicted elements in erythroblasts. 7× difference in the expected direction.
Key validation: human-only predicted enhancers are 8–9× more accessible than mouse-only predictions in the corresponding human fetal cell type, using Domcke et al. human fetal accessibility data the model never saw. Evolutionary transfer learning works.
We apply STEAM genome-wide to human + mouse: ~340K enhancers per species across 32 cell classes. Jaccard co-occurrence of enhancer predictions recovers nearly identical lineage structure in both species: regulatory logic is deeply shared.
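For readers unfamiliar with the metric, here is a minimal sketch of a Jaccard co-occurrence computation between cell classes. It is not the preprint's code: the toy bin IDs, the `predictions` dictionary, and the helper names are all hypothetical, and it assumes each cell class is represented as a set of genomic bins predicted as enhancers.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def jaccard_matrix(preds: dict) -> dict:
    """Pairwise Jaccard index for every pair of cell classes (keys sorted by name)."""
    classes = sorted(preds)
    return {(a, b): jaccard(preds[a], preds[b]) for a, b in combinations(classes, 2)}

# Hypothetical toy data: genomic bin IDs predicted as enhancers in each cell class.
predictions = {
    "hepatocyte": {1, 2, 3, 4},
    "erythroid": {3, 4, 5, 6},
    "neuron": {7, 8},
}

matrix = jaccard_matrix(predictions)
for pair, score in matrix.items():
    print(pair, round(score, 3))
```

Running the same computation per species and clustering the resulting matrices would be one way to compare lineage structure between human and mouse.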
STEAM resolves 11 synteny groups of hepatocyte enhancers at this locus: orthologous enhancer families with shared ancestry but divergent sequences. Some are deeply conserved, others lineage-restricted (e.g. one group found only in Old World monkeys).
The payoff: at the Afp locus, hepatocyte enhancer predictions jump from 1.2/species to 4.6/species across 136 mammals. Signal-to-noise goes from 3× to 15×. The Mus-restricted bias largely disappears.
Performance scales with phylogenetic breadth, plateauing at ~32 species, but even partial inclusion accelerates convergence dramatically.
Enter STEAM: we augment training with syntenic enhancer orthologs from up to 241 Zoonomia genomes, a ~200× expansion in sequence diversity, preserving cell-class labels. Orthologous enhancers are nature's data augmentation: divergent sequences, shared function.
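The augmentation idea can be sketched in a few lines. This is an illustration under stated assumptions, not STEAM's actual pipeline: the `Example` type, the `augment_with_orthologs` helper, and the toy sequences are hypothetical. It assumes orthologs have already been identified and are keyed by the mouse enhancer they derive from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    species: str
    sequence: str
    cell_class: str  # label inherited from the mouse enhancer

def augment_with_orthologs(mouse_enhancers: dict, orthologs: dict) -> list:
    """Add each ortholog as a new training example that keeps the mouse label.

    mouse_enhancers: enhancer_id -> (sequence, cell_class)
    orthologs:       enhancer_id -> {species: orthologous sequence}
    """
    augmented = []
    for enhancer_id, (sequence, cell_class) in mouse_enhancers.items():
        augmented.append(Example("mouse", sequence, cell_class))
        for species, ortho_seq in orthologs.get(enhancer_id, {}).items():
            augmented.append(Example(species, ortho_seq, cell_class))
    return augmented

# Hypothetical toy data.
mouse_enhancers = {
    "enh1": ("ACGTAC", "hepatocyte"),
    "enh2": ("TTGACA", "erythroid"),
}
orthologs = {"enh1": {"human": "ACGTTC", "dog": "ACGAAC"}}

training_set = augment_with_orthologs(mouse_enhancers, orthologs)
print(len(training_set), "examples from", len(mouse_enhancers), "mouse enhancers")
```

Enhancers without any ortholog (like `enh2` here) simply contribute their mouse sequence alone.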
But the evolution-aware model doesn't transfer across species. At the Afp locus, hepatocyte predictions light up in Mus and go dark elsewhere. The culprit: insufficient sequence diversity from training on one genome.
These predicted enhancers inform gene expression: stronger enhancer predictions near a gene → higher cell-class-specific expression, with clear distance-dependent decay. This holds across all 32 developmental lineages.
Compare the Alb/Afp locus: the evolution-naive model predicts promoters, tandem-repeat artifacts, and real enhancers. The evolution-aware model distills this to six clean hepatocyte-specific elements, trimmed to core regions of 154–474 bp.
Filtering yields 32 cell-class-specific enhancer clusters that map one-to-one onto developmental lineages, plus one large promoter cluster, cleanly separated. The evolution-aware model trained on these eliminates both failure modes.
Step three: use evolution to clean up. Real enhancers should be syntenically retained across mammals AND show coherent predicted activity across orthologs. Both filters yield clean bimodal distributions: nature's quality control.
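A rough sketch of how two such filters could be combined follows. The definitions here are assumptions made for illustration: retention is taken as the fraction of species with a syntenic ortholog, coherence as the fraction of those orthologs whose top predicted class matches the mouse label, and both 0.5 thresholds are hypothetical placeholders rather than values from the preprint.

```python
def passes_filters(cell_class, ortholog_top_classes, n_species,
                   min_retention=0.5, min_coherence=0.5):
    """Return True if a candidate enhancer passes both evolutionary filters.

    cell_class:           class predicted for the mouse sequence
    ortholog_top_classes: top predicted class for each species where a
                          syntenic ortholog was found
    n_species:            number of species searched
    """
    if not ortholog_top_classes:
        return False  # no orthologs retained at all
    retention = len(ortholog_top_classes) / n_species
    matches = sum(c == cell_class for c in ortholog_top_classes)
    coherence = matches / len(ortholog_top_classes)
    return retention >= min_retention and coherence >= min_coherence

# Hypothetical candidates, each searched across 4 species.
retained_and_coherent = passes_filters("hepatocyte", ["hepatocyte"] * 3, 4)
poorly_retained = passes_filters("hepatocyte", ["hepatocyte"], 4)
incoherent = passes_filters("hepatocyte", ["hepatocyte", "erythroid", "erythroid", "neuron"], 4)
print(retained_and_coherent, poorly_retained, incoherent)
```

If both scores really are bimodal, as the post reports, the exact threshold should matter little as long as it falls between the two modes.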
But when we tile the WHOLE genome? Two failure modes: tandem repeats generate massive false positives (24× enrichment), and promoter grammar contaminates distal enhancer predictions. Strong performance on peaks ≠ reliable genome-wide inference.
Step two: train CREsted to predict cell-class-specific accessibility from DNA sequence. Strong performance on held-out peaks (r = 0.74), with lineage structure clearly recovered.
The atlas integrates tightly with our matched scRNA-seq timelapse (11M cells, same embryo cohort). Nearest neighbors in the co-embedding land right on matching timepoints.
Step one: the atlas. 3.9M nuclei by sci-ATAC-seq3 from 36 whole mouse embryos, one per 6-hour bin, E10 to P0. No dissection. We resolve 13 lineages → 36 cell classes → 140 cell types across the full arc of organogenesis.
The core idea: cis-regulatory sequences evolve fast, but the trans-acting programs that read them evolve slowly. This mismatch (the same principle that powered AlphaFold) means models trained on mouse enhancers should generalize to orthologous cell types across Mammalia.
New preprint @cxqiu.bsky.social @jshendure.bsky.social! Can we learn regulatory grammars of human cell types by training on mouse development and transferring across 241 mammalian genomes? Introducing STEAM & a whole-organism scATAC-seq atlas from E10 to birth.
www.biorxiv.org/content/10.6...
Grateful to my mentors @jshendure.bsky.social, @coletrapnell.bsky.social, @cbmoens.bsky.social, @wnoble.bsky.social, Bob Waterston, @ksusztak.bsky.social, and Qinghua Cui for their guidance and support along the way!!
Thrilled to share I've started my lab at Dartmouth's Geisel School of Medicine! We focus on mapping cellular trajectories & TF networks in development and Mendelian disorders, exploring new therapies. Join us: postdocs, grads, and scientists welcome! sites.dartmouth.edu/qiulab/