#
Hashtag
#gi2025

Had an amazing time at the CSHL Genome Informatics 2025 conference!
Presented my poster: “Investigating quiescence-specific roles of FUN30 in chromatin compaction” 🧬
@chrstne.bsky.social
#GI2025 #Chromatin #Genomics #Quiescence

10 1 0 0

Next, Johanna von Wachsmann
@johannavw.bsky.social presented "Gemsparcl—Rapid and consistent genome clustering for navigating bacterial diversity with millions of genomes" #GI2025

3 0 0 0

Anshul Kundaje @anshulkundaje.bsky.social presented "Deep learning models of regulatory DNA—A comparison of model design choices" #GI2025
He focused more on task-specific models & showed multi-task models lack causal interpretability.
ChromBPNet: doi.org/10.1101/2024.12.25.630221
encodeproject.org

1 0 1 0

Today is the last day of Genome Informatics #GI2025. The first talk was presented by Genrietta Yagudayeva on "A reproducible RNA-seq pipeline for mitogenomics and barcoding phylogenetics in neglected biodiversity"

3 0 1 0

Ryan Moreno from @sroyyors.bsky.social Lab presented "Integrating single-cell omics data across species using matrix factorization regularized by gene-level phylogenies"

A very cool application of gene orthology across species for single-cell expression analysis! #orthology #GI2025 #singlecell

3 0 1 0

Really excited to see our new work on scaling Mumemto to any size pangenome published in Genome Research this morning. And right on cue with the great opportunity to present this work at #GI2025 this week.

16 5 1 0

Li Song @mourisl.bsky.social presented "Quality control of single-cell ATAC-seq data without peak calling using Chromap"

Chromap-QC: biorxiv.org/content/10.1101/2025.07.15.664951
Chromap: nature.com/articles/s41467-021-26865-w #GI2025

1 0 1 0

Hanchen Wang presented "Biomni—A general-purpose biomedical AI agent" #GI2025
doi.org/10.1101/2025.05.30.656746

1 0 2 0

Third day covered! Thank you Sina #GI2025

3 0 0 0

Next, Jacob H. Wynne presented "Integrating targeted experiments with deep learning to resolve biogeochemical mechanisms in an antarctic microbiome" #GI2025

2 0 1 0

#GI2025 Vikram Shivakumar from Ben Langmead's lab (@benlangmead.bsky.social) presents "MumemtoM - partitioned Multi-MUM finding for scalable pangenomics". Now published in Genome Research @genomeresearch.bsky.social. Read full text here ➡️ tinyurl.com/Genome-Res-2...

10 5 0 1

OCR Ortholog Open Chromatin Status Prediction Framework Overview. a We trained a convolutional neural network (CNN) for predicting brain open chromatin using sequences underlying brain open chromatin region (OCR) orthologs in a small number of species and used the CNN to predict brain OCR ortholog open chromatin status across the species in the Zoonomia Consortium. Specifically, we used the sequences underlying the orthologs for which we have brain open chromatin data to train a CNN for predicting open chromatin. Then, we used the CNN to predict the probability of brain open chromatin for all brain OCR orthologs; predictions are illustrated on the right. Animals for which we do not have open chromatin data are in dark gray instead of black to indicate that their brain open chromatin is imputed. While we cannot evaluate the accuracy of most of our predictions, obtaining open chromatin data from most tissues in most species is infeasible, so predictions might be the best OCR annotations that we can obtain. b To demonstrate that our models can accurately predict whether sequence differences between species are associated with open chromatin differences, in addition to the evaluations described in previous work [57], we evaluated our performance on species-specific open chromatin for a species not used in model training and clade-specific open and closed chromatin for clades not used in model training. Since such regions often comprise a minority of OCR orthologs, models could obtain good overall performance while obtaining poor performance on such regions. We also evaluated our performance on tissue-specific open and closed chromatin for a tissue not used in model training, where we expect models to predict 0 if model learns sequence signatures related to the tissue used in training. c Full mouse test set and lineage-specific OCR accuracy evaluations for mouse sequence-only brain model, illustrating that, even for the best of these models,

Third day of Genome Informatics #GI2025 began with an exciting session on “AI, ML and Integrative Genomics” chaired by Irene Kaplow & Thomas Pierrot.
The first talk, by Irene Kaplow, focused on Challenges in Predicting Enhancer Activity Differences Between Species
doi.org/10.1186/s12864-022-08450-7

14 2 1 1

We are all sad that we didn’t have the opportunity to hear Adam Phillippy @aphillippy.bsky.social speak at the Genome Informatics #GI2025 due to the current circumstances. It would have been fascinating to learn about his work on the evolution of human acrocentric chromosomes.

5 0 1 0
https://arxiv.org/abs/2503.17547
Learning Multi-Level Features with Matryoshka Sparse Autoencoders

The second day concluded with a fantastic talk by Cristina Martin Linares on "Minimal reconstruction of SpliceAI using distilled matryoshka sparse autoencoders"

They showed that matryoshka SAEs arxiv.org/abs/2503.17547 improve upon openSpliceAI elifesciences.org/reviewed-preprints/107454. #GI2025

4 1 1 0

a, Left—Fasta representation of an individual SARS-CoV-2 genome consists of sample name followed by the entire ≈ 30 kbp genome sequence. Right—MAPLE format records only the differences between the genome under consideration and a reference; columns represent the variant character observed, the position along the genome and (when necessary) the number of consecutive positions for which the character is observed. b, Left—an example likelihood vector at an internal node of a phylogenetic tree (shown by the narrow blue arrow; only a small portion of the tree is shown); for simplicity, we show only ten genome positions. At each position (rows 1–10), each column contains the likelihood for a specific nucleotide. For rows 1–9, the likelihood is concentrated at only one nucleotide (highlighted in green), while for position 10, we show an example with more uncertainty. Right—MAPLE representation of these node likelihoods. Assuming that the reference sequence at the first nine positions matches the most likely nucleotides in the vector (ATTAAAGGT), then for positions 1–9, the likelihood of nonreference nucleotides is negligible and we represent the likelihoods with a single symbol (R). At position 10, due to non-negligible uncertainty, we explicitly calculate and store the four relative likelihoods. c, Examples of likelihood calculation steps in MAPLE. Red arrows represent the flow of information from the tips to the root of the tree. Left—if two child nodes are in reference state R for a region of the genome (here, positions 1–9), then MAPLE assumes that their parent is also in state R. Right—if at a genome position (here, position 10), two child nodes have likelihoods concentrated at different nucleotides, then for their parent, we explicitly calculate the relative likelihoods of all four nucleotides.

Nicola De Maio presented "Maximum likelihood phylogenetics at pandemic scales" and discussed the importance of scalable phylogenetics in genomic epidemiology. #GenomeInformatics #GI2025
MAPLE: nature.com/articles/s41588-023-01368-0
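
The difference-only encoding described in the figure caption above can be sketched in a few lines. This is an illustration only, not MAPLE's code or exact file format: the function names `encode_diffs` and `decode_diffs` and the tuple layout are assumptions, the sequences are assumed to be already aligned to the reference, and the run length is always stored here (the caption says it is recorded only when necessary).

```python
def encode_diffs(reference: str, genome: str) -> list[tuple[str, int, int]]:
    """Record only where `genome` differs from `reference`.

    Each entry is (observed character, 1-based position, run length),
    mirroring the three columns mentioned in the caption.
    """
    if len(reference) != len(genome):
        raise ValueError("this sketch assumes aligned, equal-length sequences")
    entries = []
    i, n = 0, len(genome)
    while i < n:
        if genome[i] == reference[i]:
            i += 1
            continue
        ch, start = genome[i], i
        # Extend the run while the same non-reference character repeats.
        while i < n and genome[i] == ch and genome[i] != reference[i]:
            i += 1
        entries.append((ch, start + 1, i - start))
    return entries


def decode_diffs(reference: str, entries: list[tuple[str, int, int]]) -> str:
    """Rebuild the full sequence from the reference plus the recorded differences."""
    seq = list(reference)
    for ch, pos, length in entries:
        seq[pos - 1:pos - 1 + length] = ch * length
    return "".join(seq)


reference = "ATTAAAGGTT"
genome = "ATTANNGGTC"
diffs = encode_diffs(reference, genome)
print(diffs)
print(decode_diffs(reference, diffs) == genome)
```

For genomes that are nearly identical to the reference, as in the SARS-CoV-2 case the caption describes, the list of differences is far shorter than the roughly 30 kbp sequence itself.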

1 1 1 0

#GI2025 Ilias Georgakopoulos-Soares presents "Quadrupia - a comprehensive catalog of G-quadruplexes across genomes from the tree of life". Now published in Genome Research @genomeresearch.bsky.social Read full text here ➡️ tinyurl.com/Genome-Res-2...

4 2 0 0

#GI2025 Chirag Jain presents "Pangenome-based genome inference using integer programming". Now published in Genome Research @genomeresearch.bsky.social Read the full text here ➡️ tinyurl.com/Genome-Res-2...

5 1 0 0

#GI2025 Mile Sikic @msikic.bsky.social presents "Geometric deep learning framework for de novo genome assembly". Now published in Genome Research @genomeresearch.bsky.social Full text here ➡️ tinyurl.com/Genome-Res-2...

10 4 0 0

And if you're not at Genome Informatics, of course you can have the conversation about these topics here.

#GI2025 (5/5)

1 0 0 0

If you're at Genome Informatics #GI2025 this week, be sure to stop by the 3 Galaxy posters to chat with Delphine, Mike and Anton about great new things in Galaxy. (1/5)

6 2 1 0

Thread on #GI2025 's second day! 👇🏻

11 5 0 0

Haonan Wu gives a talk on "A k-mer-based estimator of the substitution rate between repetitive sequences"
www.biorxiv.org/content/10.1...
This work addresses a limitation of Mash, which ignores repeats in the genome, and provides better distance estimates. #GI2025
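
One simple way to see how a set-based k-mer comparison can lose repeat information is sketched below. This is not the estimator from the talk and not Mash's implementation: it uses exact k-mer sets instead of MinHash sketches, and the helper names are made up for illustration. The distance formula D = -(1/k) ln(2J / (1 + J)) is the one given in the Mash paper.

```python
import math


def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct k-mers of seq; copy number is discarded."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two k-mer sets."""
    return len(a & b) / len(a | b)


def mash_distance(j: float, k: int) -> float:
    """Mash-style distance from a Jaccard value j and k-mer size k."""
    if j == 0:
        return math.inf  # the formula diverges when no k-mers are shared
    return -math.log(2 * j / (1 + j)) / k


k = 4
unit = "ACGTTGCA"
three_copies = unit * 3
five_copies = unit * 5

# Both sequences contain the same distinct k-mers, so a set-based
# comparison cannot tell 3 tandem copies from 5.
j = jaccard(kmer_set(three_copies, k), kmer_set(five_copies, k))
print(j, mash_distance(j, k))
```

The two sequences differ in length by 16 bases, yet their k-mer sets are identical, which is the kind of blind spot a repeat-aware estimator has to deal with.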

9 4 1 0

Harun Mustafa presents "Efficient, accurate, SRA-scale indexing and query" at #GI2025
MetaGraph is a highly compressed representation of all public biological sequences!
nature.com/articles/s41586-025-09603-w
Try it online: metagraph.ethz.ch/search

7 2 1 0

Abstract: Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ∼n that is indexed (or seeded) and a mutated substring of length ∼m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is O(mn^f(θ) log n), where f(θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than 1-o(sqrt(1/m)) fraction of the homologous bases is recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, that is, only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, f(θ) can be further reduced.

Second day of Genome Informatics #GI2025 began with the session "Genome Assembly and Sequence Algorithms". Yun William Yu presented "Average-case Analysis of Seed-Chain-Extend under Random Mutations", providing theoretical guarantees for the popular seed-chain-extend heuristic.
genome.cshlp.org/content/33/7/1175
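
For readers unfamiliar with seed-chain-extend, the seeding and chaining steps can be sketched as below. This is a toy version under stated assumptions, not the analysis from the paper or any aligner's implementation: seeds are exact k-mer matches, chaining simply maximizes the number of non-overlapping colinear anchors with a quadratic dynamic program (the abstract describes linear-gap-cost chaining, which is not modeled here), and the extend step that aligns the gaps between anchors is omitted. The names `find_seeds` and `chain` are hypothetical.

```python
from collections import defaultdict


def find_seeds(ref: str, query: str, k: int) -> list[tuple[int, int]]:
    """Seed: return sorted (ref_pos, query_pos) pairs of exact k-mer matches."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    anchors = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            anchors.append((i, j))
    return sorted(anchors)


def chain(anchors: list[tuple[int, int]], k: int) -> list[tuple[int, int]]:
    """Chain: pick the longest run of anchors that advance in both sequences
    without overlapping (a simplification of gap-cost chaining)."""
    if not anchors:
        return []
    best = [1] * len(anchors)   # best chain length ending at each anchor
    prev = [-1] * len(anchors)  # back-pointers for reconstruction
    for b, (rb, qb) in enumerate(anchors):
        for a in range(b):
            ra, qa = anchors[a]
            if ra + k <= rb and qa + k <= qb and best[a] + 1 > best[b]:
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(len(anchors)), key=best.__getitem__)
    path = []
    while end != -1:
        path.append(anchors[end])
        end = prev[end]
    return path[::-1]


ref = "AAACCCGGGTTT"
query = "CCCGAGTTT"  # ref[3:12] with one substitution
anchors = find_seeds(ref, query, k=3)
print(anchors)
print(chain(anchors, k=3))
```

In a full aligner, the regions between consecutive chained anchors would then be aligned by the extend step; here the substitution in `query` shows up as the gap between the two chained anchors.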

21 3 1 1

A thread on #GI2025 's first session 👇🏻

6 1 0 0

The session concluded with two talks:
"Adaptive short- to long-read alignment enables low-coverage de novo assembly at population scale" presented by Baris Ekim
and
"A long-read human pangenome initiative for comprehensive interpretation of nuclear-embedded mitochondrial DNA" by Lianting Fu #GI2025

2 0 0 0

Nicole Brown gave a fantastic talk on "Identifying introgressions across pangenomes with Panagram"

It uses k-mer conservation to annotate genomic variation across hundreds of genomes, followed by normalization of k-mer profiles to identify introgression events
github.com/kjenike/pana... #GI2025
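
The idea of k-mer conservation mentioned above can be illustrated with a rough sketch: for each k-mer along one genome, count how many genomes in the collection contain it. This is an assumption-laden toy, not Panagram's code or data structures; the function names are hypothetical, and the normalization step described in the post is not shown.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def conservation_profile(target: str, genomes: list[str], k: int) -> list[int]:
    """For each k-mer position in `target`, count the genomes containing that k-mer."""
    sets = [kmers(g, k) for g in genomes]
    return [
        sum(target[i:i + k] in s for s in sets)
        for i in range(len(target) - k + 1)
    ]


genomes = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
profile = conservation_profile(genomes[0], genomes, k=4)
print(profile)
```

Positions whose counts drop well below the number of genomes mark sequence that is not shared across the collection, which is the kind of signal a tool would then inspect further.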

9 5 1 0

Jacqueline Toussaint gave the talk on "Constructing pan-genome gene graphs with hundreds of thousands of bacterial genomes"
The pipeline reveals bacterial variations using PopPUNK to cluster isolates, followed by finding genes and graphing with ggCaller/Panaroo, on the great AllTheBacteria! #GI2025

9 2 1 0

The first session is PANGENOMES #GI2025. Alexander Schönhuth is delivering the first talk on "Generating synthetic genotypes using diffusion models"
Paper: academic.oup.com/bioinformati...
Code: github.com/TheMody/Gene...

2 1 1 1

@benlangmead.bsky.social kicking off the start of Genome Informatics! #gi2025 @cshlnews.bsky.social

26 3 1 0