Li Song (@mourisl) Bsky

A table showing the base statistics for four datasets: ATB/no dustbin, no unknown; ATB/Dustbin; ATB/Salmonella Enterica; and ATB/Escherichia Coli. For each dataset, the table provides the size in GB, the number of references, and the number of kmers. The ATB/no dustbin, no unknown dataset is the largest at 130.51 GB with over 1.8 million references, while the ATB/Dustbin dataset has the highest kmer count at over 55 billion.

[Construction] With an experimental pipeline building directly to m-Fulgor, we indexed 98% of the AllTheBacteria v0.2 dataset (all species excluding unknown and dustbin) in 4 days. The resulting index takes just 130 GB for 1.8M genomes.
Figure: basic stats for some indexes. (4/6)

2 days ago

QCatch: A framework for quality control assessment and analysis of single-cell sequencing data
Abstract. Motivation: Single-cell sequencing data analysis requires robust quality control (QC) to mitigate technical artifacts and ensure reliable downstrea...

QCatch is now published in Bioinformatics (academic.oup.com/bioinformati...)! Great work from Yuan and Dongze for quality control and analysis downstream of simpleaf/alevin-fry (taking advantage of its structured AnnData output). Give it a try: github.com/COMBINE-lab/...

5 days ago

GTDB - Genome Taxonomy Database
The Genome Taxonomy Database (GTDB) is an initiative to establish a standardised microbial taxonomy based on genome phylogeny.

GTDB release 11 based on RefSeq 232 (R11-RS232) is live at gtdb.ecogenomic.org. This release covers 901,341 genomes (23% increase) and has 199,923 species clusters (39% increase). Release notes at: forum.gtdb.ecogenomic.org/t/announcing.... Release statistics at: gtdb.ecogenomic.org/stats/r232.

6 days ago

Congratulations!!!!!!!!!!

10 months ago

Introns have to come from somewhere, right? @celineh2ooo.bsky.social and I looked at multiple genome alignments with 1000s of genomes and found 342 cases where humans (and our relatives) had gained a new intron. Still not sure where these come from, but it's a fascinating question

10 months ago

Neng Huang developed longcallR for joint SNP calling and phasing from long RNA-seq reads, AND for identifying allele-specific splicing/junctions (ASJ). Although ASJs of statistical significance are rare, a large fraction involve unannotated junctions. In Rust!

10 months ago

Industry friends, now is the time for MUCH more speaking out on behalf of academic colleagues under duress. Here are core open source methods that many of your products doubtlessly depend on either directly or indirectly (see en.wikipedia.org/wiki/HMMER) being abruptly defunded. Make noise.

10 months ago

myloasm - metagenomic assembly with (noisy) long reads

Announcing myloasm, a new long-read (ONT R10/PacBio) metagenome assembler that I've been working on during my postdoc in the Heng Li lab (@lh3lh3.bsky.social).

myloasm-docs.github.io

10 months ago

Partitioned Multi-MUM finding for scalable pangenomics
Pangenome collections are growing to hundreds of high-quality genomes. This necessitates scalable methods for constructing pangenome alignments that can incorporate newly-sequenced assemblies. We prev...

Excited to share a new update to Mumemto, scaling MUM and conserved element finding to any size pangenome! Preprint out now w/ @benlangmead.bsky.social.
Mumemto scales to the new HPRC v2 release and beyond, and can merge in future assemblies without any recomputation! 1/n

10 months ago

Centrifuger has updated its pre-built index list to include the exciting new GTDB release r226 for taxonomic classification of sequencing data: github.com/mourisl/cent.... There is also a gtdb+refseq human/virus/fungi/contaminants index, which will hopefully be useful for human microbiome studies.

10 months ago

Great 🧵 by Pierre on the Kaminari paper! In short, Kaminari is a simple and elegant, but highly effective index for approximate colored k-mer queries. The simplicity leads to very fast queries, with accuracy consistent with (or exceeding) best-in-class solutions; a very fun collaboration indeed!

10 months ago

Efficient evidence-based genome annotation with EviAnn
For many years, machine learning-based ab initio gene finding approaches have been the central components of eukaryotic genome annotation pipelines, and they remain so today. The reliance on these app...

Bioinformatics folks: check out our @biorxivpreprint on a new, very efficient and accurate system for automated genome annotation, EviAnn, led by my colleague Aleksey Zimin: www.biorxiv.org/content/10.1...

11 months ago

Congratulations!!!!

11 months ago

Inside UniProt: Rich Epitope Information Comes to UniProt
Mammalian immune responses are mediated by interactions between antigens and immune system compo...

Check out our latest collaboration with UniProt, which has integrated over 700,000 experimentally validated epitopes to enhance its protein entries with detailed immune response information. This data is accessible via the UniProt Feature Viewer and API! 💻🔬🧪 #collaboration #immunology #proteins

11 months ago

WABI 2025 WABI Conference on Algorithms in Bioinformatics

The deadline for WABI 2025 has been extended (but is still rapidly approaching) wabiconf.github.io/2025/

* abstract deadline: May 12 (AoE)
* paper deadline: May 15 (AoE)

Consider submitting your exciting algorithmic bioinformatics work to the WABI conference!

11 months ago

Thank you!

11 months ago

Forgot to run dustmasker on the genomes before creating a Centrifuger index, and indeed saw some misclassifications. It took a while to figure out; lesson learned. I need to implement a built-in masking step like Kraken2's in case I forget to do it in the future.

11 months ago
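
A minimal sketch of the kind of masking step the post above describes, assuming the genomes have already been soft-masked (low-complexity bases written in lowercase, as dustmasker does with `-outfmt fasta`). It replaces each lowercase base with `N` so that those regions contribute no informative sequence to an index. The function name `hard_mask_fasta` and the choice of `N` as the mask character are illustrative assumptions; this is not Centrifuger's or Kraken2's actual implementation.

```python
import io


def hard_mask_fasta(lines, mask_char="N"):
    """Yield FASTA lines with soft-masked (lowercase) bases replaced.

    Hypothetical helper for illustration only. Header lines (starting
    with '>') pass through unchanged; in sequence lines every lowercase
    character becomes mask_char.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            yield "".join(mask_char if c.islower() else c for c in line)


# Toy input: lowercase stretches stand in for dustmasker's soft-masking.
example = io.StringIO(">seq1 example\nACGTacacacacACGT\nttttGGCC\n")
masked = list(hard_mask_fasta(example))
print("\n".join(masked))
```

In practice the same function could be applied to an open FASTA file handle before the sequences are passed to the index builder.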

Parsing GTF and FASTA files using the eccLib Library
Summary: Leveraging the Python/C API, eccLib was developed as a high-performance library designed for parsing genomic files and analysing genomic contexts. To the best of the authors' knowledge, it…

Parsing GTF and FASTA files using the eccLib Library www.biorxiv.org/content/10.1... 🧬🖥️🧪 gitlab.platinum.edu.pl/eccdna/eccLib

11 months ago

GitHub - ArcInstitute/xsra: An efficient CLI to extract sequences from the SRA

Extracting @NCBI SRA files with fasterq-dump can require 17x the size of the accession while decompressing. Our new tool xsra extracts sequences at 5x throughput with significantly less disk usage, built-in compression, and optional BINSEQ outputs

github.com/arcInstitute...

11 months ago

AllTheBacteria

Small update from AllTheBacteria (allthebacteria.org). Assemblies can be bulk downloaded from OSF as before, or you can now get individual assemblies from AWS. We now also have a LexicMap index on AWS, so you can align your favourite gene against 2.4 million bacteria (next post for price estimates)

11 months ago

The Department of Human Genetics at the University of Utah is sponsoring the Rising Stars in Genetics and Genomics symposium!

- We are seeking nominations by June 1.
- September 18-19, 2025
- Please share with the star postdocs that you know.

docs.google.com/forms/d/e/1F...

11 months ago

The sequence analysis session of #RECOMB2025 is off to a great start with @jimshaw.bsky.social presenting devider, a new algorithm for haplotyping small sequences from long-read sequencing.

www.biorxiv.org/content/10.1...

11 months ago

If you want to check if a human gene has copy-number changes or lands in a complex region, try pangene.bioinweb.org. Recently updated with more and better assemblies.

11 months ago

Time to build a new index!!

11 months ago

Short RNA-seq read alignment with minimap2

Minimap2-2.29 released with support for short RNA-seq read alignment. More explanation and results here: lh3.github.io/2025/04/18/s...

1 year ago