Gaëtan Benoit (@gaetanbenoit) Bsky

Turns out that the usual NtHash is not as random as one might think?!?! At least not for minimizers.

Seq-hash (and simd-minimizers) already has this fixed by default ;)

github.com/rust-seq/seq...

22 hours ago 8 4 0 0

Breaking ntHash (to better fix it)
NtHash is a popular method for hashing k-mers in bioinformatics, yet it has some surprising flaws. In this post, I walk through a few of them, and show that they can arise naturally, without an advers...

New blog post!

I use ntHash all the time to hash k-mers, yet it turns out it has some unexpected flaws (collision propagation, bias on leading zeros...). The good news: each of them can be fixed!

igor.martayan.org/posts/breaki...

1 day ago 22 11 0 2

Yes, with CheckM2. I think the Prevotella genome has two circular chromosomes, so here I think there are two different species. The two chromosomes have exactly the same coverage, but I don't know whether they have the same 4-mer profile.

1 day ago 0 0 0 0

Update: all the gray "circles" in the top left component are Prevotella chromosomes, so 4 complete chromosomes in total, but none of them were converted into complete MAGs.

1 day ago 1 0 1 0

I'll give you an update if I find the other chromosome in the graph 🕵️

1 day ago 2 0 1 0

I BLAST-searched it and it is actually a Prevotella, so it is one of its chromosomes. The issue is that this chromosome also ends up alone after binning.
Thus this organism might always be lost after the classical completeness/contamination filter. I don't know if it's a common issue, even with short reads?

1 day ago 0 0 1 0

I was investigating the genomes that I didn't manage to convert to near-complete MAGs in my assembly graph (the components in gray).

The circle in the top left is actually a complete genome, but with only 40% completeness (both in metaMDBG and myloasm)

1 day ago 9 3 2 0

Two more days to submit your abstract for a short talk or poster at RECOMB-Seq 2026. See instructions at recomb-seq.github.io/seq2026/call...

3 days ago 10 7 0 0

This is from 7 years ago (merenlab.org/2019/02/24/f...). We are talking about the same things today. We will be talking about the same things 7 years from now. There is no one to blame for this apart from ourselves. I find it very depressing.

5 days ago 22 9 1 1

Fast and accurate multiple-protein-sequence alignment at scale with FAMSA2 - Nature Biotechnology
FAMSA2 accurately aligns millions of protein sequences at high speed.

10 years after the first FAMSA paper, its successor is now published in Nat Biotech! We believe that FAMSA2 can enable analyses of large protein collections that were previously unattainable. Thank you, Andrzej and Cedric, for the great collaboration!
www.nature.com/articles/s41...

1 week ago 56 22 3 2

Accepted to ISMB'26. Revised paper is here: jermp.github.io/assets/pdf/p.... I'd like to thank @robp.bsky.social once again, as well as the reviewers for all their feedback. To me, ISMB has had the highest quality review process over the past few years!

1 week ago 12 6 1 0

GitHub - gbouras13/baktfold: Rapid & standardized genome annotation using protein structural information

Whenever I presented Phold, I was frequently asked "can you do the same beyond phages?" We ( @oschwengers.bsky.social @linsalrob.bsky.social @binomicalabs.org et al) finally did it with Baktfold github.com/gbouras13/ba... www.biorxiv.org/content/10.6...

2 weeks ago 56 22 1 2

Following up on this - MADRe is now officially published 🎉

Very grateful for the guidance of @msikic.bsky.social @rvicedomini.bsky.social and Kresimir Krizanovic

🔗 academic.oup.com/gigascience/...

2 weeks ago 9 6 1 1

Our work on 'hidden diversity' in unbinned contigs is now published in @natmicrobiol.nature.com :

www.nature.com/articles/s41...

See the linked threads for more details!

2 weeks ago 67 40 3 1

A run-length-compressed skiplist data structure for dynamic GBWTs supports time and space efficient pangenome operations over syncmers
doi.org/10.64898/202...

3 weeks ago 17 5 0 0

SNP calling, haplotype phasing and allele-specific analysis with long RNA-seq reads
Nature Methods - In this study, long-read RNA sequencing achieves accurate single-nucleotide polymorphism calling, haplotype phasing and allele-specific expression analysis.

LongcallR for competitive SNP calling and haplotype phasing, and simplified allele-specific analysis with long RNA-seq reads. Found ~100 junctions affected by SNPs per sample with most junctions novel.

Developed by Neng Huang. Published in @natmethods.nature.com. Read at rdcu.be/faKhL

3 weeks ago 43 18 0 0

A quick rant on people vibe-translating our Rust libraries to other languages

That's the second time in a week that I've seen new bioinformatics tools with a vibe-coded translation of our Rust libraries to C/C++.

I have two major issues with that:

3 weeks ago 33 10 2 4

Congrats!!

3 weeks ago 2 0 1 0

Myloasm, our long-read metagenome assembler, is now published! w/ @mgmarin.bsky.social and @lh3lh3.bsky.social

Very rewarding after > a year of development and countless hours thinking about assembly. Thanks to beta testers, Li lab, and reviewers who gave very helpful feedback.

rdcu.be/famFj

3 weeks ago 98 56 4 1

Preprint alert!

TLDR:
Super Bloom is a Bloom-filter variant for streaming k-mer queries.
It uses minimizers to group adjacent k-mers into super-k-mers and map them to the same memory block.
Result: much better locality, faster queries, and with the findere trick, dramatically fewer false positives.
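
To make the grouping idea concrete, here is a minimal Python sketch of a blocked Bloom filter in which every k-mer of a super-k-mer is sent to the block chosen by the shared minimizer. It is an illustration under stated assumptions, not the Super Bloom implementation: the names `h64`, `minimizer`, `super_kmers` and `BlockedBloom`, the BLAKE2b stand-in hash, and the block and hash-count parameters are all hypothetical. Canonical k-mers and the findere trick are left out.

```python
import hashlib


def h64(s: str) -> int:
    """Deterministic 64-bit hash of a string (stand-in for a real k-mer hash)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "little")


def minimizer(kmer: str, m: int) -> str:
    """Return the m-mer of `kmer` with the smallest hash (leftmost on ties)."""
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)), key=h64)


def super_kmers(seq: str, k: int, m: int):
    """Yield (minimizer, start, end) for each maximal run of consecutive
    k-mers that share a minimizer; seq[start:end] is the super-k-mer."""
    start, current = 0, None
    for i in range(len(seq) - k + 1):
        mz = minimizer(seq[i:i + k], m)
        if mz != current:
            if current is not None:
                yield current, start, i - 1 + k  # previous run ends at k-mer i-1
            start, current = i, mz
    if current is not None:
        yield current, start, len(seq)


class BlockedBloom:
    """Toy blocked Bloom filter: the minimizer picks the block,
    the k-mer picks the bits inside that block."""

    def __init__(self, num_blocks: int = 1024, block_bits: int = 512, num_hashes: int = 3):
        self.num_blocks = num_blocks
        self.block_bits = block_bits
        self.num_hashes = num_hashes
        self.blocks = [0] * num_blocks  # each block is an int used as a bitset

    def _block(self, mz: str) -> int:
        return h64(mz) % self.num_blocks

    def _bits(self, kmer: str):
        return [h64(f"{j}:{kmer}") % self.block_bits for j in range(self.num_hashes)]

    def insert_seq(self, seq: str, k: int, m: int) -> None:
        for mz, start, end in super_kmers(seq, k, m):
            b = self._block(mz)  # one block lookup per super-k-mer
            for i in range(start, end - k + 1):
                for bit in self._bits(seq[i:i + k]):
                    self.blocks[b] |= 1 << bit

    def contains(self, kmer: str, m: int) -> bool:
        block = self.blocks[self._block(minimizer(kmer, m))]
        return all((block >> bit) & 1 for bit in self._bits(kmer))
```

In this sketch, a streaming query over a read would touch one block per super-k-mer rather than one per k-mer, which is where the locality gain described above would come from.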

3 weeks ago 20 8 2 0

How much protein diversity can Life on Earth actually generate?

With DIAMOND DeepClust, we show how billions of proteins across the tree of life can be clustered at low-identity for downstream analytics tasks.

📚Paper: www.nature.com/articles/s41...
💻Code: github.com/bbuchfink/di...

4 weeks ago 64 29 1 0

_720 Gbp_ marine nanopore metagenome -> 328 circular prokaryotic contigs: using myloasm!

Insane work by Lui and Nielsen. Also shows how modern long read assemblies can disentangle coexisting strains and reveal ecological insights.

4 weeks ago 47 13 2 0

Recently amplified gene arrays are a super interesting phenomenon, but many still resist our attempts to assemble them. @dantipov.bsky.social has developed a new method (Trivial Tangle Traverser) that resolves assembly graph tangles caused by such sequences (1/4) www.biorxiv.org/content/10.6...

4 weeks ago 27 11 1 0

Long reads carry multiple small vars and SVs and their phasing. LongcallD is the only caller that tightly integrates germline/mosaic small/structural vars/MEIs and their phasing in a single C program. One command line to get competitive small variant calls and better SVs. Led by Yan Gao.

4 weeks ago 45 21 0 1

Super Bloom: Fast and precise filter for streaming k-mer queries www.biorxiv.org/content/10.64898/2026.03...

1 month ago 22 13 0 1

It's a good day when the first item in your feed is your own work :)

@rickbitloo.bsky.social was annoyed that scanning reads for all 96 rapid kit barcodes is a bottleneck in Barbell, so he made Sassy2: 13x (150bp) to 4.6x (8kbp) faster than v1 by batch-searching patterns, and >100Gbp/s on 16 threads!

1 month ago 24 11 2 0

Water mass specific genes dominate the Southern Ocean microbiome
Nature Communications - Southern Ocean microbial communities are less well studied. Here, the authors generate a circumpolar-scale gene catalog from 218 metagenomics samples revealing broadscale...

Very happy to share the latest paper of our group (et al.)! rdcu.be/e7zyX . This one has a special place… 1/n

1 month ago 21 7 1 1

Multi-context seeds enable fast and high-accuracy read mapping - Genome Biology
A key step in sequence similarity search is to identify shared seeds between a query and a reference sequence. A well-known tradeoff is that longer seeds offer fast searches but reduce sensitivity in ...

1/ Our paper on Multi-Context Seeds is now out, with @tolyan.bsky.social spearheading the work and contributions from Nicolas and @marcelm.net. We introduce a new seeding concept that improves read alignment accuracy while maintaining speed.
link.springer.com/article/10.1...

1 month ago 19 12 1 0

Release SeqKit v2.13.0 (10-year-old birthday version) · shenwei356/seqkit
Changelog: SeqKit is 10 years old! SeqKit v2.13.0 - 2026-02-28. seqkit: add support for reading and writing LZ4 compression format. new command: seqkit sample2: improved seqkit sample by @stahiga....

Can't wait to release a 10-year-old birthday version for SeqKit!

- 10 years
- 2 papers, 3500 citations
- 20 contributors
- 40 subcommands
- 880 commits
- 500 issues
- 685.5K Bioconda total downloads

Thank you all, dear contributors and users!
I'll keep maintaining it.

github.com/shenwei356/s...

1 month ago 125 35 6 1

How would you design a *multithreaded*, *concurrent* & *dynamic* hash table if you are focused specifically on common k-mer workloads, where streaming query & insertion are common? Jamshed, Prashant and I explore this in kache-hash, a cache-friendly k-mer hash table!
www.biorxiv.org/content/10.6...

2 months ago 20 13 0 0
