Turns out that the usual NtHash is not as random as one might think?!?! At least not for minimizers.
Seq-hash (and simd-minimizers) already has this fixed by default ;)
github.com/rust-seq/seq...
Posts by Gaëtan Benoit
New blog post!
I use ntHash all the time to hash k-mers, yet it turns out it has some unexpected flaws (collision propagation, bias on leading zeros...). The good news: each of them can be fixed!
igor.martayan.org/posts/breaki...
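For readers who have not met the construction, here is a minimal sketch of an ntHash-style rolling hash (forward strand only). It is not the reference implementation: the `SEED` constants are arbitrary placeholders rather than ntHash's published seeds, and `direct`, `rolling` and `rol` are names made up for this illustration. The point is only to show that the hash is an XOR of rotated per-base seeds, so each new k-mer can be hashed in constant time.

```python
MASK = (1 << 64) - 1

# Placeholder 64-bit seeds, one per base (assumption: NOT ntHash's real constants).
SEED = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xBF58476D1CE4E5B9,
    "G": 0x94D049BB133111EB,
    "T": 0xD6E8FEB86659FD93,
}


def rol(x: int, r: int) -> int:
    """Rotate a 64-bit value left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def direct(kmer: str) -> int:
    """Hash a k-mer from scratch: XOR of per-base seeds, rotated by position."""
    k = len(kmer)
    h = 0
    for i, base in enumerate(kmer):
        h ^= rol(SEED[base], k - 1 - i)
    return h


def rolling(seq: str, k: int):
    """Yield the hash of every k-mer of seq, updating in O(1) per step."""
    if len(seq) < k:
        return
    h = direct(seq[:k])
    yield h
    for i in range(k, len(seq)):
        # Rotate the old hash, cancel the outgoing base, add the incoming base.
        h = rol(h, 1) ^ rol(SEED[seq[i - k]], k) ^ SEED[seq[i]]
        yield h
```

Because every step is a rotation or an XOR, the rolling update gives exactly the same value as hashing each k-mer from scratch.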
Yes, with CheckM2. I think the Prevotella genome has two circular chromosomes, so here I think there are two different species. The two chromosomes have the exact same coverage, but I don't know if they have the same 4-mer profile.
Update: all the gray "circles" in the top left component are Prevotella chromosomes, so 4 complete chromosomes in total, but none of them converted into complete MAGs.
I'll give you an update if I find the other chromosome in the graph 🕵️
I BLAST-searched it and it is actually a Prevotella, so one of its chromosomes. The issue is that this chromosome also ends up alone after binning.
Thus this organism might always be lost after the classical completeness/contamination filter. I don't know if it's a common issue even with short reads?
I was investigating the genomes that I didn't manage to convert to near-complete MAGs in my assembly graph (the components in gray).
The circle on top left is actually a complete genome but with 40% completeness (both in metaMDBG and myloasm)
Two more days to submit your abstract for a short talk or poster at RECOMB-Seq 2026. See instructions at recomb-seq.github.io/seq2026/call...
This is from 7 years ago (merenlab.org/2019/02/24/f...). We are talking about the same things today. We will be talking about the same things 7 years from now. There is no one to blame for this apart from ourselves. I find it very depressing.
10 years after the first FAMSA paper, its successor is now published in Nat Biotech! We believe that FAMSA2 can enable analyses of large protein collections that were previously unattainable. Thank you, Andrzej and Cedric, for great collaboration
www.nature.com/articles/s41...
Accepted to ISMB'26. Revised paper is here: jermp.github.io/assets/pdf/p.... I'd like to thank @robp.bsky.social once again, and the reviewers for all the feedback received. To me, ISMB has had the highest quality review process over the past few years!
Whenever I presented Phold, I was frequently asked "can you do the same beyond phages?" We ( @oschwengers.bsky.social @linsalrob.bsky.social @binomicalabs.org et al) finally did it with Baktfold github.com/gbouras13/ba... www.biorxiv.org/content/10.6...
Following up on this - MADRe is now officially published 🎉
Very grateful for the guidance of @msikic.bsky.social @rvicedomini.bsky.social and Kresimir Krizanovic
🔗 academic.oup.com/gigascience/...
Our work on 'hidden diversity' in unbinned contigs is now published in @natmicrobiol.nature.com :
www.nature.com/articles/s41...
See the linked threads for more details!
A run-length-compressed skiplist data structure for dynamic GBWTs supports time and space efficient pangenome operations over syncmers
doi.org/10.64898/202...
LongcallR for competitive SNP calling and haplotype phasing, and simplified allele-specific analysis with long RNA-seq reads. Found ~100 junctions affected by SNPs per sample with most junctions novel.
Developed by Neng Huang. Published in @natmethods.nature.com. Read at rdcu.be/faKhL
A quick rant on people vibe-translating our Rust libraries to other languages
That's the second time in a week that I see new bioinformatics tools with a vibe-coded translation of our Rust libraries to C/C++.
I have two major issues with that:
Congrats!!
Myloasm, our long-read metagenome assembler, is now published! w/ @mgmarin.bsky.social and @lh3lh3.bsky.social
Very rewarding after > a year of development and countless hours thinking about assembly. Thanks to beta testers, Li lab, and reviewers who gave very helpful feedback.
rdcu.be/famFj
Preprint alert!
TLDR:
Super Bloom is a Bloom-filter variant for streaming k-mer queries.
It uses minimizers to group adjacent k-mers into super-k-mers and map them to the same memory block.
Result: much better locality, faster queries, and with the findere trick, dramatically fewer false positives.
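To make the idea concrete, here is a minimal sketch of a Bloom filter whose memory block is chosen by the k-mer's minimizer. It is not the Super Bloom implementation: the class `MinimizerBlockedBloom`, the helpers `h64` and `minimizer`, and all parameter defaults are hypothetical, the hash is a slow stand-in (BLAKE2b), and canonical k-mers and the findere trick are left out.

```python
import hashlib


def h64(s: str, seed: int = 0) -> int:
    """Deterministic 64-bit hash of a string (stand-in for a fast k-mer hash)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minimizer(kmer: str, m: int) -> str:
    """Return the m-mer of kmer with the smallest hash (leftmost on ties)."""
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)), key=h64)


class MinimizerBlockedBloom:
    """Blocked Bloom filter: the minimizer picks the block, the k-mer picks the bits."""

    def __init__(self, n_blocks=1024, block_bits=512, n_hashes=4, k=31, m=15):
        self.n_blocks, self.block_bits = n_blocks, block_bits
        self.n_hashes, self.k, self.m = n_hashes, k, m
        self.blocks = [0] * n_blocks  # each block is an int used as a bitset

    def _positions(self, kmer: str):
        # K-mers of the same super-k-mer share a minimizer, hence a block.
        block = h64(minimizer(kmer, self.m), seed=1) % self.n_blocks
        bits = [h64(kmer, seed=2 + i) % self.block_bits for i in range(self.n_hashes)]
        return block, bits

    def insert(self, kmer: str) -> None:
        block, bits = self._positions(kmer)
        for b in bits:
            self.blocks[block] |= 1 << b

    def contains(self, kmer: str) -> bool:
        block, bits = self._positions(kmer)
        return all((self.blocks[block] >> b) & 1 for b in bits)

    def insert_seq(self, seq: str) -> None:
        for i in range(len(seq) - self.k + 1):
            self.insert(seq[i:i + self.k])
```

When streaming over a read, consecutive k-mers usually keep the same minimizer, so successive queries touch the same block instead of jumping to random memory locations.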
How much protein diversity can Life on Earth actually generate?
With DIAMOND DeepClust, we show how billions of proteins across the tree of life can be clustered at low-identity for downstream analytics tasks.
📚Paper: www.nature.com/articles/s41...
💻Code: github.com/bbuchfink/di...
_720 Gbp_ marine nanopore metagenome -> 328 circular prokaryotic contigs: using myloasm!
Insane work by Lui and Nielsen. Also shows how modern long read assemblies can disentangle coexisting strains and reveal ecological insights.
Recently amplified gene arrays are a super interesting phenomenon, but many still resist our attempts to assemble them. @dantipov.bsky.social has developed a new method (Trivial Tangle Traverser) that resolves assembly graph tangles caused by such sequences (1/4) www.biorxiv.org/content/10.6...
Long reads carry multiple small vars and SVs and their phasing. LongcallD is the only caller that tightly integrates germline/mosaic small/structural vars/MEIs and their phasing in a single C program. One command line to get competitive small variant calls and better SVs. Led by Yan Gao.
Super Bloom: Fast and precise filter for streaming k-mer queries www.biorxiv.org/content/10.64898/2026.03...
It's a good day when the first item in your feed is your own work :)
@rickbitloo.bsky.social was annoyed that scanning reads for all 96 rapid kit barcodes is a bottleneck in Barbell, so he made Sassy2: 13x (150bp) to 4.6x (8kbp) faster than v1 by batch-searching patterns, and >100Gbp/s on 16 threads!
Very happy to share the latest paper of our group (et al.)! rdcu.be/e7zyX . This one has a special place… 1/n
1/ Our paper on Multi-Context Seeds is now out, with @tolyan.bsky.social spearheading the work and contributions from Nicolas and @marcelm.net. We introduce a new seeding concept that improves read alignment accuracy while maintaining speed.
link.springer.com/article/10.1...
Can't wait to release a 10th-birthday version of SeqKit!
- 10 years
- 2 papers, 3500 citations
- 20 contributors
- 40 subcommands
- 880 commits
- 500 issues
- 685.5K Bioconda total downloads
Thank you all, dear contributors and users!
I'll keep maintaining it.
github.com/shenwei356/s...
How would you design a *multithreaded*, *concurrent* & *dynamic* hash table if you are focused specifically on common k-mer workloads, where streaming query & insertion are common? Jamshed, Prashant and I explore this in kache-hash, a cache-friendly k-mer hash table!
www.biorxiv.org/content/10.6...