v4.0.0 of rasusa is out now in all major retailers
github.com/mbhall88/ras...
Big new feature is support for unaligned SAM/BAM/CRAM as input (and output) to the reads subcommand.
You can even pass in BAM and ask for FASTQ out the other side if you're into that kind of thing.
Posts by Michael Hall
For all ye #snakemake users out there who want consistent, opinionated formatting of their workflows: we have just released v1.0.0 of snakefmt 🥳
The major update is that it now sorts rule directives (e.g., input, output, resources, params, shell etc.).
See github.com/snakemake/sn... for more
There are a lot more interesting things we looked at:
- We find some credible transmission links in India
- We also see some kind of (likely) reduction IS-mediated funny business going on in the linear plasmid.
Check out the preprint for all the details.
We dubbed this transposon Tn8026. We did some global screening and found Tn8026 in a variety of countries, with the earliest evidence being Norway in 2012. We also found it in 2 E. gallinarum isolates from S. Korea. PLUS it was also in the chromosome of one of our isolates!
To add to the intrigue, the linezolid resistance mechanism, a gene called poxtA-Ef, was located on this linear plasmid, along with Tn1546, which carries the vancomycin resistance gene cluster.
Upon further inspection, we realised poxtA-Ef was in what turns out to be an uncharacterised transposon
Turns out most of these isolates had a LINEAR plasmid. Really showing my inexperience here as I did not know that was a thing.
After doing some more reading I found that Jia Beh from the Doherty in Melbourne, Aus. had also found a linear plasmid in some LREs (as had a couple of others globally)
The dataset was linezolid-resistant Enterococcus (LRE), which are very concerning pathogens that are resistant to nearly everything.
We sequenced all these on ONT and I started by making assemblies. First shout out to @rrwick.bsky.social for the beautiful piece of software that is Autocycler!
Until joining @loolibear.bsky.social's lab in July, I embarrassingly hadn't had much experience with plasmids.
So when I started, Leah said "here you go, have a look at this dataset".
What a fun ride this has been.
Preprint out today and thread below
www.medrxiv.org/content/10.6...
Does anyone else think they are seeing post-acceptance editorial changes at proof stage which are error-prone and probably due to adoption of AI?
So nohuman now ships an unmasked HPRC.r2 DB by default, with optional dataset selection.
If you've used nohuman before, I highly recommend updating to v0.5.0 and re-downloading the new DB.
Repo: github.com/mbhall88/nohuman
Keep your metagenomes clean 🧹🧬
At the same time, I realised the Human Pangenome Reference Consortium had made a second release of genomes.
So I rebuilt release 1 without masking, and added a release 2 database with no masking. The improvement in detection accuracy was substantial:
🚨 Update to nohuman 🚨
While testing against the standard Kraken DB, I noticed Kraken was detecting far more human reads than nohuman. I realised Kraken masks low-complexity regions by default during DB construction and that setting had been left on in nohuman, leading to missing human reads.
Stars are level of p value (description is in the figure caption in the paper)
True.
Thanks for the great questions and discussion
Correct. Yeah, I guess mash on a random subset should perform similarly. Haven't looked at that though.
It's a decent sample size at 3000. But I guess more would always be better. I wanted to use RefSeq genomes that have long-read data to be as sure as possible about the true size
There are likely inherent biases though, based on error rates in reads for the k-mer-based methods
- Overlaps are pairwise alignment with minimap2 (FFI)
- Thanks!
- See other thread where I have answered this
I just used mash v2.3. The supplement has an exploration of the best parameters to use for mash to estimate genome size. Mash was the fastest tool though.
Thanks for appreciating the plots. I obsessed a lot over them. I created a repo for the colour palette too if you're interested in that github.com/mbhall88/cud
The bars are pairwise statistical comparisons. I only show the significant ones so as not to overclutter the plot
And lastly, a HUGE thank you to @lachlanjmc.bsky.social for a lot of the methodological heavy lifting when we were coming up with the idea
You might remember the preprint from late last year... Reviews/Publication were delayed while I was on parental leave. We extended validation to include H. sapiens, which led to smarter handling of contained overlaps in repetitive genomes. Big shout-out to Chenxi Zhou for leading that part
However, the computational resource usage (runtime/memory) of LRGE was MUCH better than assembling
We benchmarked >3,000 bacterial genomes and found that LRGE (our method) achieves significantly better accuracy than k-mer-based methods like Mash and GenomeScope and performs on par with full genome assembly (Raven)
Our method for genome size estimation from long-read overlaps is now published 🥳
academic.oup.com/bioinformati...
New from @dgpratas.bsky.social et al. for analyzing multiple sequences in multi-FASTA format using alignment-free methodologies. Scalable to millions of sequences for pandemic research and more
AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data doi.org/10.1093/giga...
“Clarivate’s decision rewards journals for continuing the unhelpful practice of keeping peer review information hidden and unintentionally presenting incomplete and inadequate studies as sound science and punishes those journals that are more transparent.”
www.coalition-s.org/blog/how-the...
The DOI URL doesn't seem to be working for the preprint currently. You can find it here: www.biorxiv.org/content/10.1...