Did you miss the registration deadline for the Boston #NextflowSummit? Fear not - all talks will be live-streamed online!
The hybrid event will use the same platform as the October 2025 event: you can watch live and ask the speakers questions. Hit the link below to register!
Posts by Phil Ewels
Heng Li's blog posts are always thoughtful and well written, and this is no exception.
Great to see folks coming to similar conclusions around #Rust #rewrites in bioinformatics.
Also appreciate the RustQC / rewrites.bio cite!
Your Bluesky bio says Blology; I quite like that! Like the computational analysis of glass blowing, or something..
Coming next month: besteditor.bio - a manifesto on how to configure vim key bindings for optimal bioinformatics analysis..
I don't think I advocated ignoring any licenses anywhere? Rather the opposite: "Check licenses before you start", meaning check that they are compatible. I can look into rewriting this section if my intended meaning did not come across well.
The rewrite is not the nf-core pipeline, it's a separate project. The pipeline just calls a binary. The pipeline code is and always was MIT. I released the rewritten tool code as GPL-3, matching the strictest (and broadly the consensus) license among the upstream software tools.
github.com/seqeralabs/R...
haha, happy to be of service!
Yes, I've found the same - and tests are crucial in this context. I was aiming for an @nf-co.re / #rnaseq pipeline drop-in replacement, so I ended up using the test suite we have there. It has output snapshots etc., so it's specific and easy to iterate with. Can always have more edge cases though..
The nf-core Hackathon in Boston is happening in just over 2 weeks! Join us April 28–29 for two days of collaborative hacking. Register now: hubs.la/Q04brb4c0
Whether you want to add new features to existing pipelines, work on tooling, or tackle community initiatives – there's something for you.
This is one of several things that this project taught me: being strict and comprehensive on the testing, as early as possible, is absolutely key (and avoids a lot of pain later). I think this is much of what differentiates a high-quality rewrite from a poor one, and is also key for trust + adoption.
Yeah I agree, this has always been the case but it's especially important now. I started with these 2 files and extended to a pipeline (github.com/seqeralabs/R...) and later the nf-core/rnaseq pipeline tests. There are a fair number of unit tests as well. Definitely could be improved though.
This conclusion is what drove me to put rewrites.bio together. A desire to get ahead of this rapid change and educate / demonstrate best practices and define how we want people to work with these new tools.
I've seen the same as well. This isn't a problem with LLMs though (especially as their quality improves), it's a problem with the people using them. That's what I would like to help improve. Not just cover our eyes and hope that people won't use LLMs.
Yup, documentation of the validation is essential for user trust. I tried to be cautious with my language, differences in RustQC are typically at the 14th decimal place or similar. It's all detailed on the docs pages: seqeralabs.github.io/RustQC/rna/d...
It's also encoded in the CI snapshots.
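For anyone wondering what "differences at the 14th decimal place" validation can look like in practice, here's a minimal sketch using relative tolerance comparison. The metric names and values are hypothetical illustrations, not RustQC's actual output format:

```python
import math

def outputs_match(original: dict, rewrite: dict, rel_tol: float = 1e-12) -> bool:
    """Compare per-metric outputs from the original tool and the rewrite,
    allowing tiny floating-point differences (~14th decimal place)."""
    if original.keys() != rewrite.keys():
        return False
    for key in original:
        a, b = original[key], rewrite[key]
        if isinstance(a, float):
            # Tolerate last-few-digits float noise, nothing more
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-15):
                return False
        elif a != b:
            # Non-float values (counts, strings) must match exactly
            return False
    return True

# Hypothetical example: identical except at the 14th decimal place
legacy  = {"gc_percent": 47.12345678901234, "total_reads": 1_000_000}
rewrite = {"gc_percent": 47.12345678901236, "total_reads": 1_000_000}
print(outputs_match(legacy, rewrite))  # True
```

The key design choice is that only floats get a tolerance; integer counts and text must be byte-identical, which keeps "functionally identical" from quietly drifting into "roughly similar".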
I'm not sure why use of LLMs indicates such an expectation. In fact I could imagine a future where the inverse is true - more people are empowered to help with maintenance. There are problems with this (slop, review burden etc). But the motivations for better results and software remain the same.
This approach won't work for everyone. Some will want new / different functionality which breaks the model. Some won't validate thoroughly enough. There are many risks. But there are also a lot of practical benefits if it's done well.
For RustQC (in nf-core/rnaseq) we will use continuous integration snapshot tests to keep confidence that outputs remain identical (or functionally identical at least) to the outputs created by the original tools. Just 60x faster.
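A CI snapshot check along these lines can be as simple as checksumming each output file and comparing against stored values. This is a generic sketch of the idea, not the actual nf-core/nf-test implementation, and the file names are made up:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Checksum one output file so byte-identical outputs can be asserted."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_against_snapshot(outputs: dict, snapshot_file: Path) -> list:
    """Return names of outputs whose checksum drifted from the stored snapshot."""
    snapshot = json.loads(snapshot_file.read_text())
    return [name for name, path in outputs.items()
            if snapshot.get(name) != sha256_of(path)]

# First run records the snapshot; later CI runs fail if anything drifts.
out = Path("fastqc_data.txt")  # hypothetical output file
out.write_text("Total Sequences\t1000000\n")
snap = Path("snapshot.json")
snap.write_text(json.dumps({"fastqc_data": sha256_of(out)}))
print(check_against_snapshot({"fastqc_data": out}, snap))  # []
```

Exact checksums only work for byte-identical outputs; files with acceptable float-level differences would instead need a tolerance-aware comparison before snapshotting.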
The approach described there (and for RustQC) is to use large-scale, precise validation against the upstream tools and to keep the rewrites "hot swappable". The upstream tools are unaffected and maintenance efforts continue there; the rewrite is for performance purposes only and tracks upstream changes.
This is the uncomfortable truth I'm keen to address. I would like to accelerate us towards some best practices before we fall into this trap. That was my hope for rewrites.bio - to start a discussion, basically.
Very very important initiative, check it out #bioinformaticians!
AI coding assistants just passed a threshold: domain experts can now rewrite established scientific software in days.
This wave of rewrites is coming for #bioinformatics. The question isn't whether to embrace this capability, but how to do it responsibly. hubs.la/Q04b85ww0
This approach makes tests / benchmarks essential, and they effectively *have* to be enough. This is where trusting the upstream tool comes in. It's very easy to do huge numbers of comparisons to that, so if we understand that code, then the rewrite can inherit that trust.
It's of course better if you read and understand the code. My argument here is that it's no longer a requirement with LLMs. You can discuss the code with an LLM, ask questions, make changes - all without knowing the syntax. Choose a language based on its features, not just your experience.
In the end I decided that this was a case-by-case scenario and I should leave it as a light-touch in rewrites.bio. But I agree that it's one of, if not *the* most contentious issue about LLM rewrites. And I don't have a good answer myself as to how we should deal with it.
Also worth noting that most tool maintainers probably don't want a PR that deletes all their code and has 100k lines of new code in a language that they don't know. Contributions should be helpful, not burdensome (as far as is possible).
If multiple tools are emulated in one package, how does one contribute that back upstream? And if not by code directly, it's probably of limited use, as the goal is exact replication - not any new insight or features.
If it's a 1:1 tool rewrite then I totally agree that every effort should be made.
In the first draft of the site I did have a point on this, but I felt a little unsure about it so I scaled it back to just bug reports (point 5.4). It can be complex if a rewrite incorporates multiple tools (as suggested in 3.1 and done in RustQC).
I totally agree - the very first point on rewrites.bio is about academic integrity: "Credit the original authors".
I wasn't suggesting that rewrites should be published. People should cite the underlying tools, e.g. seqeralabs.github.io/RustQC/about...
This concept has come up in several different comment threads. It's not really something I thought about during the project, but I think it's really interesting!