Moreover, I think the data analogy is on point. "Research parasites" are on the whole fantastic, driving forward tons of science, even if on occasion there is noise. I bristle at the idea that certain groups start to dictate what is and is not acceptable. That's gatekeeping, not community building.
Posts by Lior Pachter
... mathematicians (e.g. Terence Tao), have focused on the (half) full portion of the glass. Individuals who previously had less knowledge in an area can now potentially contribute meaningfully in ways not possible before. The same is true for software. And that should be encouraged.
There is an interesting point here that I agree with: there is one difference between data and software that makes software more akin to math: rapidly generated "proofs" in math are noise for the community, and can be time consuming to check. However...
Sorry for the confusion. I linked to that issue because "tilting at windmills" suggests I'm wasting my time on a pointless matter, but the fact is that there is active debate on this matter.
My position is that the attempt of Longo and Drazen to police the reuse of published data was misguided.
I have no idea whether their code is slop or whether it's poorly attended. I haven't looked at it.
Maybe. Maybe not. github.com/henriksson-l...
(Data) research parasites nowadays have tons of exciting opportunities. I love this piece by @skinnider.bsky.social: academic.oup.com/gigascience/...
(Software) research parasites do as well!
Of course people can do problematic things with open source software. That's been the case for a long time, and certainly now the ability to be malicious is expanded with LLMs. However, trying to codify problematic scenarios is just an exercise in restricting freedom for others to benefit from OSS.
To paraphrase @iddux.bsky.social, "I propose to extend the research parasite award to those who used someone else's code to do some really cool sh*t". Their permission or collaboration is absolutely not required. researchparasite.com
"Setting out principles" just sounds like drafting a code of conduct to restrict the use of open source software in much the same way that Longo & Drazen advocated for restricting the use of published data in 2016: www.nejm.org/doi/full/10....
It's possible that more restrictive licensing can help but I doubt that will matter in practice (who will enforce academic licensing terms?)
So the specific versioning v10.0 would not work because the version number tracks my birthday (currently 0.52). But your point is well taken, and yes, that would be annoying but I think something like it is likely to happen (also I've seen much worse in academia).
If someone writes a paper pretending they discovered pseudoalignment without attributing the discovery to the kallisto authors, and then copies the kallisto software (algorithm, parameters, etc.) to showcase it but gives the software a new name... that's plagiarism and not ok.
I have no problem with either Callisto or rullisto. Of course rullisto is more valuable, but people are free to do poor work if they wish. kallisto is BSD-2 and the license allows for that.
BUT....
I brought nf-core into the conversation because it's yours, and it's MIT, and so nothing in rewrite.bio matters. Anyone can port or modify it, period. And that's not only ok, it should be encouraged. If you think "emulate exactly" matters, you should change the license.
rewrite.bio mentions licensing only in section 5.3 in the context of advocating for open source and compliance. At the same time the remaining four sections are a call for onerous requirements which are completely irrelevant under most licensing (e.g. section 2.2 "emulate exactly").
Behind every externally lauded ‘disruptor’, there seems to be a half dozen (minimum) actual subject matter experts shaking their heads “no” and grimacing dramatically. Good to keep in mind, for those of us prone to a bit of epistemic…moseying.
Within my lab, edgePython made XgenePy possible, which in turn is helping another lab incorporate the method in another tool: github.com/pachterlab/X...
Together these examples make clear to me that porting with LLMs is immediately useful.
Moreover, the idea (not software) described in the preprint is discussed in the dreampy paper (which provides an alternate approach): www.biorxiv.org/content/10.6...
4. edgePython has already been used for multiple other projects (single-cell, because it's in Python). For example, for Allos: www.biorxiv.org/content/10.6...
The relevance of single-cell is that I also ported large parts of NEBULA and combined them with edgeR in edgePython to create a new single-cell tool.
3. Regarding a pull request in the original language, that's an interesting thought but a lot of extra work. I'm also less comfortable in R than Python.
The preprint could have been a README, but a preprint has a DOI and is permanent, and that higher bar provides more confidence in the port.
2. A Python port of edgeR is not just a matter of language. The single-cell ecosystem is in Python, so there is immediate benefit. That's why I did it.
I ported edgeR to Python (along with parts of NEBULA) and wrote a preprint: www.biorxiv.org/content/10.6...
Some responses to your questions / comments:
1. I wrote the preprint because the port required demonstrating specific results (parity, runtime, results with a new method).
Thanks for the invitation. However, FYI this event has not made me angry. It has made me sad. I do think that engaging with it will make me angry. Since I'd rather be happy than angry, during the event I'm planning to spark productive conversations by working on research projects with my students.
Tl;dr: two preprints on contrastive PCA w/ Maria Carilli and Kayla Jackson solve the non-spatial setting, spatial contrastive PCA, and contrastive functional PCA. www.biorxiv.org/content/10.1...
www.biorxiv.org/content/10.6... 17/17
This project was tons of fun. It started with a journal club reviewing the @jameszou.bsky.social contrastive PCA paper last fall. Maria Carilli and Kayla Jackson have been incredible in figuring out all the details, including working out all the math (in the supplements). 16/
Thus, ρPCA, k-ρPCA, f-ρPCA are immediately useful across a wide variety of applications. Moreover, they can be combined (e.g. kernel weighted PCA on basis coefficients). The interpretability is especially useful in biology, but we expect also in other scientific domains. 14/
Functional PCA has a long history dating back to the 1990s. I think the contrastive version we introduce via the Rayleigh quotient is going to be very useful. We showcase the power on a bulk RNA-seq time series, finding relevant and interesting differentially variable genes. 13/
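For readers who want to see what a Rayleigh-quotient view of contrastive PCA can look like, here is a minimal sketch. It is not the ρPCA implementation from the preprints. It assumes the generic formulation of maximizing w^T A w / w^T B w, where A is the covariance of a target dataset and B the covariance of a background dataset. The function name `contrastive_directions`, the `ridge` parameter, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def contrastive_directions(target, background, n_components=2, ridge=1e-6):
    """Directions w maximizing the Rayleigh quotient (w^T A w) / (w^T B w).

    A and B are the sample covariances of `target` and `background`
    (rows = samples, columns = features). Illustrative sketch only.
    """
    A = np.cov(target, rowvar=False)
    B = np.cov(background, rowvar=False)
    # Small ridge (an assumption) keeps B positive definite.
    B = B + ridge * np.eye(B.shape[0])

    # Reduce the generalized problem A w = lambda B w to a symmetric one:
    # with B = L L^T and w = L^{-T} y, solve (L^{-1} A L^{-T}) y = lambda y.
    L = np.linalg.cholesky(B)
    Linv_A = np.linalg.solve(L, A)          # L^{-1} A
    M = np.linalg.solve(L, Linv_A.T)        # L^{-1} A L^{-T}
    M = (M + M.T) / 2                       # remove round-off asymmetry

    eigvals, Y = np.linalg.eigh(M)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = np.linalg.solve(L.T, Y[:, order])   # map back: w = L^{-T} y
    W = W / np.linalg.norm(W, axis=0)       # unit-length columns
    return eigvals[order], W, A, B

# Synthetic example: feature 0 has large variance in both datasets,
# feature 1 has extra variance only in the target.
rng = np.random.default_rng(0)
background = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 1.0])
target = rng.normal(size=(500, 3)) * np.array([5.0, 3.0, 1.0])

values, W, A, B = contrastive_directions(target, background)
top = W[:, 0]
quotient = (top @ A @ top) / (top @ B @ top)
```

In this toy setting ordinary PCA on the target would pick feature 0 (largest variance), while the top Rayleigh-quotient direction loads mostly on feature 1, the one whose variance is enriched relative to the background. The eigenvalue is the variance ratio along that direction, which is what makes the result easy to interpret.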