🎉 That’s a wrap! The International CRG-BI Postdoc Symposium has come to an end after three incredible days of inspiring talks, engaging discussions, and valuable connections. I have to say I couldn’t be happier! #CRGBIpostdocs
A 🧵 below 👇🏼
Posts by Mathys Grapotte
Super impressive work by @angelp.bsky.social and colleagues at AWS / ARM on porting #Bioconda packages to #arm64 - there's potential for some big savings on compute as we scale this up!
Check out the @nf-co.re #arm64 channel for more and to get involved in the effort...
The funny thing is that if you type TRUE + !TRUE you get the proper answer (1).
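(The R expression above relies on logical values coercing to numbers under arithmetic; Python's booleans behave the same way — a minimal check:)

```python
# Booleans coerce to integers under arithmetic:
# the negation flips 1 -> 0, so the sum is exactly 1.
result = True + (not True)  # 1 + 0
print(result)  # → 1
```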
Here is the talk if you are curious:
Wow, honored to get a shoutout on the podcast! Speaking at the Nextflow Summit was a great experience, and I encourage you to apply for a talk at the next iterations.
Thanks, Ian!
So no stable release yet (still in dev). We are trying to make this extremely easy to contribute to and add things to: we have an active Slack channel, open dev hours every Wednesday from 2 pm to 6 pm CET, and we participate in all nf-core hackathons. It's a big community effort and super easy to join in!
Happy to discuss this further :) maybe a Slack/Discord channel, a Zoom call, etc.
(Also, we have been trying to build such a framework for a while at github.com/nf-core/deep... — the pipeline is currently changing a lot because we are in the process of porting our code to nf-core.)
I think @nextflow.io and @nf-co.re are the best place to build this, because they have all the qualities (open source, large existing community — 8k developers on Slack — performant, easy to use, etc.) and already have all the bio software we need to process raw data (mappers, aligners, etc.).
If the framework is performant, easy to use, flexible, easy to contribute to, easy to understand, good looking, etc., I bet proper guidelines will come naturally (and will vary based on use cases, as has always been the case in software).
Such a framework shouldn't impose guidelines on users; it should only provide a convenient way to run all kinds of tests on a research prototype (which is different from the models folks use in clinical applications).
So I think the best solution will come in the form of a robust test framework (analogous to pytest in Python, for instance), which could do both unit tests (theory tests on architecture + training "soundness") and integration tests (downstream tests).
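A minimal sketch of what those two test tiers could look like in pytest style — `build_model` and `train_step` are hypothetical stand-ins, not real framework entry points:

```python
# Sketch of the two test tiers, assuming pytest-style conventions.
# `build_model` / `train_step` are toy stand-ins for illustration.
import math

def build_model():
    # toy "model": a single weight, trained by gradient descent on (w - 3)^2
    return {"w": 0.0}

def train_step(model, lr=0.1):
    grad = 2 * (model["w"] - 3.0)
    model["w"] -= lr * grad
    return (model["w"] - 3.0) ** 2  # loss after the step

def test_architecture_soundness():
    # unit test: the model builds and all parameters are finite
    model = build_model()
    assert all(math.isfinite(v) for v in model.values())

def test_training_soundness():
    # unit test: loss decreases over a few steps ("soundness" of training)
    model = build_model()
    losses = [train_step(model) for _ in range(5)]
    assert losses[-1] < losses[0]
```

Downstream (integration) tests would sit alongside these, running the trained model end-to-end on held-out biological tasks.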
I also disagree with these takes. Seeing Meta's talk at the Ray Summit, for instance, I think those companies have robust stat eval pipelines in place (one does not press a $100m button without knowing what will come out of it).
However (probably an unpopular opinion), I think that a paper is not a good enough vessel for those "guidelines". I agree with the OP's point here, even though I think the guidelines of that paper are quite superficial.
The Principles of Deep Learning Theory book is also a gold mine of tests. Generally, I think NTK is a good probabilistic framework for designing evaluations of DL models.
Or here: cosine similarity between sorted singular vectors detecting structural shift after fine-tuning.
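A rough sketch of that check — the weight matrices here are synthetic stand-ins, and the "fine-tuned" matrix is just the original plus noise:

```python
# Sketch: compare singular vectors of a layer's weight matrix before and
# after fine-tuning; low cosine similarity on leading vectors suggests a
# structural shift. Matrices below are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
W_before = rng.standard_normal((64, 32))
W_after = W_before + 0.8 * rng.standard_normal((64, 32))  # "fine-tuned"

# np.linalg.svd returns singular vectors already sorted by singular value
U0, _, _ = np.linalg.svd(W_before, full_matrices=False)
U1, _, _ = np.linalg.svd(W_after, full_matrices=False)

# Cosine similarity between same-rank singular vectors; abs() because
# singular vectors are only defined up to sign.
cos = np.abs(np.sum(U0 * U1, axis=0))
print(cos[:5])  # leading vectors drift away from 1.0 as structure shifts
```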
Thankfully, with theory making progress, there are many things we can test prior to and during training. For example, here: the evolution of matrix rank as a proxy for information quantity in layers.
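A toy version of that rank check — the "checkpoints" below are synthetic (a fixed low-rank matrix plus shrinking noise), standing in for a layer's weights at successive training stages:

```python
# Sketch: track the effective rank of a layer's weight matrix across
# training checkpoints, as a cheap proxy for the information it encodes.
import numpy as np

def effective_rank(W, tol=0.1):
    # numerical rank: count singular values above a fraction of the largest
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(1)
# a rank-4 matrix plus shrinking noise, mimicking a layer converging
# toward low-rank structure as "training" progresses
base = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 30))
ranks = [effective_rank(base + noise * rng.standard_normal((50, 30)))
         for noise in (1.0, 0.1, 1e-8)]
print(ranks)  # rank collapses toward 4 as the noise is "trained away"
```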
So, I think we should have a "software" approach, i.e. deep learning code being vetted before or while it is running. This could still be useful because training from scratch on a couple of batches might be enough to detect issues (and cost effective).
However, downstream tests do not pinpoint issues well enough (i.e. is the training data the issue? Is it the code? Instability?...)
First, from this post and @cwognum.bsky.social, @ihaque.bsky.social:
I agree that downstream test development is useful, and extra convenient (no need to retrain, can treat the model as a black box, gives interesting bio insights, etc.).
I saw the discussion on #BioMLeval pop up thanks to this post and @ianholmes.org. I think this is an interesting + extremely valuable discussion - super happy to see people interested in bioML eval.
I am very interested. Actually, at the CRG and within the @nf-co.re organisation, we are building an open-source framework that will have all those tests built in (it's in my bio). For that purpose, I collected many papers from various ML fields and would love to share/discuss.
There are many more intricate hypotheses I could think of; I think this is the right application of LLMs in bioML.
There are lots of things we could do with this:
- Are pathogenic variants less expected by *insert LLM method* than non-pathogenic variants?
- Is perplexity lower for notably conserved regions?
- Can this be used to find conserved regions in new genomes?
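All of these boil down to comparing perplexity between groups of sequences. A minimal sketch, with hand-written hypothetical per-token log-probabilities where the genomic LLM's output would go:

```python
# Sketch: perplexity comparison between sequence groups.
# `token_logprobs` would come from the LLM; values here are
# hypothetical, for illustration only.
import math

def perplexity(token_logprobs):
    # perplexity = exp(mean negative log-likelihood per token)
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# a uniform model over 4 nucleotides assigns log(1/4) per token...
uniform = [math.log(0.25)] * 10
# ...while a model confident on a conserved region assigns higher probs
conserved = [math.log(0.9)] * 10

print(perplexity(uniform))    # 4.0: no information beyond the alphabet
print(perplexity(conserved))  # ~1.11: lower perplexity on conserved regions
```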
Imo, one of the most interesting figures (S6.b) is buried in the supplementary data.
As I understand it: "if evo is good at predicting the next token in that sequence, then when it makes a mistake, it is likely due to an unexpected and impactful variant".
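That reading can be sketched as a surprisal scan — positions where the model's negative log-probability spikes are candidate impactful variants. The probabilities below are hypothetical stand-ins for model output:

```python
# Sketch: flag positions where a normally accurate model is "surprised"
# (high negative log-probability of the true base) as candidate variants.
# `probs` are hypothetical model outputs, for illustration only.
import math

probs = [0.9, 0.88, 0.92, 0.05, 0.91, 0.89]  # model prob of the true base
surprisal = [-math.log(p) for p in probs]

# flag positions well above the typical surprisal level
mean = sum(surprisal) / len(surprisal)
flagged = [i for i, s in enumerate(surprisal) if s > 2 * mean]
print(flagged)  # → [3], the one position where the model was "surprised"
```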