Huge thanks to co-authors: Christopher Lance, Malte Lücken, @fabiantheis.bsky.social, Daniel Burkhardt, and many others! Tons of thanks to all the competition participants. And of course, thanks Kaggle for hosting and support. Read the paper: biorxiv.org/content/10.6...
Posts by Vladimir Shitov
We noted that competitors made little use of the prior biological information available in databases. It turns out such priors are not always helpful: for the CITE-seq task, adding prior information slightly boosted performance, but for Multiome prediction, priors actually made it worse. 16/17
Importantly, we show that top-performing models learn regulatory pathways. Using SHAP score analysis, we demonstrate that models capture known genetic regulators as well as yet-unexplored genes. 15/17
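The general workflow of attributing a model's predictions to input genes can be sketched as follows. This is an illustrative toy example, not the paper's analysis; it uses scikit-learn's permutation importance as a dependency-light stand-in for SHAP values, and the data, model, and feature counts are all made up:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Toy setup: 30 input "genes", of which only the first 3 drive the target protein
X = rng.normal(size=(400, 30))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=400)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Rank inputs by how much shuffling each one degrades the predictions;
# the informative "genes" should surface at the top, like known regulators did
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:3]
print(sorted(top.tolist()))
```

With real models one would use the `shap` package instead, which additionally gives per-cell, signed attributions rather than a single global ranking.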
Validation set selection is no less important than modelling. Choose a bad validation set, and it will prioritise models that work poorly on unseen data. Here’s the correlation of validation score with private test score, evaluated on variants of top-performing models:
Adversarial validation is not just a competition trick. If you know exactly the X for which you’d like to predict y, select the part of your training data most similar to that X. E.g., if you need to predict for a particular patient, validate your models on the most similar patients.
One of my favourite tricks was adversarial validation, which helped a top-3 performer create generalisable models. The idea is to select a subset of training data most similar to the test data (by available X, e.g., RNA) to validate your models’ performance. 12/17
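A minimal sketch of adversarial validation with scikit-learn (toy data and variable names are illustrative, not from the winning solution):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: training cells and test cells with a shift in X (e.g. a new donor)
X_train = rng.normal(0.0, 1.0, size=(500, 20))
X_test = rng.normal(0.5, 1.0, size=(200, 20))

# 1. Label rows by origin and train a classifier to tell train from test apart
X_all = np.vstack([X_train, X_test])
origin = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
clf = LogisticRegression(max_iter=1000).fit(X_all, origin)

# 2. Score each training row by how "test-like" the classifier finds it
test_likeness = clf.predict_proba(X_train)[:, 1]

# 3. Validate on the most test-like training rows
n_val = 100
val_idx = np.argsort(test_likeness)[-n_val:]
```

A side benefit: if the classifier cannot beat chance, your training and test data are already exchangeable and a random split is fine.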
Many competitors did not have prior experience with single-cell data analysis and the biases associated with it. They used unconventional preprocessing methods and brought many novel ideas. If you are a model developer, this is a must-see! 11/17
The core feature of the top Multiome model was predicting a low-dimensional representation of the data (SVD features) and correcting it by predicting residuals. 10/17
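The predict-low-rank-then-correct-residuals pattern can be sketched in a few lines. This is a simplified illustration under made-up data, with Ridge regression standing in for the actual models used in the winning solution:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                                        # e.g. ATAC-derived features
Y = X @ rng.normal(size=(50, 40)) + 0.1 * rng.normal(size=(300, 40))  # e.g. gene expression

# 1. Compress the high-dimensional target into a few SVD components
svd = TruncatedSVD(n_components=10, random_state=0).fit(Y)
Z = svd.transform(Y)

# 2. Predict the low-dimensional representation, then map back to gene space
stage1 = Ridge().fit(X, Z)
Y_hat = svd.inverse_transform(stage1.predict(X))

# 3. Predict the residuals in the original space and correct the reconstruction
stage2 = Ridge().fit(X, Y - Y_hat)
Y_final = Y_hat + stage2.predict(X)
```

The first stage captures the dominant structure cheaply; the residual stage recovers signal that the low-rank representation discards.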
We thoroughly dissected the winning models to understand what decisions led to good predictions. We were able to create even simpler models that still preserve the performance. Here’s an example of the top CITE-seq model (top 2 overall). Orange parts can be removed or simplified:
What made winning solutions so good? In short:
1. Extensive and diverse preprocessing of the data
2. Smart validation strategies to select the most generalisable models
3. Neural networks and model ensembles
8/17
The Multiome task was more challenging and higher-dimensional, so there is still room for improvement. 7/17
In fact, the best CITE-seq model predicted proteins for cells from an unseen donor and day better than a KNN baseline trained on all data, including the test set (red line). This demonstrates that the model learned generalisable regulatory patterns. 6/17
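A KNN baseline of this kind can be sketched in a few lines (toy data, feature counts, and the neighbour count are illustrative, not the paper's exact setup):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Toy stand-ins: RNA profiles (X) and surface-protein levels (Y) per cell
X_train = rng.normal(size=(300, 100))
Y_train = X_train[:, :10] + 0.1 * rng.normal(size=(300, 10))
X_new = rng.normal(size=(50, 100))

# Predict each protein as the average over the k most RNA-similar cells
knn = KNeighborsRegressor(n_neighbors=15).fit(X_train, Y_train)
Y_pred = knn.predict(X_new)
```

Such a baseline is hard to beat on in-distribution cells, which is what makes outperforming it on an unseen donor and day meaningful.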
The CITE-seq task was solved particularly well, with the top-performing model reaching an average Pearson’s R of 0.848. Per-protein scores show that the best CITE-seq prediction model performed well across all surface proteins in the data. 5/17
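Per-protein Pearson scores of this kind can be computed column-wise with plain numpy (toy truth and predictions here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy truth and predictions: cells x proteins
Y_true = rng.normal(size=(200, 5))
Y_pred = Y_true + 0.3 * rng.normal(size=(200, 5))  # noisy but correlated

def per_protein_pearson(y_true, y_pred):
    """Pearson's R for each protein (column) separately."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    return (yt * yp).sum(axis=0) / np.sqrt((yt**2).sum(axis=0) * (yp**2).sum(axis=0))

scores = per_protein_pearson(Y_true, Y_pred)
```

Averaging per-column correlations (rather than one global correlation) prevents a few highly expressed proteins from dominating the score.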
To ensure models are generalisable, we evaluated them on an unseen donor and on a different day of differentiation. And competitors did extremely well! Here's the performance of the top 100 submissions in the CITE-seq and Multiome tasks. 4/17
The competitors had to solve 2 challenging tasks: predicting gene expression from chromatin accessibility (DNA → RNA, called the Multiome task), and predicting surface protein expression from RNA levels (CITE-seq task) in donor-derived peripheral blood mononuclear cells. 3/17
Check out our preprint: biorxiv.org/content/10.6...
We provide cleaned-up code for 2 winning models and example notebooks on how to run them with new data. See our GitHub: github.com/lueckenlab/O...
2/17
How can we learn gene-regulation patterns from multimodal single-cell data? To answer this question, we at OpenProblems organised a Kaggle competition on modality prediction. It ran back in 2022 but remains the world's largest competition in the single-cell field. 1/17
No matter how much you love what you do, this one thing gives you a 10x energy boost
#phd
Fun fact: it was supposed to be a quick one-month project on the intersection of ethics and single-cell research to produce a one-page comment. But we got carried away and wrote a bit more 😅 I hope you learn something useful! I certainly did when working on it. 10/10
Want to see more examples and details? Check out the full publication: nature.com/articles/s41...
Thanks to all co-authors, especially @theresawillem.bsky.social, who did most of the work,
Malte Lücken, who initiated the collaboration, and
@fabiantheis.bsky.social. 9/10
6. Result interpretation bias. The complexity of modern methods sometimes leads to wrong interpretation of results. The literature has examples of conclusions drawn from over-reading UMAP plots, and of useless models being praised because of data leakage into the metrics. 8/10
5. Machine learning bias. Batch effects in the data, failure to account for outliers, limitations of the models used, or wrong metrics can all lead to incorrect results. 7/10
4. Single-cell sequencing bias. Some cell types are often missing from the data for technical reasons (e.g. neutrophils). And even for captured cells, we don't see all RNA copies because of dropout. 6/10
3. Cohort bias. The number of donors in single-cell studies is still quite low (see previous post: x.com/shitov_happe..., sorry for the X link). Moreover, most samples in the datasets come from individuals of European ancestry. This can limit the generalisation of conclusions to other populations. 5/10
2. Clinical bias. Patients with different conditions are not sampled uniformly. In particular, "healthy" controls might not reflect the population norm well. Not everyone wants to donate a piece of their lung or brain to science. 4/10
1. Societal bias. The samples likely come from clinics or research institutions with substantial funding to run single-cell experiments. Not everyone has access to such institutions. Be careful when extrapolating your conclusions to the general population. 3/10
Recently, a number of methods emerged for working with single-cell data at the sample level. We call them sample (in a clinical context – patient) representation methods. They enable patient stratification, prognostic and diagnostic capabilities. But be aware of the biases! 2/10
When applying machine learning to human health data, it is not enough to just improve a metric by another percent. We have to go deeper. In our perspective in Nature Cell Biology, we discuss caveats and biases of human single-cell data analysis: nature.com/articles/s41...
🧵 1/10
How do biases affect machine-learning models of human single-cell data? And what can we do about it? In our new Perspective article, "Biases in machine-learning models of human single-cell data," published in Nature Cell Biology, we explore these pressing questions.
👉🏻 www.nature.com/articles/s41...
That sometimes led to amazing comebacks. An ace could massacre an entire group, but then meet a six and lose the army. It was also fascinating to think about the best strategies for where to place your strongest and weakest cards.