
Posts by Vladimir Shitov

Huge thanks to co-authors: Christopher Lance, Malte Lücken, @fabiantheis.bsky.social, Daniel Burkhardt, and many others! Tons of thanks to all the competition participants. And of course, thanks to Kaggle for hosting and supporting the competition. Read the paper: biorxiv.org/content/10.6...

1 month ago

We noted that competitors did not rely much on the prior biological information available in databases. It turns out such priors are not always helpful: for CITE-seq, adding prior information slightly boosted performance, but for Multiome prediction it actually made things worse. 16/17

1 month ago

Importantly, we show that top-performing models learn regulatory pathways. Using SHAP score analysis, we demonstrate that the models capture known genetic regulators as well as genes that remain unexplored. 15/17

1 month ago

Validation set selection is no less important than modelling. Choose a bad validation set, and it will prioritise models that perform poorly on unseen data. Here's the correlation of validation score with private test score, evaluated on variants of the top-performing models:

1 month ago

Adversarial validation is not only a competition trick. If you know exactly the X for which you'd like to predict y, select the part of your training data most similar to that X. E.g., if you need to predict for a particular patient, validate your models on the most similar patients.

1 month ago
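The adversarial validation idea can be sketched in a few lines. This is a hedged illustration, not code from the paper or the winning solutions: a classifier is trained to distinguish train rows from test rows, and the most test-like training rows become the validation set. All names (`adversarial_validation_split`, the choice of logistic regression, the 20% fraction) are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_validation_split(X_train, X_test, val_fraction=0.2):
    """Pick the training rows that look most like the test data."""
    # Label each row by its origin: 0 = train, 1 = test
    X = np.vstack([X_train, X_test])
    y = np.array([0] * len(X_train) + [1] * len(X_test))
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Probability that each *training* row resembles the test set
    p_test_like = clf.predict_proba(X_train)[:, 1]

    order = np.argsort(p_test_like)
    n_val = int(len(X_train) * val_fraction)
    val_idx = order[-n_val:]    # most test-like rows -> validation
    train_idx = order[:-n_val]  # the rest -> training
    return train_idx, val_idx
```

If the classifier cannot tell train from test apart (AUC near 0.5), a random split is just as good; the trick matters precisely when the two distributions differ.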

One of my favourite tricks was adversarial validation, which helped a top 3 performer to create generalisable models. The idea is to select a subset of training data most similar to the test data (by available X, e.g., RNA) to validate your models’ performance. 12/17

1 month ago

Many competitors did not have prior experience with single-cell data analysis and the biases associated with it. They used unconventional preprocessing methods and brought many novel ideas. If you are a model developer, this is a must-see! 11/17

1 month ago

The core feature of the top Multiome model was predicting a low-dimensional representation of the data (SVD features) and correcting it by predicting residuals. 10/17

1 month ago
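The two-stage scheme above can be sketched with generic models. This is a minimal illustration of the idea, not the winner's actual architecture: stage one predicts truncated-SVD scores of the targets, stage two predicts the residuals left over after reconstruction. Function names and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

def fit_svd_plus_residual(X, Y, n_components=16):
    """Stage 1: predict low-dimensional SVD scores of Y from X.
    Stage 2: predict the residual Y minus the reconstruction."""
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    Z = svd.fit_transform(Y)                     # low-dim targets
    stage1 = Ridge().fit(X, Z)
    Y_hat = stage1.predict(X) @ svd.components_  # back to full space
    stage2 = Ridge().fit(X, Y - Y_hat)           # residual corrector
    return svd, stage1, stage2

def predict_svd_plus_residual(svd, stage1, stage2, X):
    return stage1.predict(X) @ svd.components_ + stage2.predict(X)
```

Restricting stage one to a handful of SVD components regularises the very high-dimensional target; the residual model then recovers signal that the low-rank reconstruction discards.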

We thoroughly dissected the winning models to understand which decisions led to good predictions, and were able to build even simpler models that preserve the performance. Here's an example of the top CITE-seq model (top 2 overall); the orange parts can be removed or simplified:

1 month ago

What made winning solutions so good? In short:
1. Extensive and diverse preprocessing of the data
2. Smart validation strategies to select the most generalisable models
3. Neural networks and model ensembles
8/17

1 month ago

The Multiome task was more challenging and higher-dimensional, so there is still room for growth. 7/17

1 month ago

In fact, the best CITE-seq model predicted proteins for cells from an unseen donor and day better than a KNN baseline trained on all data, including the test set (red line). This demonstrates that the model learned generalisable regulatory patterns. 6/17

1 month ago
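A KNN baseline of the kind mentioned above can be sketched as follows. This is a generic nearest-neighbour regressor on RNA features, not the exact baseline used in the paper; the function name and `k` value are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_baseline(X_rna_train, Y_protein_train, X_rna_test, k=25):
    """Predict each test cell's protein levels as the average over
    its k nearest training cells in RNA space."""
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_rna_train, Y_protein_train)
    return knn.predict(X_rna_test)
```

Such a baseline memorises rather than generalises, which is exactly why beating it on an unseen donor and day is meaningful.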

The CITE-seq task was solved particularly well, with the top-performing model reaching an average Pearson's R of 0.848. Per-protein scores show that the best CITE-seq prediction model performed well across all surface proteins in the data. 5/17

1 month ago
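The score above can be reproduced with a small helper. This sketch assumes the reported number is the Pearson correlation computed per cell (row) and then averaged; the function name is illustrative.

```python
import numpy as np

def mean_rowwise_pearson(Y_true, Y_pred):
    """Pearson correlation of each row (cell), averaged over rows."""
    yt = Y_true - Y_true.mean(axis=1, keepdims=True)
    yp = Y_pred - Y_pred.mean(axis=1, keepdims=True)
    num = (yt * yp).sum(axis=1)
    den = np.sqrt((yt ** 2).sum(axis=1) * (yp ** 2).sum(axis=1))
    return float(np.mean(num / den))
```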

To ensure the models are generalisable, we evaluated them on an unseen donor and on a different day of differentiation. And competitors did extremely well! Here's the performance of the top 100 submissions in the CITE-seq and Multiome tasks. 4/17

1 month ago

The competitors had to solve 2 challenging tasks: predicting gene expression from chromatin accessibility (DNA → RNA, called the Multiome task), and predicting surface protein expression from RNA levels (CITE-seq task) in donor-derived peripheral blood mononuclear cells. 3/17

1 month ago

Check out our preprint: biorxiv.org/content/10.6...

We provide cleaned-up code for 2 winning models and example notebooks on how to run them with new data. See our GitHub: github.com/lueckenlab/O...
2/17

1 month ago

How can models learn gene regulation patterns from multimodal single-cell data? To answer this question, we at OpenProblems organised a Kaggle competition on modality prediction. It ran back in 2022 but remains the world's largest competition in the single-cell field. 1/17

1 month ago

No matter how much you love what you do, this one thing gives you a 10x energy boost

#phd

11 months ago

Fun fact: it was supposed to be a quick one-month project on the intersection of ethics and single-cell research to produce a one-page comment. But we got carried away and wrote a bit more 😅 I hope you learn something useful! I certainly did when working on it. 10/10

1 year ago
Preview: "Biases in machine-learning models of human single-cell data" – Nature Cell Biology. This Perspective discusses the various biases that can emerge along the pipeline of machine learning-based single-cell analysis and presents methods to train models on human single-cell data in order ...

Want to see more examples and details? Check out the full publication: nature.com/articles/s41...

Thanks to all co-authors, especially @theresawillem.bsky.social, who did most of the work,
Malte Lücken, who initialised the collaboration, and
@fabiantheis.bsky.social. 9/10

1 year ago

6. Result interpretation bias. The complexity of modern methods sometimes leads to wrong interpretation of the results. The literature contains examples of conclusions drawn from UMAP plots, and of useless models being praised because of data leakage into the metrics. 8/10

1 year ago

5. Machine learning bias. Batch effects in the data, unhandled outliers, limitations of the chosen models, or inappropriate metrics can all lead to incorrect results. 7/10

1 year ago

4. Single-cell sequencing bias. Some cell types are often missing from the data for technical reasons (e.g. neutrophils). And even for captured cells, we don't see all RNA copies because of dropout. 6/10

1 year ago

3. Cohort bias. The number of donors in single-cell studies is still quite low (see previous post: x.com/shitov_happe..., sorry for the X link). Moreover, most samples in the datasets come from individuals of European ancestry, which can limit the generalisation of conclusions to other populations. 5/10

1 year ago

2. Clinical bias. Patients with different conditions are not sampled uniformly. In particular, "healthy" controls might not reflect the population norm well: not everyone wants to donate a piece of their lung or brain to science. 4/10

1 year ago

1. Societal bias. The samples most likely come from clinics or research institutions wealthy enough to run single-cell experiments, and not everyone has access to those. Be careful when extrapolating your conclusions to the general population. 3/10

1 year ago

Recently, a number of methods have emerged for working with single-cell data at the sample level. We call them sample (in a clinical context, patient) representation methods. They enable patient stratification as well as prognostic and diagnostic applications. But be aware of the biases! 2/10

1 year ago
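One of the simplest sample-representation approaches in this spirit is pseudobulking: averaging gene expression over all cells of each sample to get one vector per patient. A minimal sketch (names are illustrative; real representation methods are considerably more elaborate):

```python
import numpy as np
import pandas as pd

def pseudobulk(expression, sample_ids):
    """Average a cells x genes matrix over the cells of each sample,
    yielding one expression vector per sample (patient)."""
    df = pd.DataFrame(np.asarray(expression))
    return df.groupby(np.asarray(sample_ids)).mean()
```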

When applying machine learning to human health data, it is not enough to just improve a metric by another percent. We have to go deeper. In our perspective in Nature Cell Biology, we discuss caveats and biases of human single-cell data analysis: nature.com/articles/s41...
🧵 1/10

1 year ago

How do biases affect machine-learning models of human single-cell data? And what can we do about it? In our new Perspective article, "Biases in machine-learning models of human single-cell data," published in Nature Cell Biology, we explore these pressing questions.

👉🏻 www.nature.com/articles/s41...

1 year ago

That sometimes led to amazing comebacks. An ace could massacre an entire group, but then meet a six and lose the army. It was also fascinating to think about the best strategies for where to place your strongest and weakest cards.

1 year ago