The minimum required version of polars has been increased from 0.20 to 1.5.
The TableReport custom filters have been improved and expanded: they can now take skrub selectors for filtering columns. The interface has also been simplified.
The has_nulls selector can now select columns based on a user-specified threshold of null values.
It is now possible to provide custom null values to the Cleaner, so that they are marked as nulls (for example, the string "unknown").
The performance of DataOps with many computational nodes has been improved. Additionally, DataOps CV splitters can now take kwargs. For example, this makes it possible to specify groups when creating train/test splits.
The SingleColumnTransformer and RejectColumn classes allow the construction of custom-made transformers for specific use cases.
The ApplyToCols transformer is now a powerful alternative to the regular scikit-learn ColumnTransformer. It is now possible to apply any transformer to a subset of chosen columns using the skrub selectors.
✨ skrub version 0.8.0 has been released ✨
This version brings several new features, including multiple improvements to the functionality and performance of the Data Ops, along with a few bug fixes and improvements to the docs.
Changelog:
skrub-data.org/stable/CHANG...
Highlights below ⤵️
In addition, we will begin crediting specific contributors here on Bluesky when a contributor has worked on the subject of the post. We will use GitHub handles for this purpose. If you prefer your handle not to be used or would like to be credited by name instead, please let us know.
As a follow-up, we would like to clarify how we’ll be crediting contributors moving forward.
Currently, all contributions to the repository are tracked in the changelog and highlighted in the release notes, where each PR and the GitHub handle of its author are listed.
Thanks to e-strauss for writing this example!
While skrub Data Ops shine when preparing dataframes, their capabilities extend beyond that. For example, they can be used alongside libraries like PyTorch and skorch to work with images, and tune the model size to find the best set of hyperparameters:
skrub-data.org/stable/auto_...
- A new example has been added to show how skrub Data Ops can be used with PyTorch and skorch to solve an image classification task.
skrub-data.org/stable/auto_...
Main changes:
- The StringEncoder now exposes the vocabulary parameter, allowing it to be passed to the underlying TfidfVectorizer.
- The function compute_ngram_distance has been made private to reduce clutter.
- The repository wheel has been made smaller by removing some benchmarking material.
✨ skrub version 0.7.2 has been released ✨
In this release we squashed more bugs, improved the API reference, and added a new example.
github.com/skrub-data/s...
At the end, you get a fully-fledged Optuna study to work with. Of course, that includes support for the Optuna dashboard and access to the Optuna reporting and plotting interfaces.
[Image: three snippets of Python code showing how to use skrub Data Ops with the Optuna optimization library. The first snippet shows a standard randomized search with the Data Ops. The second snippet adds the parameter "backend", which is set to "optuna". The third snippet uses the Optuna visualization API to plot information from the study.]
Did you know that the skrub Data Ops support Optuna as a backend for hyperparameter search?
It's as easy as writing "backend='optuna'": this will set up a default Optuna study (and the TPE sampler) to replace the standard random sampler.
Happy new year! 🎉🎉🎉
Let's celebrate 2026 with a bugfix release that implements some fixes, brings some documentation improvements and adds a new dataset fetcher:
github.com/skrub-data/s...
The course covers:
- How to explore and sanitize data with skrub
- How to use the skrub transformers for powerful and reliable feature engineering
- How to put everything together in a machine learning pipeline
Skrub Data Ops are not included (yet).
Do you want to learn how to use skrub like a pro? Then you're in luck!
Inria Academy is providing an introductory course on skrub aimed at IT personnel, engineers, data scientists, and data analysts.
www.inria-academy.fr/formation/sk...
The recording of the talk we did at @pydataparis.bsky.social 2025 is now available on the PyData YouTube channel! 🚀
You can find it here, if you want to check it out 👀
www.youtube.com/watch?v=k9MN...
Skrub 0.7.0 is here! 🎉
✨ Main highlights:
- Tune hyperparameter choices with Optuna
- Added support for Pandas 3.0
- Estimators in data ops can now take additional kwargs
16 new contributors helped with this release 👥
Check out the full changelog: github.com/skrub-data/s...
@skrub-data.bsky.social: better data-science primitives for clean code on dataframes
Watch my dotAI talk, it's fun (live coding)!
www.youtube.com/watch?v=bQS4...
skrub really makes it easy to do machine learning with dataframes
For even more control over column selection, skrub provides a collection of selectors that let you partition dataframes by data type, column name, or user-specified functions.
All these transformers can be chained and inserted in a scikit-learn pipeline to build a feature matrix with complex column selection operations, and can be seen as an alternative to the scikit-learn ColumnTransformer.