Skrub (@skrub-data) Bsky

The minimum required version of polars has been increased from 0.20 to 1.5.

3 weeks ago

The TableReport custom filters have been improved and expanded: they can now take skrub selectors for filtering columns. The interface has also been simplified.

3 weeks ago

The has_nulls selector can now select columns based on a user-specified threshold of null values.
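The idea of threshold-based selection can be sketched in plain Python, independently of skrub. This is only an illustration: the helper names are hypothetical, and it assumes the threshold is a fraction of null entries and that a column is selected when its null fraction strictly exceeds that threshold. The actual parameter name and comparison used by has_nulls are not stated in the post.

```python
def null_fraction(values):
    """Return the fraction of entries in a column that are None."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def columns_with_nulls(table, threshold=0.0):
    """Return the names of columns whose null fraction exceeds `threshold`.

    `table` maps column names to lists of values. With the default
    threshold of 0.0, any column containing at least one null is selected.
    (Hypothetical helper, not part of skrub.)
    """
    return [
        name for name, values in table.items()
        if null_fraction(values) > threshold
    ]


table = {
    "city": ["Paris", None, "Rome", "Oslo"],  # 1 null out of 4
    "income": [None, None, None, 41000],      # 3 nulls out of 4
    "age": [31, 45, 27, 52],                  # no nulls
}

any_nulls = columns_with_nulls(table)
mostly_null = columns_with_nulls(table, threshold=0.5)
```

Here `any_nulls` contains both "city" and "income", while `mostly_null` keeps only "income".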

3 weeks ago

It is now possible to provide custom null values to the Cleaner, so that they are marked as nulls (for example, the string "unknown").
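As a rough illustration of what marking custom null values means, here is a plain-Python sketch. The function name and its `null_strings` argument are hypothetical and do not reflect the Cleaner's actual parameter; the sketch also assumes exact string matching.

```python
def mark_custom_nulls(values, null_strings=("unknown",)):
    """Replace entries equal to one of `null_strings` with None.

    Matching is exact here; this is an assumption of the sketch,
    not a statement about how skrub's Cleaner compares values.
    """
    targets = set(null_strings)
    return [
        None if isinstance(v, str) and v in targets else v
        for v in values
    ]


raw = ["red", "unknown", "blue", "n/a", 7]
cleaned = mark_custom_nulls(raw, null_strings=("unknown", "n/a"))
```

After this step, downstream code that counts or imputes missing values sees the sentinel strings as ordinary nulls.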

3 weeks ago

The performance of DataOps with many computational nodes has been improved. Additionally, DataOps CV splitters can now take kwargs. For example, this makes it possible to specify groups when creating train/test splits.
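To see why passing groups to a splitter matters, consider a minimal group-aware split written in plain Python. In scikit-learn, group-aware splitters such as GroupKFold receive a `groups` argument in `split()`; the function below is a hypothetical stand-in for that behavior and does not show how skrub forwards the kwargs.

```python
def group_train_test_split(groups, test_fraction=0.25):
    """Split row indices so that no group appears on both sides.

    The last groups in sorted order form the test set; their number is
    `test_fraction` of the distinct groups, rounded, with at least one.
    (Hypothetical helper for illustration only.)
    """
    unique = sorted(set(groups))
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[-n_test:])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# One group label per row, e.g. a patient or customer identifier.
groups = ["a", "a", "b", "c", "c", "d", "b", "d"]
train_idx, test_idx = group_train_test_split(groups, test_fraction=0.25)

train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
```

Because `train_groups` and `test_groups` are disjoint, rows from the same group cannot leak between the training and test sets.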

3 weeks ago

The SingleColumnTransformer and RejectColumn classes allow the construction of custom-made transformers for specific use cases.

3 weeks ago

The ApplyToCols transformer is now a powerful alternative to the regular scikit-learn ColumnTransformer. It is now possible to apply any transformer to a subset of chosen columns using the skrub selectors.
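The pattern behind this (transform the selected columns, pass the others through unchanged) can be sketched without any library. Everything below is a hypothetical illustration in plain Python: in skrub itself, the post attributes this role to ApplyToCols combined with the skrub selectors.

```python
def apply_to_cols(table, transform, select):
    """Apply `transform` to the columns chosen by `select`; pass the rest through.

    `table` maps column names to lists of values, `select(name, values)`
    returns True for columns to transform, and `transform(values)` returns
    the new list of values. (Illustrative helper, not skrub's API.)
    """
    return {
        name: transform(values) if select(name, values) else values
        for name, values in table.items()
    }


def is_numeric(name, values):
    """Selector: True when every value is an int or a float (bools excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
    )


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero for constant columns
    return [(v - low) / span for v in values]


table = {"name": ["ann", "bob", "cy"], "height": [150, 175, 200]}
scaled = apply_to_cols(table, min_max_scale, is_numeric)
```

Only "height" is rescaled; "name" is returned untouched, which is the behavior that makes this pattern an alternative to listing columns by hand.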

3 weeks ago

Release history Release 0.8.0: New Features: The eager_data_ops configuration option has been added. When set to False, no previews are computed and validation is deferred until the DataOp is actually used (e.g. w...

✨ skrub version 0.8.0 has been released ✨

This version includes several new features, including multiple improvements to the functionality and performance of the Data Ops, along with a few bug fixes and improvements to the docs.

Changelog:
skrub-data.org/stable/CHANG...

Highlights below ⤵️

3 weeks ago

You can contact us either here or on our Discord server: discord.gg/ABaPnm7fDC

2 months ago

In addition, we will begin crediting specific contributors here on Bluesky when a contributor has worked on the subject of the post. We will use GitHub handles for this purpose. If you prefer your handle not to be used or would like to be credited by name instead, please let us know.

2 months ago

As a follow-up, we would like to clarify how we’ll be crediting contributors moving forward.

Currently, all contributions to the repository are tracked in the changelog and highlighted in the release notes, where each PR and the GitHub handle of its author are listed.

2 months ago

Thanks to e-strauss for writing this example!

2 months ago

Using PyTorch (via skorch) in DataOps This example shows how to wrap a PyTorch model with skorch and plug it into a skrub DataOps plan. The main goal here is to show the integration pattern: PyTorch defines the model (an nn.Module), sk...

While skrub Data Ops shine when preparing dataframes, their capabilities extend beyond that. For example, they can be used alongside libraries like PyTorch and skorch to work with images, and tune the model size to find the best set of hyperparameters:

skrub-data.org/stable/auto_...

2 months ago

- A new example has been added to show how skrub Data Ops can be used with PyTorch and skorch to solve an image classification task.

skrub-data.org/stable/auto_...

2 months ago

Main changes:
- The StringEncoder now exposes the vocabulary parameter, allowing it to be passed to the underlying TfidfVectorizer.
- The function compute_ngram_distance has been made private to reduce clutter.
- The repository wheel has been made smaller by removing some benchmarking material.

2 months ago

✨ skrub version 0.7.2 has been released ✨

In this release we squashed more bugs, improved the API reference, and added a new example.

github.com/skrub-data/s...

2 months ago

Tuning DataOps with Optuna This example shows how to use Optuna to tune the hyperparameters of a skrub DataOp. As seen in the previous example, skrub DataOps can contain “choices”, objects created with choose_from(), choose_...

Here is a full example of how to use skrub Data Ops with Optuna:

skrub-data.org/stable/auto_...

2 months ago

At the end, you get a fully-fledged Optuna study to work with. Of course, that includes support for the Optuna dashboard and access to the Optuna reporting and plotting interfaces.

2 months ago

Three snippets of Python code showing how to use skrub Data Ops with the Optuna optimization library. The first snippet shows a standard randomized search with the Data Ops. The second snippet adds the parameter "backend", which is set to "optuna". The third snippet uses the Optuna visualization API to plot information from the study.

Did you know that the skrub Data Ops support Optuna as a backend for running hyperparameter search?

It's as easy as writing "backend='optuna'": this will set up a default Optuna study (and the TPE sampler) to replace the standard random sampler.

2 months ago

Release Skrub release 0.7.1 · skrub-data/skrub Release 0.7.1 New features A new dataset, fetch_california_housing(), has been added to the skrub.datasets module. It allows to get a redundancy copy of the scikit-learn fetch_california_housing()...

Happy new year! 🎉🎉🎉

Let's celebrate 2026 with a bugfix release that implements some fixes, brings some documentation improvements and adds a new dataset fetcher:

github.com/skrub-data/s...

3 months ago

The course covers:
- How to explore and sanitize data with skrub
- How to use the skrub transformers for powerful and reliable feature engineering
- How to put everything together in a machine learning pipeline

Skrub Data Ops are not included (yet).

4 months ago

skrub like a pro: clean, prepare, and transform your data faster - Inria Academy

Do you want to learn how to use skrub like a pro? Then you're in luck!

Inria Academy is providing an introductory course on skrub aimed at IT personnel, engineers, data scientists, and data analysts.

www.inria-academy.fr/formation/sk...

4 months ago

Skrub: machine learning for dataframes YouTube video by PyData

The recording of the talk we gave at @pydataparis.bsky.social 2025 is now available on the PyData YouTube channel! 🚀

You can find it here, if you want to check it out 👀

www.youtube.com/watch?v=k9MN...

4 months ago

Release Skrub release 0.7.0 · skrub-data/skrub Release 0.7.0 ✨ Highlights Data Ops can now be tuned with Optuna. It is now possible to pass extra named arguments to an estimator through DataOps.skb.apply. The TableReport now supports numpy arr...

Skrub 0.7.0 is here! 🎉

✨ Main highlights:
- Tune hyperparameter choices with Optuna
- Added support for Pandas 3.0
- Estimators in data ops can now take additional kwargs

16 new contributors helped with this release 👥

Check out the full changelog: github.com/skrub-data/s...

4 months ago

Clean code in Data Science - Gael Varoquaux - Skrub DataOps, Probabl: YouTube video by dotconferences

@skrub-data.bsky.social: better data-science primitives for clean code on dataframes

Watch my dotAI talk, it's fun (live coding)!
www.youtube.com/watch?v=bQS4...
skrub really makes it easy to do machine learning with dataframes

5 months ago

ApplyToFrame Gallery examples: Hands-On with Column Selection and Transformers

skrub-data.org/stable/refer...

6 months ago

ApplyToCols Gallery examples: Getting Started Hands-On with Column Selection and Transformers

skrub-data.org/stable/refer...

6 months ago

Hands-On with Column Selection and Transformers In previous examples, we saw how skrub provides powerful abstractions like TableVectorizer and tabular_pipeline() to create pipelines. In this new example, we show how to create more flexible pipel...

Example: skrub-data.org/stable/auto_...

6 months ago

For even more control over column selection, skrub provides a collection of selectors that let you partition dataframes by data type, column name, or user-specified functions.

6 months ago

All these transformers can be chained and inserted in a scikit-learn pipeline to build a feature matrix with complex column selection operations, and can be seen as an alternative to the scikit-learn ColumnTransformer.

6 months ago
