✨ skrub version 0.8.0 has been released ✨
This version brings several new features, including multiple improvements to the functionality and performance of the Data Ops, along with a few bug fixes and improvements to the docs.
Changelog:
skrub-data.org/stable/CHANG...
Highlights below ⤵️
Posts by Riccardo Cappuzzo
For context, $375M is about two days' worth of profit for Meta in 2025
a screenshot of an email that reads "hey sorry - my agent got a mind of its own and started applying for jobs for me"
What a world we are already living in
www.adriankrebs.ch/blog/dead-in...
And this is the result recorded with asciinema
Selecting a file opens it in VS Code at the given line, very convenient
Tags are built with this
the screenshot of a shell script that uses fzf and ripgrep to find substrings, classes, and files in the skrub repository
Rabbit hole of the day: writing a command that fuzzy searches in the repository for any substring, shows me a preview of the line with context and opens the file at the given line in VS Code.
Requires fzf, universal-ctags and batcat
And yes I understand that there may be some (a lot of?) human influence on the blog post. However, whether it was a human or an AI doesn't matter: the end result is the same, and the conditions for the same thing to happen with no human in the loop are likely already here.
I've always thought OpenClaw was a bad idea (giving an AI agent free rein over my PC? *insanity*)
I did not realize it was "write a hit piece and publish it in a blog in retaliation for closing a PR" bad.
theshamblog.com/an-ai-agent-...
good pr
a green baseball cap with "man I love Fauna" written on it
I'm sorry, I couldn't resist
The more I hear about Clawbot the more I'm convinced it's some kind of social experiment trying to figure out how many people are willing to put their entire private and professional lives in the hands of an overeager child open to the unlimited influence of the world wide web
This was an interesting bug to track down.
A short script demonstrating how the `guess_datetime_format` function of pandas does not work as intended when trying to parse the datetime "1959-01-01 19:59:16": it returns None instead of returning the correct datetime format.
Funny bug of the day: if you try to use pandas' "guess_datetime_format" with datetimes where the hour and minute are the same as the year (like 1959 and 19:59), the parser will fail and return None.
This bug is present in pandas 2.3.3, but has been fixed in the dev version.
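A minimal sketch for reproducing this, assuming `guess_datetime_format` is importable from `pandas.tseries.api` (its public location since pandas 2.0). The standard-library check only confirms that the string is valid under the format one would expect to get back; what the pandas call returns depends on the installed version, so no output is shown here.

```python
from datetime import datetime

EXPECTED_FORMAT = "%Y-%m-%d %H:%M:%S"
value = "1959-01-01 19:59:16"  # hour and minute read "1959", same as the year

# Sanity check with the standard library: the string parses fine
# under the format we would expect pandas to guess.
parsed = datetime.strptime(value, EXPECTED_FORMAT)

try:
    # Assumption: pandas >= 2.0, where this function is public.
    from pandas.tseries.api import guess_datetime_format
except ImportError:
    guess_datetime_format = None

if guess_datetime_format is None:
    guess = None
    print("pandas (>= 2.0) is not available, skipping the guess")
else:
    # Affected versions return None here; fixed versions return a format string.
    guess = guess_datetime_format(value)
    print(f"guessed format: {guess!r}")
```

Swapping in a timestamp like "1959-01-01 18:30:16" is a handy control to see whether the year/time coincidence is what trips the parser.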
I've seen it being described as "Broetry". It's explored quite well in this article I read some time ago: fenwick.media/rewild/magaz...
Something that immediately ended up being a roadblock was the "This cell redefines variables..." error.
I realized that when I'm plotting dataframes I always end up chaining df operations across different cells and this is putting a wrench in that.
You might argue it's for the best, but still 😅
Random question shot into the ether: if I'm relying on VSCode's interactive windows to emulate notebooks, what are some reasons why I should switch to @marimo.io notebooks?
I haven't looked into marimo's features, so maybe I'm missing out on things I can't do from VSCode.
That's me! It was a fun presentation and we got a lot of interesting questions
Also people laughed at the memes which is the most important thing, obviously
"ok the test run is done, let's see"
...
"this will be hard to debug"
What a banger is skrub @skrub-data.bsky.social !
Big thumbs up for the sklearn team & the maintainer of this package
Thanks a lot for the compliments! I had a lot of fun giving the talk, and I'm happy to see people liked it
My first actual talk in front of a ton of people 🙃
Do you have to deal with numerical features that involve large outliers, and need to train linear models or neural networks?
Then you might want to try the skrub SquashingScaler. The SquashingScaler behaves like scikit-learn RobustScaler, but smoothly clips outliers to predefined boundaries.
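To give a rough idea of what "smoothly clips" means, here is a standard-library sketch. It is not skrub's implementation: values are centered on the median and scaled by the interquartile range, as RobustScaler does by default, then passed through a soft-clipping function that stays close to the identity near zero and approaches a fixed bound for large values. The function name `squash`, the bound of 3.0, and the exact squashing formula are assumptions made for illustration; the skrub documentation describes the real behaviour and parameters.

```python
import math
import statistics


def squash(values, bound=3.0):
    """Robust-scale values, then softly clip them into (-bound, bound).

    Illustrative only: this is not skrub's SquashingScaler.
    """
    # Quartiles of the data: q1, median, q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; cannot scale")
    # Robust scaling: center on the median, divide by the IQR.
    scaled = [(v - median) / iqr for v in values]
    # z / sqrt(1 + (z / bound)^2) is close to z near 0
    # and tends to +/- bound as |z| grows.
    return [z / math.sqrt(1.0 + (z / bound) ** 2) for z in scaled]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]  # one large outlier
squashed = squash(data)
for raw, out in zip(data, squashed):
    print(f"{raw:>8.1f} -> {out:+.3f}")
```

The outlier ends up just below the bound instead of dominating the feature's range, while the ordering of the values is preserved.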
context: www.youtube.com/watch?v=f7Mi...
Working hard on the next @skrub-data.bsky.social slide deck...
Today at #EuroScipy2025, @glemaitre58.bsky.social and I presented a tutorial on pitfalls of machine learning for imbalanced classification problems.
We discussed what (not) to do when fitting a classifier and obtaining degenerate precision or recall values.
probabl-ai.github.io/calibration-...
📢 Talk Announcement
"Skrub: machine learning for dataframes", by Guillaume Lemaitre, Jérôme Dockès and @riccardocappuzzo.com.
@skrub-data.bsky.social
📜 Talk info: pretalx.com/pydata-paris-2025/talk/T9KTPU
📅 Schedule: pydata.org/paris2025/schedule
🎟 Tickets: pydata.org/paris2025/tickets
Photo of Riccardo presenting skrub DataOps in a lecture room to an audience of ~50 people.
Attending the @skrub-data.bsky.social tutorial by @riccardocappuzzo.com and @glemaitre58.bsky.social at #EuroScipy2025. They introduce the new DataOps feature released in skrub 0.6.
Here is the repo with the material for the tutorial: github.com/skrub-data/E...