my takeaway today: I have a masculine tone 😂
Posts by Sarah Krasnik Bedell
after a few failed attempts at tone editing, this was unfortunately effective
Yeah that's what I'm thinking - pile definitely bad. Single feels so convenient, but wondering if that should be done or not. To be clear, I do it all the time too for convenience
Real talk: when is try/except to suppress an exception lazy (compared to checking the upstream data) vs the most efficient implementation?
I've done this plenty of times to only call out an API once instead of twice, but now looking back it feels like bad practice
#dataBS
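A rough sketch of the trade-off in that question, in Python. Everything here is made up for illustration (`FakeAPI`, `NotFound`, and both fetch functions are hypothetical, not a real client): checking first costs a second call, while try/except costs one call but is only safe when it catches the one error you expect.

```python
class NotFound(Exception):
    """Stand-in for whatever error a real API client would raise (assumption)."""


class FakeAPI:
    """Hypothetical in-memory API that counts how many calls it receives."""

    def __init__(self, records):
        self.records = records
        self.calls = 0

    def exists(self, key):
        self.calls += 1
        return key in self.records

    def get(self, key):
        self.calls += 1
        if key not in self.records:
            raise NotFound(key)
        return self.records[key]


def fetch_checked(api, key):
    """Check upstream first, then fetch: two calls when the record exists."""
    if not api.exists(key):
        return None
    return api.get(key)


def fetch_suppressed(api, key):
    """Fetch and catch only the expected error: always one call."""
    try:
        return api.get(key)
    except NotFound:  # narrow on purpose; a bare `except` would hide real bugs
        return None
```

The single-call version feels less lazy when the except clause names one specific, expected exception and nothing else.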
Yeah I used (or tried to use) Airflow but it kept eating so much RAM I could not do anything else on my laptop
Switched to prefect and never looked back
I wish it was more widespread in the industry
we live in the land where bon jovi and pitbull can do a collab for a new rendition of now or never. the american empire not only will last 1000 years, but it deserves to.
Alright #dataBS: for anyone using Databricks -
What do you mostly use it for? What made you choose the tool? Where do you find it solving your problems most?
If you're trying to grow as an IC: data engineering (requirement for every other data function)
If you're trying to run a data team: analytics (learning to work with stakeholders)
Or, use data as a gateway to learning and evolve your career once again
I guess what I mean is, OL is a framework, but relies on other tools to be useful. I'm thinking about where we will go for the one stop shop of answering - "this thing failed, why tho"
I'm curious why OL?
I recently watched the Airflow Summit 2023 video on it - isn't it just an Airflow plugin for DAGs that relies on manual hooks and lacks deep integration with data or infra assets? I'd also expect some UI around lineage.
If I'm making naive assumptions correct me
Yup 100%. But then tie that audit log to actual assets.
I think we need to step back and define lineage. Before we defined it just in terms of data assets.
But what if you're running a python ETL process pre-warehouse and your infra dies? The output of that job would be out of date.
That's also lineage, and not in SQL. So we need to solve for that too.
Not quite sure if it's your exact use case, but check out @prefect.io. Retries, logging, and caching right out of the box
A little bit of a different flavor, but I wrote this back in 2022 and feel like it still mostly applies today
sarahsnewsletter.substack.com/p/everyone-s...
Query languages for the SQL-esque ones, and data python packages for the others (to be exact)
I feel like 2022 was the year we tried to solve lineage with observability tools, got decently far but not far enough to fully understand failures, so we settled for alerting on failures instead.
Is 2025 going to be the year we solve for true lineage outside of the data warehouse?
#dataBS
So: make sure to run only your ML work on expensive GPUs, and run your lightweight ETL on small compute and utilize the warehouse credits you need to use before 2025 instead.
I'm hearing this is a problem when data eng / data platform become different teams.
Who's encountered this?
#dataBS
Going from dev to prod in literally anything should be easy.
This is still an unsolved problem.
Devops is a blocker, and data teams are still trying to figure out IaC.
On my mind today: something as simple as dynamic work pools in Prefect could solve this.
#dataBS
Auto spin up / spin down is not flashy these days, it's table stakes. Infra is so expensive - anything that can help save infra cost pays dividends.
Enjoyed this piece by @sarahkb.bsky.social on measuring PLS vs PLG, where the latter mostly doesn't work for enterprise sales. We had to figure out how to measure PLS early because Coginiti targeted customers in highly secure industries like gov, finance, & insurance.
Reddit is quickly growing its user base and content - there's definitely more mess than before, but I've found posting genuine, detailed comments gets engagement.
PS ignore the trolls, only way forward
Sure that's one use case
But what if an event happens but it's throttled to only run a thing every 5 min? Then it's not realtime
I think realtime is about the SLA of the output the event is triggering
So there's a venn diagram with an overlapping middle
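To make the throttling point concrete, here is a minimal sketch in Python. `ThrottledTrigger` and its parameters are hypothetical, and timestamps are passed in as plain seconds to keep it deterministic: every run is kicked off by an event, yet an event can still wait almost the full interval before anything processes it.

```python
class ThrottledTrigger:
    """Event-triggered job that runs at most once per `min_interval_s` seconds."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_run = None   # time of the last run, None before the first one
        self.pending = []      # events that arrived but have not been processed

    def on_event(self, event, now):
        """Queue the event; return the processed batch if a run is allowed, else None."""
        self.pending.append(event)
        throttled = (
            self.last_run is not None
            and now - self.last_run < self.min_interval_s
        )
        if throttled:
            return None
        batch, self.pending = self.pending, []
        self.last_run = now
        return batch


trigger = ThrottledTrigger(min_interval_s=300)  # the "every 5 min" case
trigger.on_event("a", now=0)    # runs immediately with ["a"]
trigger.on_event("b", now=10)   # throttled, "b" waits in the queue
trigger.on_event("c", now=300)  # runs with ["b", "c"]; "b" waited 290 seconds
```

So the job is event based, but whether it counts as realtime depends on whether a 290 second wait fits the SLA of the output.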
Totally fair. I do think oftentimes realtime and event based get confused as one, which they're not
Event based != realtime
Event based jobs are often batch, with event triggers used to optimize running only when needed (most common).
True realtime reqs are concentrated more in certain industries - finance, logistics, user-facing analytics (and I'm sure others I missed).
#dataBS change my mind
Claude can adjust to tone so much better (even with equal context). I'm a convert
Modularity - so you can refactor one piece at a time
But the point here is that OSS is used in a POC, not a prod deployment, is that what you're saying?
A convert, we love to see it
100000% agree. OSS is the best way to POC
But using it in prod as the end all be all is a different story
I'm disappointed no one yet has said the on prem server room