"You need a vector database" is the default advice, but that complexity means new infrastructure, new formats, and data coordination overhead. What if tuning Parquet page sizes and adding lightweight footer metadata gave you vector search without any of that? blog.xiangpeng.systems/posts/vector...
Posts by Data Elixir
Most data bugs aren't algorithm failures, they're assumption violations upstream. Pointblank makes validation a first-class citizen in R/Python workflows instead of an afterthought scattered in assert statements.
Plot twist: You can have Normal residuals even when Y isn't Normal. And non-Normal residuals when Y actually is Normal. Model misspecification shows up everywhere except where you expect it.
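The first half of that plot twist is easy to simulate: a skewed predictor makes Y non-Normal even though the model is correctly specified and the residuals are perfectly Normal. A minimal NumPy sketch (all data invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed predictor makes Y non-Normal, but the error term is Normal,
# so residuals from a correctly specified linear fit come out Normal too.
x = rng.exponential(scale=1.0, size=5000)
eps = rng.normal(scale=0.5, size=5000)
y = 2.0 * x + eps  # Y inherits the skew of x

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print(round(slope, 2))  # close to the true slope of 2.0
```

Here Y is strongly right-skewed while the residuals are symmetric around zero: normality assumptions live in the errors, not in Y itself.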
Here's what they don't tell you about topic modeling: preprocessing determines everything. Corpus homogeneity, document length, and context windows matter more than algorithm choice. BERTopic won't fix bad inputs. No model will.
Indexes speed up reads but slow writes. The real insight: indexes only help when you're returning <15-20% of rows. Beyond that, sequential scans win. Space and memory costs matter too: B-tree indexes often exceed the size of the table itself.
What if a model’s prediction was literally an eigenvalue? This post kicks off a thoughtful series exploring spectrum-based models as a middle ground between linear models and neural nets, with interpretability and robustness baked in. A fun, slightly off-the-beaten-path ML read.
Data To Art is a new curated gallery transforming datasets into visual storytelling. Latest addition: Alisa Singer's Environmental Graphiti turns climate science into vibrant data-driven art. The line between science and abstraction is thinner than you think.
ClickHouse solved Advent of Code 2025 puzzles using single ClickHouse queries. No UDFs, no temp tables, no preprocessing. Just pure SQL doing things SQL probably shouldn't do. Impressive and slightly cursed in the best way.
If you're wondering where database tech is really headed in 2026, Andy Pavlo's annual review cuts through the noise. Sharding, Postgres evolution, and hot takes you won't find in vendor blogs.
MarkItDown (Microsoft) converts a variety of file types to Markdown optimized for LLMs. Handles structure, not just text. Pairs well with Kreuzberg for extraction (supports 50+ file formats!). If you're building RAG systems, these belong in your stack.
jax-js compiles NumPy-style array code into WebAssembly and WebGPU kernels that run entirely client-side. No server, no dependencies, just JAX's programming model in your browser. This changes what's possible for interactive ML demos.
Jan Van Haaren's 2025 soccer analytics review is a solid reference if you're doing applied ML in sports or want to see how domain experts handle sequential data. Covers spatio-temporal models, graph methods, Bayesian forecasting, and tracking data metrics. janvanhaaren.be/posts/soccer...
ML progress isn't driven by elegant theory. It's benchmarks, leaderboards, and engineering culture. In this post, Ben Recht explores why empirical testing beats clean math in practice and why that tension defines the field.
The paradox: show a complex graph and execs check their phones. Show a simple one and they demand endless breakdowns. The solution isn't more or less data. It's recentering discussions on the actual decision at hand. methodmatters.github.io/true-stories...
Your browser can run Python (Pyodide), execute OCR on PDFs, crop videos, and call LLM APIs—all without uploading anything to a server. The localStorage + CORS pattern makes surprisingly powerful tools possible with zero backend infrastructure.
Tired of charts that hide the story? Density plots + percentile intervals reveal what averages can't: full variability, extremes, historical context. Nine examples comparing Spain's temperature data show how geometry choice changes what readers understand.
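The percentile-interval idea is just a few lines before any plotting happens. A minimal NumPy sketch with invented temperature data (not the Spain dataset from the post):

```python
import numpy as np

# Hypothetical daily temperatures; the bands show spread a mean alone hides.
rng = np.random.default_rng(1)
temps = rng.normal(loc=15, scale=6, size=365)

lo, mid, hi = np.percentile(temps, [5, 50, 95])
print(f"median {mid:.1f} C, 90% of days between {lo:.1f} and {hi:.1f} C")
```

Feed `lo`/`hi` to a shaded ribbon and `mid` to a line and you have the "percentile interval" geometry; the density plot is the same data with a KDE instead of quantiles.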
Fisher arbitrarily chose p<0.05 a century ago and we've just... kept it. The problem: calling it "arbitrary" only works if you can suggest something less arbitrary. No one has, so here we are circling p-values close to 0.05 like it means something.
vilgot-huhn.github.io/mywebsite/po...
Haskell for data science? dataHaskell adds dataframes, NSE-style column operations, and compiler optimizations that turn chained operations into single-pass computations. Immutability + strong types + functional composition might be the combo we've been missing. jcarroll.com.au/2025/12/05/h...
Learning SQL is like learning a foreign language: you need to read more variations than you'll actually write. Learn disciplined canonical syntax for your own queries, but understand the messy dialects others use.
Side projects still open more doors than traditional applications in data roles. The trick: build small, interesting things that signal skills and make your work discoverable. Especially relevant given the current job market.
Most time-to-event metrics are broken. Amazon thought customer support wait times were under 1 min. Bezos called in: 10+ minutes. Why? They only measured customers who stayed on hold long enough to be served. The ones who hung up? Never counted. www.counting-stuff.com/why-you-are-...
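The hold-time trap is easy to reproduce in a few lines. A toy simulation (invented rates, not Amazon's data): true waits average ~8 minutes, but if you only log calls answered before the caller gives up, the measured average collapses.

```python
import random

random.seed(7)

true_waits = [random.expovariate(1 / 8.0) for _ in range(100_000)]  # mean ~8 min
patience = [random.expovariate(1 / 2.0) for _ in range(100_000)]    # callers bail fast

# Only calls that outlast the caller's patience ever get measured.
served = [w for w, p in zip(true_waits, patience) if w <= p]

print(f"true mean wait:     {sum(true_waits) / len(true_waits):.1f} min")
print(f"measured mean wait: {sum(served) / len(served):.1f} min")
```

The measured mean lands well under 2 minutes while the true mean is ~8: censoring the impatient callers doesn't just bias the metric, it inverts the conclusion.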
Hot take: Your analysts doing "some basic data engineering" is killing your analytics function. The MTA hired 5 dedicated data engineers and it unlocked everything else. Stop asking data scientists to maintain pipelines. www.mta.info/article/less...
Sometimes the best Polars pattern is knowing when to exit the DataFrame. partition_by() splits data into a dict of frames, letting you process with list comprehensions. Cleaner than forcing everything through map_groups() when further wrangling isn't needed.
Most devs treat AI coding agents like infinite context machines. Reality: a 200k token window fills fast. The /compact feature is a trap. Better approach: /clear + document state in markdown, then resume. Treat context like disk space. You need a cleanup strategy.
Most modern dimensionality reduction (t-SNE, UMAP, Isomap) shares a pattern: represent data as a graph capturing local similarity, then embed to preserve that structure. It's graphs all the way down.
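That shared first step, "represent data as a graph capturing local similarity", is just a k-nearest-neighbor graph. A minimal NumPy sketch of it (brute-force distances; real libraries use approximate neighbors):

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetric k-nearest-neighbor adjacency matrix: the common first
    stage of t-SNE-, UMAP-, and Isomap-style pipelines."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # no self-edges
    nn = np.argsort(d, axis=1)[:, :k]     # indices of the k closest points
    A = np.zeros((len(X), len(X)), dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = True
    return A | A.T                        # symmetrize the "local similarity" graph

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
A = knn_graph(X, k=3)
print(A.shape, int(A.sum()))
```

Where the methods differ is entirely in what they do next: Isomap runs shortest paths on this graph, t-SNE and UMAP turn edges into probabilities and optimize an embedding against them.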
I curated some readings for class on "data tensions" and the list felt worth sharing. Come on a tour of datasets, books, the web, and AI with me...
We'll start with this piece on the Google Books project: the hopes, dreams, disasters, and aftermath of building a public library on the internet.
1/n
Everyone's rushing to pgvector for "simple" vector search in Postgres. This reality check shows what actually happens at scale: indexing nightmares and performance walls. Simple isn't always sustainable in production.
When healthcare becomes algorithmic, what gets optimized out? This Guardian essay asks the hard question about AI spreading through diagnostics and therapy: are we trading care quality for efficiency without realizing the cost?
Thinking Machines Lab solved a problem everyone accepted as unsolvable: LLM nondeterminism at temperature 0. Same prompt, same model, 1000 runs → 80 different outputs. With batch-invariant kernels? Bitwise identical every time. Open sourced. www.distributedthoughts.org/will-i-make-...
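The root cause is worth seeing in two lines: floating-point addition is not associative, so when batch size or thread scheduling changes the order a reduction runs in, the result drifts, with no randomness anywhere. A tiny illustration (plain Python, not the lab's kernels):

```python
# Floating-point addition is not associative:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False

# Same four numbers, two accumulation orders, two different sums:
xs = [1e16, 1.0, -1e16, 1.0]
print(sum(xs))           # 1.0  (the 1.0 added to 1e16 is absorbed)
print(sum(sorted(xs)))   # 0.0  (both 1.0s absorbed before cancelling)
```

Batch-invariant kernels fix the accumulation order regardless of how requests are batched, which is why the outputs become bitwise identical.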
Most marketplaces have SKUs. Etsy has 100M+ unique items with no standard attributes. How do you build filters when one listing is a "porcelain sculpture that looks like a t-shirt" and dimensions live in random photo text? www.etsy.com/codeascraft/...