ClickHouse + S3 is just BigTable + Colossus?
(I've been writing my thesis over the last week and revisiting earlier work is producing random metadata thoughts in my head)
Posts by Ankush Jain
Actually no: LSM SSTables encode (key, offset) to allow value retrieval via one random read. Parquet never clusters its columns. But you could have hybrid layouts (some columns are clustered, others are not) if you really wanted to.
This isn't a particularly timely insight, but there's no difference between an LSM tree level and Parquet!
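A minimal sketch of the (key, offset) idea from the thread above. All names here (`SsBlock`, `from_pairs`) are hypothetical, and an in-memory `Vec<u8>` stands in for an on-disk SSTable block; the point is just that a sparse key-to-offset index lets you fetch one value with one read:

```rust
use std::collections::BTreeMap;

// Hypothetical SSTable-style block: values stored contiguously,
// plus a (key -> offset, len) index for single-read point lookups.
struct SsBlock {
    data: Vec<u8>,                            // stand-in for the on-disk block
    index: BTreeMap<String, (usize, usize)>,  // key -> (offset, length)
}

impl SsBlock {
    fn from_pairs(pairs: &[(&str, &[u8])]) -> Self {
        let mut data = Vec::new();
        let mut index = BTreeMap::new();
        for &(k, v) in pairs {
            index.insert(k.to_string(), (data.len(), v.len()));
            data.extend_from_slice(v);
        }
        SsBlock { data, index }
    }

    // One "random read": consult the index, then slice the data once.
    fn get(&self, key: &str) -> Option<&[u8]> {
        let &(off, len) = self.index.get(key)?;
        Some(&self.data[off..off + len])
    }
}

fn main() {
    let block = SsBlock::from_pairs(&[("a", &b"alpha"[..]), ("b", &b"beta"[..])]);
    assert_eq!(block.get("a"), Some(&b"alpha"[..]));
    assert_eq!(block.get("missing"), None);
}
```

A pure-columnar layout would drop the per-key index and store each column's values together; the "hybrid" in the post would keep an index like this for some columns only.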
Wrote a hot-takes not-particularly-coherent post collating some RDMA ideas that had been accumulating in my head.
ankushja.in/blog/2025/rd...
Will let you know what I find when I get there
I'm coming around to the opinion that the harms from algorithmic feeds go beyond any single phenomenon, regardless of scale/impact. Recommendation algorithms should be regulated as a public health policy, not unlike polio or pandemic prevention.
Random update in the "something working somewhere" department.
I feel that "all models are wrong, some are useful" is a not-great take that has gained traction. We model for predictive value. We tolerate errors for compression and simplicity. The best models sit at some Pareto frontier of complexity where they're not wrong, just approximate.
Yes, that's the second-order thing: Copy types apparently cannot be moved. Should just document these intuitions as a blog post.
Sometimes I think about how a certain way of explaining things would click for me a lot better than the canonical explanation. As a C++ programmer trying to pick up Rust, one of them is: "Rust implements move semantics by default; to copy, you clone manually and the clone gets moved."
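The intuition from the post above, as a tiny sketch (variable names are mine):

```rust
fn main() {
    let s = String::from("hello");
    let t = s; // move by default: `s` is no longer usable after this line
    // println!("{}", s); // would be a compile error: value used after move

    // C++-style copying is explicit: clone first, then the clone gets moved.
    let u = t.clone();
    let v = u;          // moves the clone; `t` is untouched
    assert_eq!(t, v);

    // Copy types (e.g. i32) are the exception: assignment copies implicitly.
    let a = 42;
    let b = a;
    assert_eq!(a, b); // `a` is still valid
}
```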
Surfacing briefly to say that I hate it when segfaults become the mechanism for discovering API assumptions. Rust cannot happen fast enough.
The conditions that have led to what’s happening in the US today exist in democracies around the world.
They are an inevitable outcome of our collective failure to adapt to fundamental changes in the information ecosystem on which our democracies were originally built.
Spent two hours trying to look up one of the three references from a conference review that said "this is well-known and not novel". Citation formatting was pristine but reviewer names or conference names would not line up.
Then I looked up the second ref -- same. Third -- same. Then it struck me.
“unrelated supply chains”
as this from @weisenthal.bsky.social shows, I suspect a lot of supply chains are more related than they seem
One of the many good simulations of the weeks ahead...
After spending a couple of days configuring an "industry-standard tool" that uses JSON as a specification interface, I have come to appreciate tools whose extensibility hooks are standard programming languages. Makes sense why editors with Vimscript/Lua/Lisp stuck around for decades.
Prometheus + Grafana, self-hosted, would be the most standard solution. Grafana has plugins for a bunch of different data sources, including an ODBC plugin that should work with most SQL DBs.
OTel AFAIK is more cumbersome to deal with than Prometheus.
Somewhat removed from reality but increasingly getting annoyed by gmail's line wrapping behavior. Ideally I think it should be around 60 chars for text and 80 for code blocks. OTOH I am making the most of my grad school days by refusing to send or read more than one email/week.
The massive context window is also amazing. I wrote a quick bash utility I call "llmcat" and dumping arbitrary subsets of code into a model is as simple as calling:
`fd -t f cc | llmcat | it2copy`
It copies all files in a "<filename>...</><filecontents>..</>" template... works so well!
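The real llmcat is a bash utility; here is a rough, hypothetical equivalent of the idea in Rust (filter filenames from stdin into a filename/contents template; the exact tag names are an assumption, since the post elides them):

```rust
use std::fs;
use std::io::{self, BufRead};

// Hypothetical template wrapper: tag names are assumed, not the author's exact ones.
fn wrap(name: &str, contents: &str) -> String {
    format!(
        "<filename>{}</filename>\n<filecontents>\n{}\n</filecontents>",
        name, contents
    )
}

// Read one file path per stdin line, print each file wrapped in the template,
// so something like `fd -t f cc | llmcat` dumps a code subset for pasting into a model.
fn main() -> io::Result<()> {
    for line in io::stdin().lock().lines() {
        let path = line?;
        let contents = fs::read_to_string(&path)?;
        println!("{}", wrap(&path, &contents));
    }
    Ok(())
}
```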
So impressed by Gemini Pro... fixed a pointer lifecycle bug by inspecting memory pointers in lldb under the model's supervision. It flagged buffer reuse by spotting a pattern between some random input and buffer contents, and we worked backwards from there.
Apparently screen time in kids is highly correlated with other socioeconomic markers, and has gotten significantly worse for lower-income groups over time.
Ok my armchair optimism here is that napkins eventually make it to the Moleskine to keep journaling alive using some convoluted scheme, rather than doing away with it entirely.
I will get back to you once I work out what this means.
As an aside, wonder if we could co-design parallel filesystems, tiered caches, and MVCC.
Modifications to claimed objects could either be through the namespace, or you could obtain a lock over them, do whatever, and let the namespace know when you release the lock.
Wonder if you could do a parallel filesystem this way -- the application creates objects in the object store, and the list of created objects is bulk-appended to a namespace asynchronously. Unclaimed objects get garbage-collected after, say, 48 hours of disuse.
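A toy sketch of the claim-then-GC idea from the thread above. Everything here (`Namespace`, `ObjectStore`, the claim map) is hypothetical structure I'm inventing to illustrate the protocol; the 48-hour grace period from the post is simulated by advancing a fake "now":

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical namespace: records which object ids have been claimed
// (i.e. bulk-appended from the object store), and when.
struct Namespace {
    claimed: HashMap<String, Instant>,
}

// Hypothetical object store: ids with creation times.
struct ObjectStore {
    objects: HashMap<String, Instant>,
}

impl ObjectStore {
    // Drop objects that are unclaimed and older than the grace period.
    fn gc_unclaimed(&mut self, ns: &Namespace, grace: Duration, now: Instant) {
        self.objects.retain(|id, created| {
            ns.claimed.contains_key(id) || now.duration_since(*created) < grace
        });
    }
}

fn main() {
    let t0 = Instant::now();
    let mut store = ObjectStore { objects: HashMap::new() };
    store.objects.insert("claimed-obj".into(), t0);
    store.objects.insert("orphan-obj".into(), t0);

    let mut ns = Namespace { claimed: HashMap::new() };
    ns.claimed.insert("claimed-obj".into(), t0);

    // Simulate the grace period (48h in the post) having elapsed.
    let grace = Duration::from_secs(48 * 3600);
    store.gc_unclaimed(&ns, grace, t0 + grace * 2);

    assert!(store.objects.contains_key("claimed-obj"));
    assert!(!store.objects.contains_key("orphan-obj"));
}
```

Locked modification (the lock-then-notify path in the post) would just be another entry type in the namespace alongside the claim map.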
Morning update: made sure to pack a drill and two chucks to fix the lab espresso machine but did not pack my laptop
Looks like Deepseek had a paper at SC with a bunch of details: dl.acm.org/doi/pdf/10.1...
They use (used to use?) their own collective acceleration instead of NCCL, have a section on whether one should pay for NVLink, and describe some incast mitigations in 3FS, + a bunch of other things.
Hmm looks like some mix of "things are going to be rough what'd you expect" and "some frontloaded something is messing with the forecast"
This can't be real?? I mean it probably is but it can't be?