We now have datatoad's mdbook section on joins. It covers the planning of joins of abstract directed relations, and their execution by a mix of binary and worst-case optimal join stages.
It's pretty tight, but I don't think it's missing much except examples.
www.frankmcsherry.org/datatoad/cha...
Posts by Frank McSherry
Creating demos is definitely a thing it has made easier. And mini benchmarks to find the moment where a point becomes true (or .. not if never). I'll certainly be using it for this, though .. I have had to lean on it a few times to make sure the comparisons were fair.
As part of an attempt to get Claude to help me communicate better, we have .. a Claude-assisted mdbook for datatoad:
www.frankmcsherry.org/datatoad/int...
Still a work in progress, but it came together much faster with help.
Totally innocent question: I've found Claude has made me rather productive, at a rate that outpaces what I can write about solo. I'm trying to figure out if there is value/space for an intentionally "slop" line of posts that explains work that has been done, minus the alleged charm and certain salt.
The classic example of this is if you build an index with a key that has type TEXT (in PostgreSQL, say), but you need an index of type VARCHAR. Not the same type, but an identical physical representation, which is all we need once we've done method resolution.
A very nice post out of @materialize.com about "representation types" which remove non-structural distinctions between types as they approach the dataflow rendering layer, unlocking physical optimizations across logically distinct types:
materialize.com/blog/no-clas...
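A minimal sketch of the idea, in Python: map logically distinct types onto shared representation types, and let a physical artifact (an index, say) serve any logical type with the same representation. The table and names here are illustrative, not Materialize's actual scheme.

```python
# Hypothetical mapping from logical SQL types to representation types.
# TEXT and VARCHAR differ logically but share a physical layout.
REPRESENTATION = {
    "TEXT": "bytes",
    "VARCHAR": "bytes",
    "INT4": "i32",
}

def index_reusable(index_type: str, query_type: str) -> bool:
    """An index built for one type serves another if the two types
    share a representation type."""
    return REPRESENTATION[index_type] == REPRESENTATION[query_type]

print(index_reusable("TEXT", "VARCHAR"))  # True: same physical layout
print(index_reusable("TEXT", "INT4"))     # False
```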
Moreover, I hadn't realized that the collection it develops iteratively contains only negative records. Very unorthodox!
Admittedly, its first version was worse, but this version that it got to was great. My version starts from all edges and removes failed proposals, but Claude realized that starting from non-empty collections requires a gadget that does work, and better to start from no rejections and build up.
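The "start from no rejections and build up" framing can be sketched in plain Python (not differential dataflow): iterate a growing rejection set, starting empty, until the proposals it induces stop changing. This is an illustration of the shape of the computation, not the actual DD program.

```python
# Gale-Shapley phrased as a fixpoint over a growing set of rejections.
def stable_matching(prop_prefs, recv_prefs):
    """prop_prefs / recv_prefs: dicts mapping each side to an ordered
    preference list over the other side. Returns {proposer: receiver}."""
    rejected = set()               # (proposer, receiver); starts empty
    while True:
        # Each proposer proposes to their best not-yet-rejected choice.
        proposals = {}
        for p, prefs in prop_prefs.items():
            for r in prefs:
                if (p, r) not in rejected:
                    proposals[p] = r
                    break
        # Each receiver keeps their favorite suitor, rejecting the rest.
        new_rejections = set()
        for r in recv_prefs:
            suitors = [p for p, tgt in proposals.items() if tgt == r]
            if len(suitors) > 1:
                best = min(suitors, key=recv_prefs[r].index)
                new_rejections |= {(p, r) for p in suitors if p != best}
        if not new_rejections:
            return proposals       # fixpoint: no new rejections
        rejected |= new_rejections

pairs = stable_matching(
    {"a": ["x", "y"], "b": ["x", "y"]},
    {"x": ["b", "a"], "y": ["a", "b"]},
)
print(pairs)  # {'a': 'y', 'b': 'x'}
```

Starting from the empty rejection set means the first iterate is already consistent, which is what makes the incremental version pleasant: every round only adds rejections, so the collection grows monotonically to its fixpoint.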
A program that performs Gale-Shapley using a small language for differential dataflow.
Fun experiment du jour: can Claude write stable matching in differential dataflow using a simplified language that hides the details of DD? Yes, it turns out! Correct out of the box, and with a bit of nudging it got something better than I have written:
A graph, with edges and differently colored operators, describing their function.
Here's an example, of label propagation on a graph. Salmon is feedback operators, purple are arrangements. The teal boundary is of an iterative scope.
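For reference, the computation that dataflow performs can be sketched in a few lines of plain Python: each node repeatedly adopts the smallest label among itself and its neighbors, iterated to fixpoint. (A stand-in for the timely/differential version; the loop plays the role of the iterative scope, and the `changed` flag plays the role of feedback.)

```python
# Label propagation: converge each node's label to the minimum
# label reachable in its connected component.
def label_propagation(nodes, edges):
    labels = {n: n for n in nodes}        # start with own id as label
    changed = True
    while changed:                        # the "iterative scope"
        changed = False
        for a, b in edges:
            low = min(labels[a], labels[b])
            for n in (a, b):
                if labels[n] > low:
                    labels[n] = low
                    changed = True        # "feedback": go around again
    return labels

print(label_propagation([1, 2, 3, 4], [(1, 2), (2, 3)]))
# {1: 1, 2: 1, 3: 1, 4: 4}
```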
Now with arrangement sizes, and due to go out with the next release of the crate.
With Claude, and a few lines of Rust, you can add a live timely dataflow visualizer to your programs, giving you visibility into where the time goes, and soon (surely) arrangement record counts as well!
The headline numbers are about ticking dataflows comprising 1,000,000 operators at ~4ms / iteration, rather than 350ms /iteration. If that sounds like a lot of dataflow operators, have I got news for you about SQL. :D
But also, it's an extreme number to highlight the problem, and distinct solution.
A new blog post about accelerating timely dataflow by 100x. Part of the story is how timely is able to do this, and .. seemingly .. other stream processors are not. It comes down to a fundamentally different approach to tracking time: globally rather than locally.
github.com/frankmcsherr...
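A much-simplified sketch of the "global" flavor of progress tracking: workers apply (timestamp, delta) updates to one shared multiset of outstanding work, and the frontier is the least timestamp with a positive count. Timely's actual protocol is per-(operator, port) pointstamps and far more involved; this only illustrates global tracking versus each operator maintaining its own watermark.

```python
from collections import Counter

class Frontier:
    """Toy global progress tracker: a multiset of outstanding timestamps."""
    def __init__(self):
        self.counts = Counter()        # timestamp -> outstanding items

    def update(self, changes):
        """Apply a batch of (timestamp, delta) changes."""
        for time, delta in changes:
            self.counts[time] += delta
            if self.counts[time] == 0:
                del self.counts[time]

    def frontier(self):
        """Least timestamp that might still produce data, or None."""
        return min(self.counts) if self.counts else None

f = Frontier()
f.update([(0, 2), (1, 1)])    # two items at time 0, one at time 1
print(f.frontier())           # 0
f.update([(0, -2)])           # time-0 work retired
print(f.frontier())           # 1
```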
Totally. It's not hyperbole at all. I feel like a whole team of principal engineers. Tbd on whether I produce work like them, or just feel that way, but .. yeah. :D
Claude declines to say that there is no planned biomass mulcher.
This was not as reassuring as I hoped.
Re-reading it, while Claude did find typos in the post, it didn't notice the grammatical error in the first sentence (racing to fix it, in case you don't see it either by the time you read the post).
Next step is to tell it that grammatical errors count as typos, because how could I make mistakes?
I used Claude for the first time this month, having only ever used copilot before (and .. stopped immediately afterwards). It turns out .. I am maybe obsolete? We improved `columnar` together, and each wrote a post about it. (linked within my post).
A recent post by @dov.dev about Materialize's new Iceberg sink, and generally about connecting live operational data to analytics stores. Some interesting detail about (overcome) streaming-batch friction, and turning pristine CDC streams into .. Iceberg! materialize.com/blog/making-...
Four roles, in the colors of FFXIV.
Worlds collide: pretty sure someone at the MTA plays FFXIV:
I wrote a tiny vectorized interpreter last night, and .. thought I'd write up a post describing it. Nothing earth-shattering here, but if you haven't seen one of these before it could be interesting! Probably easier to read than the J Incunabulum, but also less .. wow.
github.com/frankmcsherr...
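If you haven't seen one: a toy vectorized interpreter applies each instruction to whole vectors at once, so interpreter dispatch overhead is paid per instruction rather than per element. This sketch is illustrative, not the linked code.

```python
# A toy vectorized interpreter: ops act on entire vectors.
def run(program, env):
    """program: list of (op, dst, a, b); env: dict of named vectors."""
    for op, dst, a, b in program:
        x, y = env[a], env[b]
        if op == "add":
            env[dst] = [u + v for u, v in zip(x, y)]
        elif op == "mul":
            env[dst] = [u * v for u, v in zip(x, y)]
        else:
            raise ValueError(f"unknown op: {op}")
    return env

env = run(
    [("mul", "t", "xs", "xs"),    # t = xs * xs
     ("add", "out", "t", "ys")],  # out = t + ys
    {"xs": [1, 2, 3], "ys": [10, 20, 30]},
)
print(env["out"])  # [11, 24, 39]
```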
We have a recent @materialize.com post from Jan Teske about how MZ's self-correcting materialized views work. It's imo a very cool thing that unpacks some of the magic about how this could possibly work as the underlying system evolves.
materialize.com/blog/self-co...
As examples: 1. path queries, which follow edges repeatedly, are recursive but acyclic; 2. triangle queries, which find triples where each pair has an edge, are cyclic but not recursive.
I don't have super-clear words to describe acyclicity, but this might help: pages.cs.wisc.edu/~paris/cs784...
Ah so the good news is that in this context "cyclic" and "recursive" are different, and recursive queries can be fine, unless they are cyclic. Recursive-but-acyclic queries should still work out great with the demand transform.
The transform is also very similar to Yannakakis's algorithm, which also restricts data down to those values that could possibly produce outputs. Except it also applies to recursive queries. Also like Yannakakis's algorithm, it can get tripped up by cyclic (distinct from recursive) join patterns.
The transform can be incredibly powerful, especially in declarative languages where you are taught to state everything, even if you only need some of it. Many bottom-up engines will produce all of the facts of all relations you define, even if you could have done less to get specific outputs.
I wrote a bit about the "demand transform" for Datalog (and other similar languages): github.com/frankmcsherr...
The idea of the transform is that rather than eagerly produce all the data you might ever need, you can start from the seed crystals of concrete values and data that might yield output.
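Concretely, for reachability: instead of deriving all facts bottom-up, seed the computation with the values actually demanded and derive only the facts those seeds can reach. A magic-sets-flavored sketch in Python; the relation names are illustrative.

```python
# Full bottom-up evaluation: derive every reach(x, y) fact.
def reach_all(edges):
    facts = set(edges)
    while True:
        new = {(x, z) for (x, y) in facts for (y2, z) in edges if y == y2}
        if new <= facts:
            return facts
        facts |= new

# Demand-restricted evaluation: only facts rooted at demanded start nodes.
def reach_demanded(edges, demand):
    facts = {(x, y) for (x, y) in edges if x in demand}
    while True:
        new = {(x, z) for (x, y) in facts for (y2, z) in edges if y == y2}
        if new <= facts:
            return facts
        facts |= new

edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
print(len(reach_all(edges)))            # 7 facts
print(len(reach_demanded(edges, {1})))  # 3 facts: what node 1 reaches
```

The gap between the two grows quickly with the data: the demanded version does work proportional to the facts you asked about, not the facts you happened to define.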
A Datalog program for determining iterates of the Collatz conjecture.
If anyone ever told you Datalog is elegant, they probably haven't used datatoad yet. The evenness condition is an interesting example of relation-rather-than-function, though!
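For contrast with the Datalog version: the same computation in plain Python, with the iterates produced as a fixpoint over a "reaches" relation seeded at n. (Illustrative, not the datatoad program.)

```python
def collatz_iterates(n):
    """All values reached from n under the Collatz map, as a set."""
    seen = {n}
    frontier = {n}
    while frontier:
        # The evenness test is just a branch here; in Datalog it
        # becomes a relation you join against.
        nxt = {x // 2 if x % 2 == 0 else 3 * x + 1 for x in frontier}
        frontier = nxt - seen
        seen |= frontier
    return seen

print(sorted(collatz_iterates(6)))  # [1, 2, 3, 4, 5, 6, 8, 10, 16]
```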
Ooo, also we got a version of the Polonius rules (Rust borrowing) running and validated as well. It's in the same ballpark as the best four-worker configurations, but it does seem to be a bit slower than datafrog on the "naive" Polonius rules from seven years ago.
One interesting-to-me observation is that the M2 seems faster than the EPYC 7543, at least on four cores. About 3:2 faster, so some amount of the datatoad "speed-up" is potentially attributable to this.