We now have datatoad's mdbook section on joins. It covers the planning of joins of abstract directed relations, and their execution by a mix of binary and worst-case optimal join stages.
It's pretty tight, but I don't think it's missing much except examples.
www.frankmcsherry.org/datatoad/cha...
Posts by Frank McSherry
Creating demos is definitely a thing it has made easier. And mini benchmarks to find the moment where a point becomes true (or .. not if never). I'll certainly be using it for this, though .. I have had to lean on it a few times to make sure the comparisons were fair.
As part of an attempt to get Claude to help me communicate better, we have .. a Claude-assisted mdbook for datatoad:
www.frankmcsherry.org/datatoad/int...
Still a work in progress, but it came together much faster with help.
Totally innocent question: I've found Claude has made me rather productive, at a rate that outpaces what I can write about solo. I'm trying to figure out if there is value/space for an intentionally "slop" line of posts that explains work that has been done, minus the alleged charm and certain salt.
The classic example of this is if you build an index with a key that has type TEXT (in PostgreSQL, say), but you need an index of type VARCHAR. Not the same type, but an identical physical representation, which is all we need once we've done method resolution.
A very nice post out of @materialize.com about "representation types" which remove non-structural distinctions between types as they approach the dataflow rendering layer, unlocking physical optimizations across logically distinct types:
materialize.com/blog/no-clas...
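A minimal sketch of the idea, in Python: map logically distinct types onto shared representation types, and let a physical artifact (an index, say) serve any logical type with the same representation. The table and names here are illustrative, not Materialize's actual scheme.

```python
# Hypothetical mapping from logical SQL types to representation types.
# TEXT and VARCHAR differ logically but share a physical layout.
REPRESENTATION = {
    "TEXT": "bytes",
    "VARCHAR": "bytes",
    "INT4": "i32",
}

def index_reusable(index_type: str, query_type: str) -> bool:
    """An index built for one type serves another if the two types
    share a representation type."""
    return REPRESENTATION[index_type] == REPRESENTATION[query_type]

print(index_reusable("TEXT", "VARCHAR"))  # True: same physical layout
print(index_reusable("TEXT", "INT4"))     # False
```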
Moreover, I hadn't realized that the collection it develops iteratively contains only negative records. Very unorthodox!
Admittedly, its first version was worse, but this version that it got to was great. My version starts from all edges and removes failed proposals, but Claude realized that starting from non-empty collections requires a gadget that does work, and better to start from no rejections and build up.
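The "start from no rejections and build up" framing can be sketched in plain Python (not differential dataflow): iterate a growing rejection set, starting empty, until the proposals it induces stop changing. This is an illustration of the shape of the computation, not the actual DD program.

```python
# Gale-Shapley phrased as a fixpoint over a growing set of rejections.
def stable_matching(prop_prefs, recv_prefs):
    """prop_prefs / recv_prefs: dicts mapping each side to an ordered
    preference list over the other side. Returns {proposer: receiver}."""
    rejected = set()               # (proposer, receiver); starts empty
    while True:
        # Each proposer proposes to their best not-yet-rejected choice.
        proposals = {}
        for p, prefs in prop_prefs.items():
            for r in prefs:
                if (p, r) not in rejected:
                    proposals[p] = r
                    break
        # Each receiver keeps their favorite suitor, rejecting the rest.
        new_rejections = set()
        for r in recv_prefs:
            suitors = [p for p, tgt in proposals.items() if tgt == r]
            if len(suitors) > 1:
                best = min(suitors, key=recv_prefs[r].index)
                new_rejections |= {(p, r) for p in suitors if p != best}
        if not new_rejections:
            return proposals       # fixpoint: no new rejections
        rejected |= new_rejections

pairs = stable_matching(
    {"a": ["x", "y"], "b": ["x", "y"]},
    {"x": ["b", "a"], "y": ["a", "b"]},
)
print(pairs)  # {'a': 'y', 'b': 'x'}
```

Starting from the empty rejection set means the first iterate is already consistent, which is what makes the incremental version pleasant: every round only adds rejections, so the collection grows monotonically to its fixpoint.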
A program that performs Gale-Shapley using a small language for differential dataflow.
Fun experiment du jour: can Claude write stable matching in differential dataflow using a simplified language that hides the details of DD? Yes, it turns out! Correct out of the box, and with a bit of nudging it got something better than I have written:
A graph, with edges and differently colored operators, describing their function.
Here's an example, of label propagation on a graph. Salmon is feedback operators, purple are arrangements. The teal boundary is of an iterative scope.
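For reference, the computation that dataflow performs can be sketched in a few lines of plain Python: each node repeatedly adopts the smallest label among itself and its neighbors, iterated to fixpoint. (A stand-in for the timely/differential version; the loop plays the role of the iterative scope, and the `changed` flag plays the role of feedback.)

```python
# Label propagation: converge each node's label to the minimum
# label reachable in its connected component.
def label_propagation(nodes, edges):
    labels = {n: n for n in nodes}        # start with own id as label
    changed = True
    while changed:                        # the "iterative scope"
        changed = False
        for a, b in edges:
            low = min(labels[a], labels[b])
            for n in (a, b):
                if labels[n] > low:
                    labels[n] = low
                    changed = True        # "feedback": go around again
    return labels

print(label_propagation([1, 2, 3, 4], [(1, 2), (2, 3)]))
# {1: 1, 2: 1, 3: 1, 4: 4}
```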
Now with arrangement sizes, and due to go out with the next release of the crate.
With Claude, and a few lines of Rust, you can add a live timely dataflow visualizer to your programs, giving you visibility into where the time goes, and soon (surely) arrangement record counts as well!
The headline numbers are about ticking dataflows comprising 1,000,000 operators at ~4ms / iteration, rather than 350ms /iteration. If that sounds like a lot of dataflow operators, have I got news for you about SQL. :D
But also, it's an extreme number to highlight the problem, and distinct solution.
A new blog post about accelerating timely dataflow by 100x. Part of the story is how timely is able to do this, and .. seemingly .. other stream processors are not. It comes down to a fundamentally different approach to tracking time: globally rather than locally.
github.com/frankmcsherr...
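A much-simplified sketch of the "global" flavor of progress tracking: workers apply (timestamp, delta) updates to one shared multiset of outstanding work, and the frontier is the least timestamp with a positive count. Timely's actual protocol is per-(operator, port) pointstamps and far more involved; this only illustrates global tracking versus each operator maintaining its own watermark.

```python
from collections import Counter

class Frontier:
    """Toy global progress tracker: a multiset of outstanding timestamps."""
    def __init__(self):
        self.counts = Counter()        # timestamp -> outstanding items

    def update(self, changes):
        """Apply a batch of (timestamp, delta) changes."""
        for time, delta in changes:
            self.counts[time] += delta
            if self.counts[time] == 0:
                del self.counts[time]

    def frontier(self):
        """Least timestamp that might still produce data, or None."""
        return min(self.counts) if self.counts else None

f = Frontier()
f.update([(0, 2), (1, 1)])    # two items at time 0, one at time 1
print(f.frontier())           # 0
f.update([(0, -2)])           # time-0 work retired
print(f.frontier())           # 1
```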
Totally. It's not hyperbole at all. I feel like a whole team of principal engineers. Tbd on whether I produce work like them, or just feel that way, but .. yeah. :D
Claude declines to say that there is no planned biomass mulcher.
This was not as reassuring as I hoped.
Re-reading it, while Claude did find typos in the post, it didn't notice the grammatical error in the first sentence (racing to fix it, in case you don't see it either by the time you read the post).
Next step is to tell it that grammatical errors count as typos, because how could I make mistakes?
I used Claude for the first time this month, having only ever used copilot before (and .. stopped immediately afterwards). It turns out .. I am maybe obsolete? We improved `columnar` together, and each wrote a post about it. (linked within my post).
A recent post by @dov.dev about Materialize's new Iceberg sink, and generally about connecting live operational data to analytics stores. Some interesting detail about (overcome) streaming-batch friction, and turning pristine CDC streams into .. Iceberg! materialize.com/blog/making-...
Four roles, in the colors of FFXIV.
Worlds collide: pretty sure someone at the MTA plays FFXIV:
I wrote a tiny vectorized interpreter last night, and .. thought I'd write up a post describing it. Nothing earth-shattering here, but if you haven't seen one of these before it could be interesting! Probably easier to read than the J Incunabulum, but also less .. wow.
github.com/frankmcsherr...
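If you haven't seen one: a toy vectorized interpreter applies each instruction to whole vectors at once, so interpreter dispatch overhead is paid per instruction rather than per element. This sketch is illustrative, not the linked code.

```python
# A toy vectorized interpreter: ops act on entire vectors.
def run(program, env):
    """program: list of (op, dst, a, b); env: dict of named vectors."""
    for op, dst, a, b in program:
        x, y = env[a], env[b]
        if op == "add":
            env[dst] = [u + v for u, v in zip(x, y)]
        elif op == "mul":
            env[dst] = [u * v for u, v in zip(x, y)]
        else:
            raise ValueError(f"unknown op: {op}")
    return env

env = run(
    [("mul", "t", "xs", "xs"),    # t = xs * xs
     ("add", "out", "t", "ys")],  # out = t + ys
    {"xs": [1, 2, 3], "ys": [10, 20, 30]},
)
print(env["out"])  # [11, 24, 39]
```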
We have a recent @materialize.com post from Jan Teske about how MZ's self-correcting materialized views work. It's imo a very cool thing that unpacks some of the magic about how this could possibly work as the underlying system evolves.
materialize.com/blog/self-co...
As examples: 1. path queries, which follow edges repeatedly, are recursive but acyclic; 2. triangle queries, which find triples where each pair has an edge, are cyclic but not recursive.
I don't have super-clear words to describe acyclicity, but this might help: pages.cs.wisc.edu/~paris/cs784...
Ah so the good news is that in this context "cyclic" and "recursive" are different, and recursive queries can be fine, unless they are cyclic. Recursive-but-acyclic queries should still work out great with the demand transform.
The transform is also very similar to Yannakakis's algorithm, which also restricts data down to those values that could possibly produce outputs. Except it also applies to recursive queries. Also like Yannakakis's algorithm, it can get tripped up by cyclic (distinct from recursive) join patterns.
The transform can be incredibly powerful, especially in declarative languages where you are taught to state everything, even if you only need some of it. Many bottom-up engines will produce all of the facts of all relations you define, even if you could have done less to get specific outputs.
I wrote a bit about the "demand transform" for Datalog (and other similar languages): github.com/frankmcsherr...
The idea of the transform is that rather than eagerly produce all the data you might ever need, you can start from the seed crystals of concrete values and data that might yield output.
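Concretely, for reachability: instead of deriving all facts bottom-up, seed the computation with the values actually demanded and derive only the facts those seeds can reach. A magic-sets-flavored sketch in Python; the relation names are illustrative.

```python
# Full bottom-up evaluation: derive every reach(x, y) fact.
def reach_all(edges):
    facts = set(edges)
    while True:
        new = {(x, z) for (x, y) in facts for (y2, z) in edges if y == y2}
        if new <= facts:
            return facts
        facts |= new

# Demand-restricted evaluation: only facts rooted at demanded start nodes.
def reach_demanded(edges, demand):
    facts = {(x, y) for (x, y) in edges if x in demand}
    while True:
        new = {(x, z) for (x, y) in facts for (y2, z) in edges if y == y2}
        if new <= facts:
            return facts
        facts |= new

edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
print(len(reach_all(edges)))            # 7 facts
print(len(reach_demanded(edges, {1})))  # 3 facts: what node 1 reaches
```

The gap between the two grows quickly with the data: the demanded version does work proportional to the facts you asked about, not the facts you happened to define.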
A Datalog program for determining iterates of the Collatz conjecture.
If anyone ever told you Datalog is elegant, they probably haven't used datatoad yet. The evenness condition is an interesting example of relation-rather-than-function, though!
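For contrast with the Datalog version: the same computation in plain Python, with the iterates produced as a fixpoint over a "reaches" relation seeded at n. (Illustrative, not the datatoad program.)

```python
def collatz_iterates(n):
    """All values reached from n under the Collatz map, as a set."""
    seen = {n}
    frontier = {n}
    while frontier:
        # The evenness test is just a branch here; in Datalog it
        # becomes a relation you join against.
        nxt = {x // 2 if x % 2 == 0 else 3 * x + 1 for x in frontier}
        frontier = nxt - seen
        seen |= frontier
    return seen

print(sorted(collatz_iterates(6)))  # [1, 2, 3, 4, 5, 6, 8, 10, 16]
```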
Ooo, also we got a version of the Polonius rules (Rust borrowing) running and validated as well. It's in the same ballpark as the best four-worker configurations, but it does seem to be a bit slower than datafrog on the "naive" Polonius rules from seven years ago.
One interesting-to-me observation is that the M2 seems faster than the EPYC 7543, at least on four cores. About 3:2 faster, so some amount of the datatoad "speed-up" is potentially attributable to this.