I want bare metal instances that can launch within 2-3 seconds, for a better (local dev <-> remote execution) REPL workflow
vs. fast-launching containers (e.g., Cloud Run), bare metal gives me more reliable benchmark measurements and the ability to bring tools like Nsight
Posts by Ashton Six
i've found some success sticking to SIMD-friendly scalar patterns
i get loop order right (polyhedral analysis), add hints like `#pragma omp simd`, and compile at -O3: that's _usually_ enough. you can check the output with -S (emits readable assembly)
or use SIMDe, that works too
This feels like a continuation of the reduction operators introduced with Hopper's TMA (cp.reduce.async.bulk). Fun fact! Data movement often dominates power usage vs compute because of physics: longer wires mean more capacitance to charge, so each bit costs more energy to transmit. Makes a lot of sense to optimise here.
Mmm! Nice corollary: software optimisations for prefix sums (re-parenthesizing) generalise across associative ops: +, ^, prefix-of-prefix.
I made a thread about it: bsky.app/profile/asht...
Full write-up, implementation (NEON) and benchmark results (Graviton4) here: github.com/ashtonsix/pe...
I love solving these kinds of performance puzzles—and I'm currently available for hire! Reach out if interested 😊. 3/3
The ILP trick:
# Local prefix sums
out[0..3] = prefix(in[0..3])
out[4..7] = prefix(in[4..7])
...
# Late carry broadcast (redundant compute)
out[4..7] += out[3];
out[8..11] += out[7];
...
By delaying the carry we allow the CPU to compute all local prefix sums in parallel, >doubling throughput. 2/
I got SOTA (L1-hot, SIMD) on prefix sum by ADDING instructions (7.7 GB/s → 19.8 GB/s). Consider:
for i = 0..n: out[i] = out[i-1] + in[i]   # out[-1] taken as 0
This SUCKS, because out[i] must wait on out[i-1]. There's an unbroken dependency chain, which prevents Instruction-Level Parallelism (ILP). 1/