There’s still room for a few more at the #ISC25 OpenCHAMI tutorial tomorrow. Come join us and learn why the community is growing so fast! #HPC
app.swapcard.com/event/isc-hi...
Posts by Alex Lovell-Troy
Want to move beyond xCAT with a provisioner that’s ready for Confidential Computing?
I’ll be at #ISC25 this week talking about OpenCHAMI.
Free and Open Source with a growing community.
openchami.org
I’ve been working on OpenCHAMI for a couple of years now. This is an exciting step for the community!
This is the worst thing I can tell you about Japan.
I spent a week in Japan with wet hands before anyone told me that I needed to carry my own hand towel. Apparently they’ve been pulling this shit on foreigners for centuries.
Inside of you there are two wolves. Inside each wolf there are zero, one, or two wolves. Write a function to rebalance an arbitrary wolftree B such that it has minimal depth. The function should execute in O(log n) time. Show your work.
That’s really good to know. Thanks!
🌟 Hello BlueSky! 🌟
We’re Honeycomb, the observability platform for teams who manage software that matters. Send any data to our one-of-a-kind data store, solve problems with all the relevant context, and fix issues before your customers find them.
*no lies detected*
Twizzlers are made of braces wax and taste amazing when used as straws for Cherry Coke.
Moving from cloud #SRE to #HPC often means recalibrating what metrics matter.
Time to job launch?
Time to completion?
Mean time to job failure?
Time to snapshot recovery?
Cloud makes node loss a non-event. HPC typically doesn’t work that way.
Isambard 3 Supercomputer: Image Credit: Christy Nunns/University of Bristol
GW4 Isambard 3 #Supercomputer is officially online! 🎉🧠
Part of a collaboration between the universities of Bath, Bristol, Cardiff and Exeter, alongside partners HPE, NVIDIA and Arm, Isambard 3 will push the boundaries of science.
🔗 https://buff.ly/4g7HtMK
I once had to email someone with an important corporate email address.
Last Name: Fuchs
First Initial: E
Inexplicable Extra Letter: X
That’s right, fuchsex was his official email address.
I often wonder why the X.
“It’s like watching someone unlock a padlock on a wrench so they can use it to drive a nail”
Why?
“The padlock doesn’t fit the hammer.”
Every single time I’ve been to Bristol, the weather has been fantastic. Highly recommend.
I spent the Thanksgiving break typing up my notes from #SC24 which I've posted online. 30% more words than my notes from SC23 (sorry!). Feedback is welcome!
https://buff.ly/41fBhho
#HPC
Picture of two slices of bread, one stacked on top of the other crust side, with ham and cheese in between
This is, technically, a sandwich.
Lotsa fake Astronomy photos on Bluesky these days. Just remember, if they’re not credited, it’s not credible.
NASA has the original feed and does a good job of curation.
apod.nasa.gov/apod/
Cool. Do you know of any large systems that use the feature? Does it help improve boot timing or scalability?
As it turns out, when the Thanksgiving pies don’t last all weekend, you’re allowed to make more pie. Who’s going to stop you?
🥧 Maple Pumpkin
🥧 Bourbon Apple
I should write a bittorrent client
Let me guess, high stress but only for a few minutes each day.
High-cardinality exploration is a superpower for SRE. Honeycomb changed so much.
Tail latency has entered the chat!
I'd also recommend looking at these metrics broken down by user/project, and trying to make sure your 1% least reliable subset is still doing ok, or at least getting support, since failures are often not evenly distributed.
I really like this post on the topic: rachelbythebay.com/w/2019/07/15...
I have a half-written blog post about this that I should finish sometime.
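A rough sketch of that per-user breakdown, in case it helps make the idea concrete. Everything here is an assumption for illustration: the job records are made up, and `failure_rate_by_user` and `least_reliable` are hypothetical helper names, not part of any scheduler's API.

```python
from collections import defaultdict


def failure_rate_by_user(jobs):
    """Take (user, succeeded) pairs and return {user: failure_rate}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for user, succeeded in jobs:
        totals[user] += 1
        if not succeeded:
            failures[user] += 1
    return {user: failures[user] / totals[user] for user in totals}


def least_reliable(rates, fraction=0.01):
    """Return the worst `fraction` of users (at least one), worst first."""
    ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
    count = max(1, int(len(ranked) * fraction))
    return ranked[:count]


# Made-up job records: two users with no failures, one user who mostly fails.
jobs = (
    [("alice", True)] * 98
    + [("bob", True)] * 97
    + [("carol", False)] * 4
    + [("carol", True)]
)

rates = failure_rate_by_user(jobs)
overall = sum(1 for _, ok in jobs if not ok) / len(jobs)
worst = least_reliable(rates)

print(f"overall failure rate: {overall:.1%}")
for user, rate in worst:
    print(f"least reliable: {user} at {rate:.1%}")
```

With this toy data the fleet-wide failure rate looks healthy while one user sees most of their jobs fail, which is the uneven distribution the aggregate number hides.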
I haven’t seen an SLO framework broadly adopted in HPC, but some sites adopt metrics like:
- % nodes up
- Scheduler RPC latency
- FS latency and BW
- Performance on standard benchmarks, either after maintenance or weekly
I totally agree. Feels like a good provocative talk for SRECon, especially as cloud SRE folks are being asked to support large training systems for AI.