Posts by Alex Lovell-Troy

There’s still room for a few more at the #ISC25 OpenCHAMI tutorial tomorrow. Come join us and learn why the community is growing so fast! #HPC

app.swapcard.com/event/isc-hi...

10 months ago 0 0 0 0

Want to move beyond xCAT with a provisioner that’s ready for Confidential Computing?

I’ll be at #ISC25 this week talking about OpenCHAMI.

Free and Open Source with a growing community.

openchami.org

10 months ago 1 0 0 0

I’ve been working on OpenCHAMI for a couple of years now. This is an exciting step for the community!

1 year ago 5 2 1 0

This is the worst thing I can tell you about Japan.

1 year ago 2 0 0 0

I spent a week in Japan with wet hands before anyone told me that I needed to carry my own hand towel. Apparently they’ve been pulling this shit on foreigners for centuries.

1 year ago 2 0 0 1

Inside of you there are two wolves. Inside each wolf there are zero, one, or two wolves. Write a function to rebalance an arbitrary wolftree B such that it has minimal depth. The function should execute in O(log n) time. Show your work.

1 year ago 323 59 11 3

That’s really good to know. Thanks!

1 year ago 1 0 0 0

🌟 Hello BlueSky! 🌟

We’re Honeycomb, the observability platform for teams who manage software that matters. Send any data to our one-of-a-kind data store, solve problems with all the relevant context, and fix issues before your customers find them.

1 year ago 60 4 3 3

*no lies detected*

1 year ago 1 0 0 0

Twizzlers are made of braces wax and taste amazing when used as straws for Cherry Coke.

1 year ago 1 0 1 0

Moving from cloud #SRE to #HPC often means recalibrating what metrics matter.

Time to job launch?
Time to completion?
Mean time to job failure?
Time to snapshot recovery?

Cloud makes node loss a non-event. HPC typically doesn’t work that way.
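
A minimal sketch of how the job-level metrics in this post could be computed from accounting records. The `Job` record and its field names are hypothetical, not taken from any particular scheduler, and "time to completion" is assumed here to run from submission to end.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical job record; all times are seconds since some epoch."""
    submitted: float
    started: float
    ended: float
    failed: bool


def time_to_launch(job: Job) -> float:
    # Queue wait: how long the job sat between submission and start.
    return job.started - job.submitted


def time_to_completion(job: Job) -> float:
    # Assumption: measured from submission, so it includes queue wait.
    return job.ended - job.submitted


def mean_time_to_job_failure(jobs: List[Job]) -> Optional[float]:
    # Average runtime of the jobs that failed; None if nothing failed.
    runtimes = [j.ended - j.started for j in jobs if j.failed]
    return mean(runtimes) if runtimes else None
```

In practice the records would come from the scheduler's accounting data; the point is only that each question in the post maps to a simple difference of timestamps.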

1 year ago 10 2 2 0
Isambard 3 Supercomputer: Image Credit: Christy Nunns/University of Bristol

GW4 Isambard 3 #Supercomputer is officially online! 🎉🧠

Part of a collaboration between the universities of Bath, Bristol, Cardiff and Exeter, alongside partners HPE, NVIDIA and Arm, Isambard 3 will push the boundaries of science.

🔗 https://buff.ly/4g7HtMK

1 year ago 14 10 1 0

I once had to email someone with an important corporate email address.

Last Name: Fuchs
First Initial: E
Inexplicable Extra Letter: X

That’s right fuchsex was his official email address.

I often wonder why the X.

1 year ago 1 0 2 0

“It’s like watching someone unlock a padlock on a wrench so they can use it to drive a nail”

Why?

“The padlock doesn’t fit the hammer.”

1 year ago 0 0 0 0

Every single time I’ve been to Bristol, the weather has been fantastic. Highly recommend.

1 year ago 3 0 1 0
SC'24 recap The premier annual conference of the high-performance computing community, SC24, was held in Atlanta last week, and it attracted a reco...

I spent the Thanksgiving break typing up my notes from #SC24 which I've posted online. 30% more words than my notes from SC23 (sorry!). Feedback is welcome!

https://buff.ly/41fBhho

#HPC

1 year ago 53 9 7 2
Picture of two slices of bread, one stacked on top of the other crust side, with ham and cheese in between

This is, technically, a sandwich.

1 year ago 8563 1407 406 357
Astronomy Picture of the Day A different astronomy and space science related image is featured each day, along with a brief explanation.

Lotsa fake Astronomy photos on Bluesky these days. Just remember, if they’re not credited, it’s not credible.

NASA has the original feed and does a good job of curation.

apod.nasa.gov/apod/

1 year ago 3 0 0 0

Cool. Do you know of any large systems that use the feature? Does it help improve boot timing or scalability?

1 year ago 0 0 1 0

As it turns out, when the Thanksgiving pies don’t last all weekend, you’re allowed to make more pie. Who’s going to stop you?

🥧 Maple Pumpkin
🥧 Bourbon Apple

1 year ago 1 0 0 0

I should write a BitTorrent client

1 year ago 1073 20 60 8

Let me guess, high stress but only for a few minutes each day.

1 year ago 0 0 1 0

High-cardinality exploration is a superpower for SRE. Honeycomb changed so much.

1 year ago 1 0 0 0

Tail latency has entered the chat!

1 year ago 1 0 1 0

I'd also recommend looking at these metrics broken down by user/project, and trying to make sure your 1% least reliable subset is still doing OK, or at least getting support, since failures are often not evenly distributed.

I really like this post on the topic: rachelbythebay.com/w/2019/07/15...
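
As a rough sketch of the per-user breakdown suggested here: group job outcomes by user, compute a success rate for each, and pull out the least reliable slice. The input shape (a list of `(user, succeeded)` pairs) and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def success_rate_by_user(jobs: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of successful jobs per user, from (user, succeeded) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for user, succeeded in jobs:
        totals[user] += 1
        if succeeded:
            successes[user] += 1
    return {user: successes[user] / totals[user] for user in totals}


def least_reliable(rates: Dict[str, float],
                   fraction: float = 0.01) -> List[Tuple[str, float]]:
    """Worst `fraction` of users by success rate (always at least one)."""
    count = max(1, int(len(rates) * fraction))
    return sorted(rates.items(), key=lambda item: item[1])[:count]
```

A healthy cluster-wide average can hide a user whose jobs fail half the time, which is why the worst slice is reported separately from the overall rate.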

1 year ago 6 1 1 1

I have a half-written blog post about this that I should finish sometime.

I haven’t seen an SLO framework broadly adopted in HPC, but some sites adopt metrics like:

- % nodes up
- Scheduler RPC latency
- FS latency and BW
- Performance on standard benchmarks, either after maintenance or weekly
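
To make the first metric concrete, here is a small sketch of a "% nodes up" check against a target. The node-state mapping, the `"up"` label, and the 95% target are illustrative assumptions, not values from any site or scheduler.

```python
from typing import Dict


def percent_nodes_up(node_states: Dict[str, str]) -> float:
    """Percentage of nodes whose state is exactly "up" (hypothetical label)."""
    if not node_states:
        raise ValueError("no nodes reported")
    up = sum(1 for state in node_states.values() if state == "up")
    return 100.0 * up / len(node_states)


def meets_slo(measured: float, target: float) -> bool:
    """True when the measured value is at or above the objective."""
    return measured >= target


# Example with made-up data: three of four nodes are up.
states = {"n1": "up", "n2": "up", "n3": "down", "n4": "up"}
availability = percent_nodes_up(states)
ok = meets_slo(availability, target=95.0)  # assumed 95% objective
```

The other metrics in the list (scheduler RPC latency, filesystem latency and bandwidth, benchmark performance) would follow the same pattern: a measured value compared with an agreed target over some window.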

1 year ago 5 1 2 0

I totally agree. Feels like a good provocative talk for SRECon, especially as cloud SRE folks are being asked to support large training systems for AI.

1 year ago 2 0 0 0

I don’t see a lot of talk about #SLO (Service Level Objectives) for administering #HPC clusters. Does anyone have good examples beyond “the cluster is not down”?

1 year ago 5 2 2 0
