Ricardo Castro (@mccricardo) Bsky

Auto-diagnosing Kubernetes alerts with HolmesGPT and CNCF tools What a two-person SRE team learned building an AI investigation pipeline. Spoiler: the runbooks mattered more than the model. At STCLab, our SRE team supports multiple Amazon EKS clusters running…

"Auto-diagnosing Kubernetes alerts with HolmesGPT and CNCF tools" by Grace Park and Ihyeok Song

www.cncf.io/blog/2026/04...

2 hours ago

Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale | Grafana Labs Pyroscope 2.0 makes continuous profiling practical at scale, reducing storage costs, simplifying operations, and helping you find performance bottlenecks faster.

"Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale" by Christian Simon

grafana.com/blog/pyrosco...

22 hours ago

The On-Call Problem AI Can Actually Solve SREcon EMEA Chair Heinrich Hartmann on why AI's highest-value SRE application isn't autonomous remediation — it's closing the on-call knowledge gap.

"The On-Call Problem AI Can Actually Solve" by Peter Farago

www.runllm.com/blog/the-on-...

1 day ago

K3s on On-Prem Infrastructures the GitOps Way: Writing a Custom k0rdent Template from Scratch Kubernetes turns 12 this year. In that time, it’s gone from a Google side project to the operating system of modern infrastructure running everywhere from mainframes to GPUs, across multi-cloud…

"K3s on On-Prem Infrastructures the GitOps Way: Writing a Custom k0rdent Template from Scratch" by Shivani Rathod and Prithvi Raj

www.cncf.io/blog/2026/04...

1 day ago

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale By: Brett Axler, Casper Choffat, and Alo Lowry

"The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale" by Brett Axler, Casper Choffat, and Alo Lowry

netflixtechblog.com/the-human-in...

2 days ago

Driving change is *always* much harder than you might think in the beginning.

When you get into the details, there's a lot you never even thought about.

2 days ago

What Is OpenTelemetry and Why It Matters A beginner-friendly introduction to OpenTelemetry — what it is, why to use it instead of a vendor's proprietary agent, and how to start with SigNoz.

"What Is OpenTelemetry and Why It Matters" by SigNoz

signoz.io/docs/overvie...

2 days ago

AWS Introduces S3 Files, Bringing File System Access to S3 Buckets AWS recently introduced S3 Files, which lets users mount an Amazon S3 bucket and access its data through a standard file system interface. Applications can read and write files using standard file…

"AWS Introduces S3 Files, Bringing File System Access to S3 Buckets" by Renato Losio

www.infoq.com/news/2026/04...

3 days ago

Revision 170 Articles and updates:

Revision 170 is out!

@koslib.com

#devops #sre #platformengineering

embracerisk.substack.com/p/revision-170

3 days ago

ingress-nginx to Envoy Gateway migration on CNCF internal services cluster CNCF hosts a Kubernetes cluster to run some services for internal purposes (namely; codimd, GUAC, kcp). The Kubernetes Project announced the ingress-nginx retirement (not to be confused with NGINX or…

"ingress-nginx to Envoy Gateway migration on CNCF internal services cluster" by Koray Oksay

www.cncf.io/blog/2026/04...

3 days ago

Whether you like it or not, being on-call is probably the best way to understand how your systems really work.

4 days ago

How GitHub uses eBPF to improve deployment safety Learn how GitHub uses eBPF to detect and prevent circular dependencies in its deployment tooling.

"How GitHub uses eBPF to improve deployment safety" by Lawrence Gripper and Aleksey Levenstein

github.blog/engineering/...

4 days ago

Dogfooding and platforms: Spotify’s agentic-first development Gain exclusive access to the Spotify team to learn what it means that its best developers don't write code anymore. Instead they manage AI agents.

"Dogfooding and platforms: Spotify’s agentic-first development" by Jennifer Riggins

thenewstack.io/dogfooding-a...

5 days ago

I Was Wrong: OpenTelemetry is Great, the ecosystem around it is the problem The OpenTelemetry protocol is excellent. The ecosystem that grew around it still has a way to go.

"I Was Wrong: OpenTelemetry is Great, the ecosystem around it is the problem" by dusanstanojeviccs

medium.com/@dusan.stano...

5 days ago

"People are credulous creatures who find it very easy to believe and very difficult to doubt. In fact, believing is so easy, and perhaps so inevitable, that it may be more like involuntary comprehension than it is rational assessment"

Daniel Gilbert

6 days ago

GitHub - dominikhei/cardamon: Cardamon is a cleanup tool for Prometheus that collects unused metrics from Grafana and Prometheus and generates drop statements for them.

"Cardamon is a metric auditor for Prometheus" by Dominik

github.com/dominikhei/c...

6 days ago

Introducing OTel Tracing in the Pulumi CLI The Pulumi CLI now supports OpenTelemetry tracing, replacing the deprecated OpenTracing integration. Learn how to export traces via gRPC or to a file.

"Introducing OTel Tracing in the Pulumi CLI" by Thomas Gummerer

www.pulumi.com/blog/introdu...

1 week ago

Kubernetes Is Eating Production: Why Usage Keeps Climbing Into 2026 Kubernetes isn’t just up in 2026; it’s becoming the default foundation for production software and AI. How can you run it safely, efficiently, at scale?

"Kubernetes Is Eating Production: Why Usage Keeps Climbing Into 2026" by Melissa Kapnick

www.fairwinds.com/blog/kuberne...

1 week ago

Inside Adobe's OpenTelemetry pipeline: simplicity at scale As part of an ongoing series, the Developer Experience SIG interviews organizations about their real-world OpenTelemetry Collector deployments to share practical lessons with the broader community.…

"Inside Adobe's OpenTelemetry pipeline: simplicity at scale" by Johanna Öjeling, Juliano Costa, Tristan Sloughter, Damien Mathieu, Bogdan Stancu

opentelemetry.io/blog/2026/de...

1 week ago

All-in-one incident management platform | incident.io incident.io is an all-in-one incident management platform unifying on-call scheduling, real-time incident response, and integrated status pages – helping teams resolve issues faster and reduce…

"The post-mortem problem" by incident.io

incident.io/blog/the-pos...

1 week ago

What Nobody Tells You When You Start in SRE - Uptime Labs I.e. the insights of Karan: a former Staff SRE with years of experience spanning software engineering, systems reliability and incident response. He shared his thoughts on what separates great SREs…

"What Nobody Tells You When You Start in SRE" by Karan Nagarajagowda

uptimelabs.io/articles/fir...

1 week ago

The three villains to agentic observability: retention, sampling and rollups Retention limits, sampling, and metric roll-ups aren't observability best practices - they're workarounds for storage systems that can't handle full-fidelity data, and they're becoming a hard blocker…

"The three villains to agentic observability: retention, sampling and rollups" by Mike Shi

clickhouse.com/blog/three-v...

1 week ago

Revision 169 Articles and updates:

Revision 169 is out!

@koslib.com

#devops #sre #platformengineering

embracerisk.substack.com/p/revision-169

1 week ago

Kubernetes Strategy: When It’s a Fit and Who Should Run It When is Kubernetes a good fit, when it’s overkill, what skills you need, and how to choose between running it yourself and using a managed service.

"Kubernetes Strategy: When It’s a Fit and Who Should Run It" by Andy Suderman

www.fairwinds.com/blog/kuberne...

1 week ago

"Trust But Canary: Configuration Safety at Scale"

engineering.fb.com/2026/04/08/s...

1 week ago

Building a Distributed Persistent Queue That Scaled AI Workloads 5x Explore how Salesforce designed a persistent queue that prevents autonomous agents and human workflows from overwhelming shared capacity and much more.

"Building a Distributed Persistent Queue That Scaled AI Workloads 5x Under LLM Rate Limits" by Karthik Premnath

engineering.salesforce.com/building-a-d...

1 week ago

AI is allowing us to tackle problems we wouldn't even have contemplated in the past.

1 week ago

The Complete Guide to LLM Observability with OpenTelemetry How to instrument a real-world GenAI application with traces, metrics, and correlated logs, using a hands-on project.

"The Complete Guide to LLM Observability with OpenTelemetry" by Vprprudhvi

medium.com/@vprprudhvi/...

1 week ago

By making some things faster, AI is helping expose many roadblocks that already existed.

1 week ago

I think the belief that using coding agents is easy is a common misconception.

There's a lot of experimentation required to make things work, and the data suggests that most engineers and companies haven't figured it out yet.

1 week ago

Posts by Ricardo Castro