Posts by Ricardo Castro

Auto-diagnosing Kubernetes alerts with HolmesGPT and CNCF tools What a two-person SRE team learned building an AI investigation pipeline. Spoiler: the runbooks mattered more than the model. At STCLab, our SRE team supports multiple Amazon EKS clusters running…

"Auto-diagnosing Kubernetes alerts with HolmesGPT and CNCF tools" by Grace Park and Ihyeok Song

www.cncf.io/blog/2026/04...

2 hours ago
Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale | Grafana Labs Pyroscope 2.0 makes continuous profiling practical at scale, reducing storage costs, simplifying operations, and helping you find performance bottlenecks faster.

"Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale" by Christian Simon

grafana.com/blog/pyrosco...

22 hours ago
The On-Call Problem AI Can Actually Solve SREcon EMEA Chair Heinrich Hartmann on why AI's highest-value SRE application isn't autonomous remediation — it's closing the on-call knowledge gap.

"The On-Call Problem AI Can Actually Solve" by Peter Farago

www.runllm.com/blog/the-on-...

1 day ago
K3s on On-Prem Infrastructures the GitOps Way: Writing a Custom k0rdent Template from Scratch Kubernetes turns 12 this year. In that time, it’s gone from a Google side project to the operating system of modern infrastructure running everywhere from mainframes to GPUs, across multi-cloud…

"K3s on On-Prem Infrastructures the GitOps Way: Writing a Custom k0rdent Template from Scratch" by Shivani Rathod and Prithvi Raj

www.cncf.io/blog/2026/04...

1 day ago
The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale By: Brett Axler, Casper Choffat, and Alo Lowry

"The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale" by Brett Axler, Casper Choffat, and Alo Lowry

netflixtechblog.com/the-human-in...

2 days ago

Driving change is *always* much harder than you might think in the beginning.

When you get into the details, there's a lot you never even thought about.

2 days ago
What Is OpenTelemetry and Why It Matters A beginner-friendly introduction to OpenTelemetry — what it is, why to use it instead of a vendor's proprietary agent, and how to start with SigNoz.

"What Is OpenTelemetry and Why It Matters" by SigNoz

signoz.io/docs/overvie...

2 days ago
AWS Introduces S3 Files, Bringing File System Access to S3 Buckets AWS recently introduced S3 Files, which lets users mount an Amazon S3 bucket and access its data through a standard file system interface. Applications can read and write files using standard file…

"AWS Introduces S3 Files, Bringing File System Access to S3 Buckets" by Renato Losio

www.infoq.com/news/2026/04...

3 days ago
Revision 170 Articles and updates:

Revision 170 is out!

@koslib.com

#devops #sre #platformengineering

embracerisk.substack.com/p/revision-170

3 days ago
ingress-nginx to Envoy Gateway migration on CNCF internal services cluster CNCF hosts a Kubernetes cluster to run some services for internal purposes (namely; codimd, GUAC, kcp). The Kubernetes Project announced the ingress-nginx retirement (not to be confused with NGINX or…

"ingress-nginx to Envoy Gateway migration on CNCF internal services cluster" by Koray Oksay

www.cncf.io/blog/2026/04...

3 days ago

Whether you like it or not, being on-call is probably the best way to understand how your systems really work.

4 days ago
How GitHub uses eBPF to improve deployment safety Learn how GitHub uses eBPF to detect and prevent circular dependencies in its deployment tooling.

"How GitHub uses eBPF to improve deployment safety" by Lawrence Gripper and Aleksey Levenstein

github.blog/engineering/...

4 days ago
Dogfooding and platforms: Spotify’s agentic-first development Gain exclusive access to the Spotify team to learn what it means that its best developers don't write code anymore. Instead they manage AI agents.

"Dogfooding and platforms: Spotify’s agentic-first development" by Jennifer Riggins

thenewstack.io/dogfooding-a...

5 days ago
I Was Wrong: OpenTelemetry is Great, the ecosystem around it is the problem The OpenTelemetry protocol is excellent. The ecosystem that grew around it still has a way to go.

"I Was Wrong: OpenTelemetry is Great, the ecosystem around it is the problem" by dusanstanojeviccs

medium.com/@dusan.stano...

5 days ago

"People are credulous creatures who find it very easy to believe and very difficult to doubt. In fact, believing is so easy, and perhaps so inevitable, that it may be more like involuntary comprehension than it is rational assessment"

Daniel Gilbert

6 days ago
GitHub - dominikhei/cardamon: Cardamon is a cleanup tool for Prometheus that collects unused metrics from Grafana and Prometheus and generates drop statements for them.

"Cardamon is a metric auditor for Prometheus" by Dominik

github.com/dominikhei/c...

6 days ago
Introducing OTel Tracing in the Pulumi CLI The Pulumi CLI now supports OpenTelemetry tracing, replacing the deprecated OpenTracing integration. Learn how to export traces via gRPC or to a file.

"Introducing OTel Tracing in the Pulumi CLI" by Thomas Gummerer

www.pulumi.com/blog/introdu...

1 week ago
Kubernetes Is Eating Production: Why Usage Keeps Climbing Into 2026 Kubernetes isn’t just up in 2026; it’s becoming the default foundation for production software and AI. How can you run it safely, efficiently, at scale?

"Kubernetes Is Eating Production: Why Usage Keeps Climbing Into 2026" by Melissa Kapnick

www.fairwinds.com/blog/kuberne...

1 week ago
Inside Adobe's OpenTelemetry pipeline: simplicity at scale As part of an ongoing series, the Developer Experience SIG interviews organizations about their real-world OpenTelemetry Collector deployments to share practical lessons with the broader community.…

"Inside Adobe's OpenTelemetry pipeline: simplicity at scale" by Johanna Öjeling, Juliano Costa, Tristan Sloughter, Damien Mathieu, Bogdan Stancu

opentelemetry.io/blog/2026/de...

1 week ago
All-in-one incident management platform | incident.io incident.io is an all-in-one incident management platform unifying on-call scheduling, real-time incident response, and integrated status pages – helping teams resolve issues faster and reduce…

"The post-mortem problem" by incident.io

incident.io/blog/the-pos...

1 week ago
What Nobody Tells You When You Start in SRE - Uptime Labs I.e. the insights of Karan: a former Staff SRE with years of experience spanning software engineering, systems reliability and incident response. He shared his thoughts on what separates great SREs…

"What Nobody Tells You When You Start in SRE" by Karan Nagarajagowda

uptimelabs.io/articles/fir...

1 week ago
The three villains to agentic observability: retention, sampling and rollups Retention limits, sampling, and metric roll-ups aren't observability best practices - they're workarounds for storage systems that can't handle full-fidelity data, and they're becoming a hard blocker…

"The three villains to agentic observability: retention, sampling and rollups" by Mike Shi

clickhouse.com/blog/three-v...

1 week ago
Revision 169 Articles and updates:

Revision 169 is out!

@koslib.com

#devops #sre #platformengineering

embracerisk.substack.com/p/revision-169

1 week ago
Kubernetes Strategy: When It’s a Fit and Who Should Run It When is Kubernetes a good fit, when it’s overkill, what skills you need, and how to choose between running it yourself and using a managed service.

"Kubernetes Strategy: When It’s a Fit and Who Should Run It" by Andy Suderman

www.fairwinds.com/blog/kuberne...

1 week ago

"Trust But Canary: Configuration Safety at Scale"

engineering.fb.com/2026/04/08/s...

1 week ago
Building a Distributed Persistent Queue That Scaled AI Workloads 5x Explore how Salesforce designed a persistent queue that prevents autonomous agents and human workflows from overwhelming shared capacity and much more.

"Building a Distributed Persistent Queue That Scaled AI Workloads 5x Under LLM Rate Limits" by Karthik Premnath

engineering.salesforce.com/building-a-d...

1 week ago

AI is allowing us to tackle problems that, in the past, we wouldn't even have contemplated.

1 week ago
The Complete Guide to LLM Observability with OpenTelemetry How to instrument a real-world GenAI application with traces, metrics, and correlated logs, using a hands-on project.

"The Complete Guide to LLM Observability with OpenTelemetry" by Vprprudhvi

medium.com/@vprprudhvi/...

1 week ago

By making some things faster, AI is helping expose many roadblocks that already existed.

1 week ago

I think the belief that using coding agents is easy is a common misconception.

There's a lot of experimentation required to make things work, and the data suggests that most engineers and companies haven't figured it out yet.

1 week ago