"Auto-diagnosing Kubernetes alerts with HolmesGPT and CNCF tools" by Grace Park and Ihyeok Song
www.cncf.io/blog/2026/04...
Posts by Ricardo Castro
"Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale" by Christian Simon
grafana.com/blog/pyrosco...
"K3s on On-Prem Infrastructures the GitOps Way: Writing a Custom k0rdent Template from Scratch" by Shivani Rathod and Prithvi Raj
www.cncf.io/blog/2026/04...
"The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale" by Brett Axler, Casper Choffat, and Alo Lowry
netflixtechblog.com/the-human-in...
Driving change is *always* much harder than you might think at the beginning.
When you get into the details, there's a lot you never even thought about.
"AWS Introduces S3 Files, Bringing File System Access to S3 Buckets" by Renato Losio
www.infoq.com/news/2026/04...
Revision 170 is out!
@koslib.com
#devops #sre #platformengineering
embracerisk.substack.com/p/revision-170
"ingress-nginx to Envoy Gateway migration on CNCF internal services cluster" by Koray Oksay
www.cncf.io/blog/2026/04...
Whether you like it or not, being on-call is probably the best way to understand how your systems really work.
"How GitHub uses eBPF to improve deployment safety" by Lawrence Gripper and Aleksey Levenstein
github.blog/engineering/...
"Dogfooding and platforms: Spotify’s agentic-first development" by Jennifer Riggins
thenewstack.io/dogfooding-a...
"I Was Wrong: OpenTelemetry is Great, the ecosystem around it is the problem" by dusanstanojeviccs
medium.com/@dusan.stano...
"People are credulous creatures who find it very easy to believe and very difficult to doubt. In fact, believing is so easy, and perhaps so inevitable, that it may be more like involuntary comprehension than it is rational assessment"
Daniel Gilbert
"Kubernetes Is Eating Production: Why Usage Keeps Climbing Into 2026" by Melissa Kapnick
www.fairwinds.com/blog/kuberne...
"Inside Adobe's OpenTelemetry pipeline: simplicity at scale" by Johanna Öjeling, Juliano Costa, Tristan Sloughter, Damien Mathieu, Bogdan Stancu
opentelemetry.io/blog/2026/de...
"The three villains to agentic observability: retention, sampling and rollups" by Mike Shi
clickhouse.com/blog/three-v...
Revision 169 is out!
@koslib.com
#devops #sre #platformengineering
embracerisk.substack.com/p/revision-169
"Kubernetes Strategy: When It’s a Fit and Who Should Run It" by Andy Suderman
www.fairwinds.com/blog/kuberne...
"Trust But Canary: Configuration Safety at Scale"
engineering.fb.com/2026/04/08/s...
"Building a Distributed Persistent Queue That Scaled AI Workloads 5x Under LLM Rate Limits" by Karthik Premnath
engineering.salesforce.com/building-a-d...
AI is allowing us to tackle problems that, in the past, we wouldn't even have contemplated.
"The Complete Guide to LLM Observability with OpenTelemetry" by Vprprudhvi
medium.com/@vprprudhvi/...
By making some things faster, AI is helping expose many roadblocks that already existed.
I think the belief that using coding agents is easy is a common misconception.
There's a lot of experimentation required to make things work, and the data suggests that most engineers and companies haven't figured it out yet.