#CloudNativeLondon
Cloud Native London, July 2025, Wed, Jul 2, 2025, 6:00 PM | Meetup Hi folks! Welcome to our July Cloud Native London meetup! Join us to hear from our two great speakers and network with your fellow techies over pizza and drinks, or altern

✈️ kaspernissen.xyz is bringing clarity to London!

Tomorrow at #CloudNativeLondon: breaking down #observability with #OpenTelemetry & #Perses - open, scalable, and devs in control.

Tired of rigid dashboards and expensive tools? This talk is about breaking free.

🔗 www.meetup.com/cloud-native...


Can’t believe another #CloudNativeLondon is over! Today @cpurdy and @gene_gleyzer announced Ecstasy - a brand new language - and @sarahjwells closed with some hot tips on complex and distributed systems 🔥 Are you joining us next year? http://CloudNativelondon.com


What worked before doesn't work in cloud native. Rigorous change management -> agile testing in production. Zero-downtime deployment, not deploy windows.

"Tell your business that you need to have a risk/error budget to move faster." --@sarahjwells [fin] #CloudNativeLondon


Understand your steady state, minimize the blast radius of potential failures, and check your assumptions using chaos engineering or disaster recovery testing. #CloudNativeLondon


Both @sarahjwells and @Yuryu have reinforced today that a backup is not a restore. You need to test that you can restore, otherwise "it's just some files on a disk". #CloudNativeLondon


"When it hurts, do it more often and bring the pain forward," quoting @jezhumble.

Better to discover that your failovers only work when both datacenters are up before you have to rely on them during a hard failure. #CloudNativeLondon


[ed: ow, my head hurts looking at that set of 100+ red/green tiles]. But they're hoping to drop down to 6 tiles next year!

"Your dashboards are scar tissue of your previous incidents," says @sarahjwells quoting yours truly, and also plugging @honeycombio <3. #CloudNativeLondon


They went overboard on metrics and on alerting against them, but then cut back to RED metrics (rate, errors, duration) instead.

Monitoring can tell you when things are wrong, but not *where* they are wrong.

So in the next year they'd like to monitor against business capabilities. #CloudNativeLondon
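
For readers new to the term, RED stands for rate, errors, and duration, measured per service. A minimal sketch of such a summary follows; the `Request` record, the `red_summary` helper, and the choice of a nearest-rank 95th percentile are assumptions for illustration, not what the FT runs.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request ended in an error


def red_summary(requests, window_seconds):
    """Summarise one service's requests as Rate, Errors, Duration."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    # Nearest-rank 95th percentile, using integer ceiling of 0.95 * n.
    p95_index = (95 * n + 99) // 100 - 1
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": n / window_seconds,
        "error_ratio": errors / n,
        "p95_ms": durations[p95_index],
    }
```

Three numbers per service are easier to alert on than hundreds of ad hoc metrics, which seems to be the point of the cutback described above.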


They have done a primitive version of distributed tracing -- passing request ids and structured logging the request id when they write log lines. [ed: although, hey, @sarahjwells we'd love it if you used @honeycombio ;)] #CloudNativeLondon
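
A rough sketch of that pattern, assuming an HTTP-style header named `X-Request-Id` and JSON log lines (both are illustrative choices; the talk did not name a header or log format). `handle_request` and `JsonFormatter` are hypothetical names.

```python
import contextvars
import json
import logging
import uuid

# Holds the current request's ID so every log line can include it.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object carrying the request ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        })


def handle_request(incoming_headers, logger):
    # Reuse the caller's ID if present so one ID follows the whole request path;
    # otherwise this service is the first hop and mints a new one.
    request_id = incoming_headers.get("X-Request-Id") or uuid.uuid4().hex
    request_id_var.set(request_id)
    logger.info("handling request")
    # Headers to attach to any downstream call made for this request.
    return {"X-Request-Id": request_id}
```

Searching aggregated logs for one request ID then gives a poor man's trace across services, without span timing or parent/child structure.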


There's now a way to set measurable goals around the percentage of services with runbook coverage, etc.

But runbooks don't solve everything. You also need to build with observability in mind. Log aggregation has been useful for @sarahjwells's teams. #CloudNativeLondon


Use nudges with checklists & scorecards to enforce paying down operational debt -- ensuring your runbooks are kept up to date and useful.

Once something meets a minimum score, _then_ do the labor intensive human review. #CloudNativeLondon


Every system and service needs to have an owner (that's a team), rather than letting things go stale and contacting people on vacation or who have left the company.

Have a service graph encoding which teams, systems, and products exist. Great for GDPR too! #CloudNativeLondon


Quoting @copyconstruct again, @sarahjwells says that there's a taxonomy of testing in production and it's a spectrum.

There's no point in finding things quickly if you can't fix them quickly. So time to restore service is the most important thing to prioritize. #CloudNativeLondon


Canary releases of code for A/B testing and evaluation can also be useful. #CloudNativeLondon
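
One common way to pick who sees a canary is to hash a stable identifier into a bucket. The sketch below assumes a string user ID and a percentage rollout; `in_canary` is a hypothetical helper, and the talk did not say how the FT routes canary traffic.

```python
import hashlib


def in_canary(user_id, percent):
    """Return True if this user falls inside the canary slice.

    Hashing keeps the decision stable: the same user always lands in the
    same bucket, so they do not flip between versions on every request.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return bucket < percent
```

Raising `percent` only ever adds users to the canary, so a rollout can widen gradually without reshuffling who already has the new code.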


Setting up flowcharts for expected behavior can help you reach a common understanding of what the process should look like (or correct the code!).

Also, use feature flags to separate code release from functionality being enabled. #CloudNativeLondon
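
In its simplest form a feature flag is a lookup that guards the new code path. This sketch uses a plain dict as the flag store and a made-up `new_layout` flag; real setups usually read flags from config or a flag service.

```python
def is_enabled(flag, flags):
    """Unknown flags default to off, so shipping code never enables it by accident."""
    return flags.get(flag, False)


def render_article(title, flags):
    # The new layout is deployed but stays dark until the flag is switched on.
    if is_enabled("new_layout", flags):
        return f"<article class='v2'><h1>{title}</h1></article>"
    return f"<article><h1>{title}</h1></article>"
```

The code ships in one release with the flag off; turning it on later is a config change that can be reverted without a rollback.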


Do contract testing for the key interfaces. [ed: it's... almost like agreeing upon an SLO with your customers, but about the API as well as availability!] #CloudNativeLondon
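
A bare-bones illustration of a contract check, assuming the consumer only cares about three fields of a hypothetical article response. The field names and the `contract_violations` helper are invented for the example; dedicated contract-testing tools do far more.

```python
# What the consumer relies on: field name -> expected type (hypothetical contract).
ARTICLE_CONTRACT = {"id": str, "title": str, "published": bool}


def contract_violations(response, contract=ARTICLE_CONTRACT):
    """List every way the provider's response breaks the agreed contract."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    # Extra fields are allowed: providers may add data without breaking consumers.
    return problems
```

Run against the provider in CI, an empty list means the interface the consumer depends on is still intact.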


Defining success: what does "publish succeeded" mean? For the FT, it has to be in all regions they operate in. So they changed their synthetic prober to check that their synthetic stories appeared in all of them. #CloudNativeLondon
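
The "all regions" definition can be sketched as a check that fails if any single region is missing the story. Here `fetch` is a hypothetical callable standing in for a real per-region lookup, and the region names used with it are placeholders.

```python
def publish_succeeded(story_id, regions, fetch):
    """Return (ok, missing_regions) for a synthetic story.

    fetch(region, story_id) is assumed to return True when that region
    can serve the story. Success means no region is missing it.
    """
    missing = [region for region in regions if not fetch(region, story_id)]
    return (len(missing) == 0, missing)
```

Returning the list of missing regions, not just a boolean, tells whoever gets alerted where to start looking.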


And then, of course, check your synthetic monitoring prober to make sure it is up, but that's a much smaller problem than trying to monitor "up" for your whole system.

You really need synthetic traffic for bursty/low real traffic. 0 QPS sometimes is normal. #CloudNativeLondon


Instead of shifting your tests left, perhaps shift your ability to test rightward towards production.

Have synthetics [ed: or have SLOs!] that expose whether the system is working in prod and let you start debugging if it's not working. #TestInProduction #CloudNativeLondon


You also get brittleness out of your fixtures if you have rigid acceptance tests. It's not a good ROI to spend weeks fixing tests that get out of date. #CloudNativeLondon


"Full stack on your laptop only works to a point; you eventually get a distributed monolith or have to reproduce your cloud provider's services to do that." --@sarahjwells #CloudNativeLondon


So let's talk #TestInProduction. That doesn't mean no pre-release testing. You still need automated testing.

Citing @copyconstruct, have fake versions of your services rather than spinning up the entire stack to test one component. #CloudNativeLondon
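
A small sketch of the idea: the component under test talks to an in-memory fake, not a running dependency. `FakeContentStore` and `publish` are hypothetical names, not services mentioned in the talk.

```python
class FakeContentStore:
    """In-memory stand-in for a hypothetical content-store service."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, body):
        self._docs[doc_id] = body

    def get(self, doc_id):
        return self._docs.get(doc_id)


def publish(store, doc_id, body):
    """Component under test: trims the body and refuses empty articles."""
    cleaned = body.strip()
    if not cleaned:
        raise ValueError("empty article")
    store.put(doc_id, cleaned)
    return doc_id
```

Because `publish` only needs something with `put`, the test exercises its logic without spinning up the rest of the stack.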


.@sarahjwells on error budgets and SLOs: "We aren't a nuclear power plant or hospital. Nobody will die if we're broken for a little while. Things working most of the time, and eventually getting fixed, is good enough." #CloudNativeLondon
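
To make "good enough" concrete, an availability SLO converts directly into allowed downtime. The numbers used with this sketch (99.9% over 30 days) are illustrative assumptions, not targets quoted in the talk.

```python
def error_budget_minutes(slo, window_days):
    """Minutes of downtime the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, window_days, downtime_minutes):
    """Positive: room left to take risks. Negative: budget is spent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

Under those assumed numbers, a 99.9% target over 30 days leaves roughly 43 minutes of budget, which is the slack a team can spend on moving faster.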


We can't do full regression testing on everything, nor should we assume that we only need to test services in isolation; instead, we need to have a risk-driven approach. #CloudNativeLondon


Decrease your change fail rate _and_ increase your release rate. You can have both. 15% failure -> 1% failure rate, and 250x the release rate.

One is a consequence of the other -- smaller changes are easier to understand. #CloudNativeLondon


The lines are blurring: people no longer spend a majority of their time "just writing code" and spend far more time on ancillary full-lifecycle operations.

But this pays off by letting you move faster. #CloudNativeLondon


We must test the resilience of our services and ability to take things down.

Use containers, orchestration, and SaaS, but... then it makes it harder to run it locally and see what happens. Test your interactions with third parties. #CloudNativeLondon


and actually _test_ your automation, or else it'll subtly break and won't work during a real emergency. #CloudNativeLondon


12-factor applications are able to cope with what happens in cloud production environments.

Testing your code is no longer enough; you need to test behaviors you'll see in prod such as being restarted, etc.

and you'll need to automate much more. #CloudNativeLondon


If you're lifting and shifting, you're not getting anything good out of it. Go cloud native all the way to get the benefits of it. #CloudNativeLondon
