✈️ kaspernissen.xyz is bringing clarity to London!
Tomorrow at #CloudNativeLondon: breaking down #observability with #OpenTelemetry & #Perses - open, scalable, and devs in control.
Tired of rigid dashboards and expensive tools? This talk is about breaking free.
🔗 www.meetup.com/cloud-native...
Can’t believe another #CloudNativeLondon is over! Today @cpurdy and @gene_gleyzer announced Ecstasy, a brand new language, and @sarahjwells closed with some hot tips on complex and distributed systems 🔥 Are you joining us next year? http://CloudNativelondon.com
What worked before doesn't work in cloud native. Rigorous change management -> agile testing in production. Zero downtime deployment, not deploy windows.
"Tell your business that you need to have a risk/error budget to move faster." --@sarahjwells [fin] #CloudNativeLondon
Understand your steady state, minimize the blast radius of potential failures, and check your assumptions using chaos engineering or disaster recovery testing. #CloudNativeLondon
Both @sarahjwells and @Yuryu have reinforced today that a backup is not a restore. You need to test that you can restore, otherwise "it's just some files on a disk". #CloudNativeLondon
"When it hurts, do it more often and bring the pain forward," quoting @jezhumble.
Better to discover that your failover only works when both datacenters are up before you need it during a hard failure. #CloudNativeLondon
[ed: ow, my head hurts looking at that set of 100+ red/green tiles]. But they're hoping to drop down to 6 tiles next year!
"Your dashboards are scar tissue of your previous incidents," says @sarahjwells quoting yours truly, and also plugging @honeycombio <3. #CloudNativeLondon
They went overboard on metrics and overalerting on metrics, but then cut back to RED metrics instead.
Monitoring can tell you when things are wrong, but not *where* they are wrong.
So in the next year they'd like to monitor against business capabilities. #CloudNativeLondon
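For anyone unfamiliar with RED (Rate, Errors, Duration), here is a minimal Python sketch of summarizing one window of requests into those three numbers. The `Request` record and `red_summary` function are hypothetical names for illustration, not anything shown in the talk.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def red_summary(requests, window_seconds):
    """Summarize one window as rate (req/s), error ratio, and median duration."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p50_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        # Upper median; a real system would more likely track histograms.
        "p50_ms": durations[len(durations) // 2],
    }
```

Three numbers per service is a far smaller alerting surface than one alert per metric.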
They have done a primitive version of distributed tracing -- passing request ids and structured logging the request id when they write log lines. [ed: although, hey, @sarahjwells we'd love it if you used @honeycombio ;)] #CloudNativeLondon
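A rough Python sketch of that request id pattern, assuming JSON log lines and an `X-Request-Id` header. Both are common conventions rather than details from the talk, and the function names are hypothetical.

```python
import json
import uuid


def new_request_id():
    """Generate an id at the edge, when a request first enters the system."""
    return uuid.uuid4().hex


def log_line(request_id, service, message, **fields):
    """Build one structured log line that always carries the request id."""
    record = {"request_id": request_id, "service": service, "message": message}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def outgoing_headers(request_id):
    """Headers to attach to downstream calls so the id follows the request."""
    return {"X-Request-Id": request_id}
```

With every service logging the same id, searching the log aggregator for one `request_id` gives a poor man's trace of the request.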
There's now a way to set measurable goals around the percentage of services with runbook coverage, etc.
But runbooks don't solve everything. You also need to build with observability in mind. Log aggregation has been useful for @sarahjwells's teams. #CloudNativeLondon
Use nudges with checklists & scorecards to enforce paying down operational debt -- ensuring your runbooks are kept up to date and useful.
Once something meets a minimum score, _then_ do the labor intensive human review. #CloudNativeLondon
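One way to picture the scorecard gate, as a small Python sketch. The individual checks, the 180-day threshold, and the minimum score are all made-up assumptions; the talk did not list specific criteria.

```python
# Hypothetical automated checks; each takes a runbook dict and returns a bool.
CHECKS = {
    "has_owner": lambda rb: bool(rb.get("owner")),
    "has_restart_steps": lambda rb: bool(rb.get("restart_steps")),
    "reviewed_recently": lambda rb: rb.get("days_since_review", 10**6) <= 180,
}

MINIMUM_SCORE = 2  # assumed threshold before a human spends time reviewing


def score(runbook):
    """Count how many automated checks the runbook passes."""
    return sum(1 for check in CHECKS.values() if check(runbook))


def ready_for_human_review(runbook):
    """Only runbooks that clear the cheap checks get the expensive review."""
    return score(runbook) >= MINIMUM_SCORE
```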
Every system and service needs to have an owner (that's a team), rather than letting things go stale and contacting people on vacation or who have left the company.
Have a service graph encoding which teams, systems, and products exist. Great for GDPR too! #CloudNativeLondon
Quoting @copyconstruct again, @sarahjwells says that there's a taxonomy of testing in production and it's a spectrum.
There's no point in finding things quickly if you can't fix them quickly. So prioritizing time to restore service matters most. #CloudNativeLondon
Canary releases of code for A/B testing and evaluation can also be useful. #CloudNativeLondon
Setting up flowcharts for expected behavior can help you achieve common understanding (or correct the code!) of what the process should look like.
Also, use feature flags to separate code release from functionality being enabled. #CloudNativeLondon
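The feature flag idea in its smallest form, as a Python sketch. The flag name `new_summary_box` and the article fields are hypothetical; a real setup would likely read flags from a flag service or config store.

```python
def render_article(article, flags):
    """The new code path ships with the release but stays off until flagged on."""
    if flags.get("new_summary_box", False):
        return f"{article['title']}\n[summary] {article['summary']}"
    return article["title"]
```

Deploying this changes nothing for readers; flipping the flag later is what enables the functionality, and flipping it back is the rollback.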
Do contract testing for the key interfaces. [ed: it's... almost like agreeing upon an SLO with your customers, but about the API as well as availability!] #CloudNativeLondon
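A minimal Python sketch of a consumer-side contract check, assuming the contract is just "these fields exist with these types". The field names are hypothetical, and dedicated contract-testing tools do considerably more than this.

```python
# Hypothetical contract: the fields a consumer relies on, and their types.
CONTRACT = {"id": str, "title": str, "published": bool}


def contract_violations(response, contract=CONTRACT):
    """Return a list of problems; extra fields in the response are allowed."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
```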
defining success: what does "publish succeeded" mean? for the FT, it has to be in all regions they operate in. So they changed their synthetic prober to check their synthetic stories appeared in all of them. #CloudNativeLondon
And then, of course, check your synthetic monitoring prober to make sure it is up, but that's a much smaller problem than trying to monitor "up" for your whole system.
You really need synthetic traffic for bursty/low real traffic. 0 QPS sometimes is normal. #CloudNativeLondon
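The "published means visible in every region" check could look roughly like this in Python. The `story_visible` callable is a stand-in assumption for whatever actually queries a region; nothing here reflects the FT's real prober.

```python
def publish_succeeded(story_id, regions, story_visible):
    """Success only if the synthetic story is visible in every region.

    story_visible(region, story_id) is a hypothetical callable returning a bool.
    Returns (succeeded, missing_regions) so a failure says where to look.
    """
    missing = [r for r in regions if not story_visible(r, story_id)]
    return (not missing, missing)
```

Because the prober publishes its own story on a schedule, it produces a signal even when real traffic is at 0 QPS.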
Instead of shifting your tests left, perhaps shift your ability to test rightward towards production.
Have synthetics [ed: or have SLOs!] that expose whether the system is working in prod and let you start debugging if it's not working. #TestInProduction #CloudNativeLondon
You also get brittleness out of your fixtures if you have rigid acceptance tests. It's not a good ROI to spend weeks fixing tests that get out of date. #CloudNativeLondon
"Full stack on your laptop only works to a point; you eventually get a distributed monolith or have to reproduce your cloud provider's services to do that." --@sarahjwells #CloudNativeLondon
So let's talk #TestInProduction. That doesn't mean no pre-release testing. You still need automated testing.
Citing @copyconstruct, have fake versions of your services rather than spinning up the entire stack to test one component. #CloudNativeLondon
.@sarahjwells on error budgets and SLOs: "We aren't a nuclear power plant or hospital. Nobody will die if we're broken for a little while. Things working most of the time, and eventually getting fixed, is good enough." #CloudNativeLondon
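To put a number on "broken for a little while": a quick Python sketch of turning an availability target into a time budget. The 30-day window is an assumption, and the function name is hypothetical.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes
```

A 99.9% target over 30 days works out to roughly 43 minutes of budget, which is the figure to bring to the business when asking to move faster.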
We can't do full regression testing on everything, nor should we assume that we only need to test services in isolation; instead, we need to have a risk-driven approach. #CloudNativeLondon
Decrease your change fail rate _and_ increase your release rate. You can have both. 15% failure -> 1% failure rate, and 250x the release rate.
One is a consequence of the other -- smaller changes are easier to understand. #CloudNativeLondon
The lines are blurring: people no longer spend a majority of their time "just writing code" and spend far more time doing ancillary full-lifecycle operations.
But this has a payoff for letting you move faster. #CloudNativeLondon
We must test the resilience of our services and ability to take things down.
Use containers, orchestration, and SaaS, but... then it makes it harder to run it locally and see what happens. Test your interactions with third parties. #CloudNativeLondon
and actually _test_ your automation, or else it'll subtly break and won't work during a real emergency. #CloudNativeLondon
12-factor applications are able to cope with what happens in cloud production environments.
Testing your code is no longer enough; you need to test behaviors you'll see in prod such as being restarted, etc.
and you'll need to automate much more. #CloudNativeLondon
If you're lifting and shifting, you're not getting anything good out of it. Go cloud native all the way to get the benefits of it. #CloudNativeLondon