Next level: Thanos downsampling.
30d raw → 1y 5-min rollups → 5y hourly. Your long-retention S3 bill drops by 20-240x. Historical queries get faster.
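The three tiers map one-to-one onto retention flags on the compactor; downsampling itself happens automatically. A sketch (the bucket config path is illustrative):

```shell
thanos compact \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=1y \
  --retention.resolution-1h=5y \
  --objstore.config-file=bucket.yaml
```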
Full FinOps playbook:
podostack.com/p/prometheu... 🛠️
Posts by Ilia Gusev
The fix is labeldrop:
metric_relabel_configs:
- action: labeldrop
regex: pod_template_hash
One line. Cuts your series count by 5x on a busy cluster. Doesn't require touching any application code.
The usual culprits: pod_template_hash (changes every deploy), request_id (unique per request), user_id, git_commit.
Each of these turns a reasonable metric into a cardinality bomb. Most of them should be dropped at scrape time, not stored.
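All four can go in one rule; extend the regex with whatever your own top-20 list shows:

```yaml
metric_relabel_configs:
  - action: labeldrop
    regex: (pod_template_hash|request_id|user_id|git_commit)
```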
Find the offenders in 5 seconds:
topk(20, count by (__name__)({__name__!=""}))
This ranks metrics by how many series they generate. The top 20 usually contains 80% of your total. Write down the names - that's your attack list.
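To confirm a specific label is the problem, count its distinct values (swap in whichever label you suspect):

```promql
count(count by (pod_template_hash) ({__name__!=""}))
```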
Cardinality = unique time series. Not bytes. Not datapoints. Series.
Every unique combination of labels is a new series. Add one label with 1000 values and your 50-series metric becomes 50,000 series. Prometheus loads all of them into RAM.
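The multiplication is worth writing out, because every extra label compounds it (numbers below are the ones from this post):

```python
# Each new label multiplies the series count by its number of distinct values.
base_series = 50        # a modest metric: 50 label combinations
label_values = 1_000    # one "harmless" high-cardinality label
print(base_series * label_values)  # 50000 series, all held in RAM
```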
Your Prometheus memory keeps climbing and nobody knows why.
You're not out of metrics. You're out of cardinality. And it's costing real money - CPU, RAM, S3.
podostack.com/p/prometheu...
The combo - weighted NodePools + broad instance flexibility + PDBs - turns spot from "dev only" into "run literally everything."
Full spot pattern with YAML:
podostack.com/p/karpenter... 🛠️
Pair this with PodDisruptionBudgets on critical workloads and Karpenter handles spot interruption gracefully:
1. AWS sends interruption notice
2. Karpenter cordons the doomed node
3. Drains within PDB limits
4. Provisions replacement
You don't even notice.
The subtle trick: instance flexibility.
Lock spot to one family (c6i only) and it's fragile: one pool exhausts, fallback fires.
Open it to categories c, m, r across generations and architectures, and Karpenter has dozens of pools to pick from. Interruption rates drop.
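A sketch of what that flexibility looks like as NodePool requirements, using Karpenter's well-known labels (tune the lists to your fleet):

```yaml
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["4"]
  - key: kubernetes.io/arch
    operator: In
    values: ["amd64", "arm64"]
```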
Your bill benefits when spot is plentiful (70% discount). Your uptime benefits when spot is scarce (seamless fallback). The pod doesn't know the difference.
One YAML pattern. Works for anything stateless.
Two NodePools:
spot-pool → weight: 100 (higher = higher priority)
ondemand-pool → weight: 50 (fallback)
Karpenter tries the highest-weight NodePool first. If spot capacity isn't available, it immediately falls through to on-demand. The pod never sits Pending.
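A minimal sketch of the two pools (v1 API; Karpenter orders NodePools by weight, highest first; nodeClassRef and other required fields omitted for brevity):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  weight: 100            # higher weight = tried first
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ondemand-pool
spec:
  weight: 50             # fallback when spot is exhausted
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```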
Teams avoid spot in production because "what if capacity runs out?"
The answer isn't "don't use spot." The answer is two NodePools and the weight field.
podostack.com/p/karpenter...
Five patterns. Real production scenarios. YAML you can ship today.
podostack.com/p/karpenter... 🛠️
What's inside:
- NodePool + EC2NodeClass: the responsibility split
- Spot-to-On-Demand fallback with weights
- TopologySpread: the DoNotSchedule trap
- SpotToSpot consolidation
- Why Descheduler is an anti-pattern with Karpenter
New Podo Stack just dropped.
This week: Karpenter Beyond Basics. Five patterns that separate a demo-grade setup from one that actually saves you money in production.
podostack.com/p/karpenter...
nodeSelector for the simplest cases. nodeAffinity for everything else. The five extra lines of YAML save you from Pending pods at 3 AM.
Full guide with YAML examples:
podostack.com/p/kubernete... 🛠️
Anti-pattern: using nodeSelector for zone spreading.
If us-east-1a runs out of capacity, your pods sit Pending. With preferredDuringScheduling you get zone preference without the deadlock. The scheduler does its best but doesn't block.
The killer combo: required + preferred together.
Required: "must be amd64 OR arm64"
Preferred: "prefer arm64 with weight 80"
The scheduler places pods on ARM when available (cheaper), falls back to AMD when ARM is full. One spec. Graceful degradation.
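That combo as a pod-spec fragment (standard nodeAffinity API; the weight is the one from above):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64", "arm64"]
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: ["arm64"]
```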
nodeSelector is a hard match. Label exists = schedule. Label missing = Pending forever. No fallback. No preference. No nuance.
nodeAffinity gives you two modes:
- requiredDuringScheduling (hard rule, like nodeSelector but with operators)
- preferredDuringScheduling (soft preference with weights)
There's a moment in every Kubernetes journey where nodeSelector stops being enough.
You need GPU nodes for ML. You're migrating to ARM64 with a fallback. You want "prefer this pool, but don't crash if it's full."
nodeSelector can't do any of that.
podostack.com/p/kubernete...
Full deep dive on how Temporal kills the state hell - Workflow vs Activity, event sourcing replay, signals, retries, versioning in prod.
podostack.com/p/temporal-... 🛠️
The sleep isn't sleeping your process. It's a durable timer on the Temporal server.
Pod crashes, deployment rolls out, region fails over. The timer fires on schedule. The workflow picks up exactly where it left off.
No cron. No state flags. No in-flight migrations.
Temporal flips the model.
You write the workflow as one sequential function. workflow.Sleep(ctx, 24*time.Hour). A blocking receive on a signal channel. It looks like code that ignores failure.
The platform handles the durability.
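In the Go SDK that shape looks roughly like this; the activity and its names are hypothetical, and only the first two steps of the flow are sketched:

```go
package onboarding

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// SendEmail is a hypothetical activity for this sketch.
func SendEmail(ctx context.Context, userID, kind string) error {
	return nil // a real implementation would call your mailer
}

// OnboardingWorkflow is one sequential function; Temporal persists its
// progress, so the 24h timer survives crashes, deploys, and failovers.
func OnboardingWorkflow(ctx workflow.Context, userID string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	if err := workflow.ExecuteActivity(ctx, SendEmail, userID, "welcome").Get(ctx, nil); err != nil {
		return err
	}

	// A durable timer on the Temporal server, not an in-process sleep.
	if err := workflow.Sleep(ctx, 24*time.Hour); err != nil {
		return err
	}

	return workflow.ExecuteActivity(ctx, SendEmail, userID, "reminder").Get(ctx, nil)
}
```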
This is the state management hell.
The business logic gets smeared across your infrastructure. Debugging means grepping five systems. Changing "wait 3 days" to "wait 4 days" needs a migration for in-flight state.
Testing it? Mock time and half your stack.
Where does "wait 24 hours" actually live?
Cron polling a DB every minute? Delayed queue that drops jobs on broker restart? State flag column with a scheduler + retry table + dead letter queue?
Six lines of logic. Five systems. Zero sleep for the on-call engineer.
You've written this workflow before.
Register user → send email → wait 24h → remind → wait 3 days → bonus or nudge.
Easy in your head. A nightmare across cron, queues, and state flags.
podostack.com/p/temporal-...
Bonus trick: create a NEW index as invisible first, run it in prod on a replica, then flip visible if metrics improve.
A/B testing for index design, zero rollback cost.
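The whole experiment fits in three statements (MySQL 8.0+ syntax; table and index names are made up):

```sql
-- New index starts invisible: maintained on writes, ignored by the optimizer.
CREATE INDEX ix_user_created ON orders (user_id, created_at) INVISIBLE;

-- On the replica session, let the optimizer use it and compare plans:
SET SESSION optimizer_switch = 'use_invisible_indexes=on';

-- If metrics improve, promote it everywhere:
ALTER TABLE orders ALTER INDEX ix_user_created VISIBLE;
```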
Full guide:
podostack.com/p/invisible... 🛠️
The catches:
Primary keys and UNIQUE indexes can't go invisible - they enforce constraints, not just serve reads.
Writes still hit the index - so this tests the READ path, not the write path. To get the write amplification back, you still have to drop it.
The workflow:
1. Make the index invisible
2. Wait 24-48 hours, watch p95 and slow query log
3. If nothing broke - DROP INDEX for real
4. If something broke - flip it back
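Steps 1 and 3 in MySQL syntax, using the same illustrative names as below:

```sql
-- Step 1: the optimizer stops using it instantly; no rebuild, no locks.
ALTER TABLE orders ALTER INDEX ix_user_status INVISIBLE;

-- Step 3: only once the observation window comes back clean.
DROP INDEX ix_user_status ON orders;
```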
It turns "risky DDL" into "safe experiment."
Nothing is faster to roll back.
If something regresses, flip it back:
ALTER TABLE orders ALTER INDEX ix_user_status VISIBLE;
Also instant. No CREATE INDEX on a 50M row table. No locking. The index was never physically gone.