Got my Google Cloud Professional Cloud DevOps Engineer cert last week (Jan 4).
What I’m taking into production LLM/RAG work: safer deployments, better monitoring/alerting, tighter access/tool controls, and spend limits.
www.credly.com/badges/2ceb1...
Designing with smaller models isn’t just cost-cutting:
• Faster feedback loops
• Easier load planning
• Less-painful mistakes
Use the big models for the 10% of flows where they materially change the outcome.
Don’t ask “how do we make this LLM smarter?”
First ask:
• What are we willing to be wrong about?
• How much are we willing to pay per success?
• Where must a human always stay in the loop?
Good constraints turn AI from a toy into a system.
An AI feature is “MVP” until:
• It has clear SLOs
• It has owners
• It has dashboards
• It has a kill switch
After that, it’s production.
Everything else is a live demo with unsuspecting users.
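The kill switch is the easiest of these to sketch. A minimal version, assuming a shared flag store (a dict here; in production this would be a feature-flag service or config system — all names below are illustrative):

```python
# Hypothetical feature name; in production this lives in a flag service.
FLAGS = {"ai_summarizer": True}

def kill(feature: str) -> None:
    """Flip the switch: all traffic for this feature stops immediately."""
    FLAGS[feature] = False

def call_model(prompt: str) -> str:
    # Placeholder for the real model call.
    return f"answer to: {prompt}"

def handle_request(feature: str, prompt: str) -> str:
    if not FLAGS.get(feature, False):
        # Graceful fallback instead of a silent failure.
        return "This feature is temporarily unavailable."
    return call_model(prompt)

kill("ai_summarizer")  # one call, feature is off for everyone
```

The point isn't the dict — it's that "turn this OFF now" is one function call, not a redeploy.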
Your AI platform should answer 3 questions instantly:
• What’s our spend today and who drove it?
• What broke in prod in the last hour?
• Which prompts/tools caused the most failures?
If you need a meeting to answer these, you’re not ready to scale usage.
Before bragging about “AI agents in production”, show:
• Your rate limits
• Your circuit breakers
• Your rollback plan
• Your max monthly spend per tenant
Otherwise it’s not a system, it’s a stunt.
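The spend cap is the one most teams skip. A sketch of a per-tenant monthly cap (the number and field names are assumptions; wire this to your billing events):

```python
from collections import defaultdict

MAX_MONTHLY_SPEND_USD = 500.0          # illustrative cap
spend = defaultdict(float)             # tenant_id -> spend this month

def record_cost(tenant: str, usd: float) -> None:
    spend[tenant] += usd

def allow_request(tenant: str) -> bool:
    # Hard stop once a tenant hits the cap. Alert a human; don't
    # silently keep burning money.
    return spend[tenant] < MAX_MONTHLY_SPEND_USD
```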
You don’t secure an AI system by “red teaming it once”.
You secure it by:
• Defining what it must never do
• Making those rules enforceable in code
• Monitoring for violations in production
• Having a way to shut it down fast
Policy → controls → telemetry → kill switch.
AI agents shouldn’t be trusted by default.
Give them:
• Narrow scope
• Limited tools
• Explicit budgets
• Clear owners
If you can’t answer “who’s on call for this agent?” it has too much power.
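All four bullets fit in one explicit contract. A sketch, where every field name is an assumption about what "narrow scope" looks like in your stack:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentContract:
    name: str
    allowed_tools: frozenset   # limited tools
    max_usd_per_day: float     # explicit budget
    oncall_owner: str          # who gets paged

    def can_use(self, tool: str) -> bool:
        return tool in self.allowed_tools

# Hypothetical agent: scope, budget, and owner are all visible in one place.
billing_agent = AgentContract(
    name="billing-helper",
    allowed_tools=frozenset({"lookup_invoice", "send_summary"}),
    max_usd_per_day=20.0,
    oncall_owner="payments-oncall@example.com",
)
```

If a tool isn't in the frozenset, the agent can't call it. If `oncall_owner` is empty, don't ship it.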
“The model is cheap” is not a cost strategy.
Real levers:
• Fewer round trips
• Less useless context
• Smarter routing between models
• Caching stable answers
Every avoided call is 100% cheaper and 100% safer.
Before tuning prompts, ask:
• What’s the acceptable error rate?
• What’s the max we’re willing to pay per request?
• What does “graceful failure” look like?
LLM systems without these constraints are vibes, not engineering.
An AI agent calling tools is cool.
An AI agent calling tools with:
• Timeouts
• Retry limits
• Circuit breakers
• Spend guards
…is something you can show to your SRE and finance teams without apologizing.
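A sketch of what that wrapper can look like — retry limit, circuit breaker, and spend guard in one place (timeouts would normally come from the HTTP client; thresholds are illustrative):

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.failures = 0
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.opened_at = 0.0

    @property
    def open(self) -> bool:
        return (self.failures >= self.max_failures
                and time.monotonic() - self.opened_at < self.cooldown_s)

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def guarded_call(tool, breaker: CircuitBreaker, budget: dict,
                 cost_usd: float, max_retries: int = 2):
    if breaker.open:
        raise RuntimeError("circuit open: tool disabled")
    if budget["spent"] + cost_usd > budget["cap"]:
        raise RuntimeError("spend guard: budget exhausted")
    for attempt in range(max_retries + 1):
        try:
            result = tool()
            breaker.record(ok=True)
            budget["spent"] += cost_usd
            return result
        except Exception:
            breaker.record(ok=False)
            if attempt == max_retries or breaker.open:
                raise
```

Every tool call goes through `guarded_call`; no agent gets a raw client.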
LLM stacks have 3 pillars:
• Quality → does it help?
• Reliability → does it work today and tomorrow?
• Cost → can we afford success?
Most teams romanticize #1 and discover #2 and #3 when finance and ops show up.
AI cost isn’t “our OpenAI bill is high”.
It’s:
• Engineers debugging flaky agents
• Support fixing silent failures
• RevOps dealing with bad insights
Reliability is a cost-optimization strategy.
“We have an AI agent that can do everything.”
Translation:
• Unbounded scope
• Unpredictable latency
• Unknown worst-case cost
• Impossible to test
Narrow agents with clear contracts > one omnipotent chaos agent.
A lot of “AI observability” talk is dashboards.
What you actually need:
• Can we say “turn this feature OFF now”?
• Can we cap spend per tenant?
• Can we see which prompts keep failing?
Control first, charts later.
LLM reliability trick: design like this 👇
1. Small, cheap model for routing & quick wins
2. Medium model for most requests
3. Big model only for high-value, audited paths
You’ll save cost and reduce how often users see “smart but wrong” answers.
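The three tiers above can be sketched as a router. Tier selection here is rule-based for illustration; real systems often use a cheap classifier model for this step, and all the names are assumptions:

```python
TIERS = {
    "small":  "small-model",   # routing & quick wins
    "medium": "medium-model",  # most requests
    "large":  "large-model",   # high-value, audited paths only
}

def route(request: dict) -> str:
    # Cheap, deterministic wins go to the small model.
    if request.get("intent") in {"greeting", "faq"}:
        return TIERS["small"]
    # The big model is gated behind BOTH value and auditing.
    if request.get("high_value") and request.get("audited"):
        return TIERS["large"]
    # Default: the medium model carries most of the load.
    return TIERS["medium"]
```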
Optimize LLM cost like an engineer, not a gambler:
• Measure cost per successful outcome, not per token
• Cache aggressively where correctness is stable
• Use smaller models for validation and guardrails
“We shaved 40% of tokens” means nothing if quality tanked.
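Cost per successful outcome is one division, but it's the division that keeps token-shaving honest (numbers below are made up to show the trap):

```python
def cost_per_success(total_usd: float, successes: int) -> float:
    """A 40% token cut that halves your success rate makes this WORSE."""
    if successes == 0:
        return float("inf")
    return total_usd / successes

# Cheaper per attempt, but fewer successes -> more expensive per outcome.
before = cost_per_success(total_usd=100.0, successes=800)  # 0.125
after  = cost_per_success(total_usd=60.0,  successes=400)  # 0.15
```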
Your AI system is “secure” and “reliable”?
Cool. Now show me:
• How you test changes to prompts & tools
• How you roll back a bad deployment
• How you cap spend in a runaway loop
If the answer is manual heroics, you’re not there yet.
AI agents are just microservices that hallucinate.
You still need:
• Timeouts & retries
• Rate limits
• Idempotency
• Cost ceilings
Treat them like unreliable juniors with prod access, not like magic.
If your AI app has:
• No p95 latency target
• No per-query cost budget
• No clear failure modes
…you don’t have a product.
You have an expensive, occasionally helpful surprise.
The most expensive tokens in your RAG system aren’t the ones you send.
They’re the ones that:
• Hit sensitive docs
• Bypass weak filters
• End up screenshotted into Slack forever
Data minimization is a cost control.
Before you optimize RAG latency from 1.2s → 0.8s, ask:
• Do we know our top 10 expensive users?
• Do we know which indexes drive 80% of cost?
• Do we know our riskiest collections?
Performance tuning without cost & risk data is vibes-based engineering.
Your vector DB is now:
• A data warehouse
• A search engine
• An attack surface
• A cost center
Still treating it like a sidecar for “chat with your docs” is how you get surprise invoices and surprise incidents.
Hot take:
“Guardrails” are often a guilt-offload for not doing:
• Proper access control
• Per-tenant isolation
• Input/output logging
LLM wrappers won’t fix a broken security model. They just make it more expensive.
Hidden RAG cost center: abuse.
• No per-user rate limits
• Unlimited queries on expensive models
• Tool calls that hit paid APIs
Congrats, you just built a token-minter for attackers.
Security is also about protecting your wallet.
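Per-user rate limiting is the cheapest fix on that list. A token-bucket sketch (capacity and refill rate are illustrative; a real deployment would back this with Redis or similar):

```python
import time

class TokenBucket:
    def __init__(self, capacity: int = 10, refill_per_s: float = 0.5):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_s = refill_per_s
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, up to capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def allow_query(user_id: str) -> bool:
    return buckets.setdefault(user_id, TokenBucket()).allow()
```

Reject before you retrieve, and the attacker pays in 429s instead of your tokens.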
Observability for RAG isn’t just about quality:
• Track token spend per user/tenant
• Track which collections are most queried
• Track which prompts hit sensitive docs
Same logs help with cost optimization AND security forensics. Double win.
Every “just in case” token you send has a cost:
• Direct $$
• Latency
• Attack surface
Prune your retrieval:
• Fewer, higher-quality chunks
• Explicit collections
• Permission-aware filters
Spend less, answer faster, leak less.
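A sketch of that pruning step — permission filter first, then keep only the top-k chunks by score (the `acl`/`score` field names are assumptions about your chunk metadata):

```python
def prune(chunks: list[dict], user_groups: set[str], k: int = 5) -> list[dict]:
    # Permission-aware filter BEFORE ranking: chunks the user can't see
    # never reach the prompt, so they can't leak or cost tokens.
    allowed = [c for c in chunks if c["acl"] & user_groups]
    # Fewer, higher-quality chunks: top-k by retrieval score.
    allowed.sort(key=lambda c: c["score"], reverse=True)
    return allowed[:k]
```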
Your RAG threat model should include finance:
• Prompt injection that triggers many tool calls
• Queries crafted to hit max tokens every time
• Abuse of “unlimited internal use” policies
Attackers don’t need your data if they can just drain your budget.
RAG tradeoff triangle:
• More context → more tokens
• Less context → more hallucinations
• No security → more incidents
Most teams only tune the first two.
Mature teams treat security as a cost dimension too.
“Low token cost” demos lie.
In real life RAG:
• 20–50 retrieved chunks
• Tool calls
• Follow-up questions
Now add:
• No rate limits
• No abuse detection
• No guardrails on tools
Congrats, you’ve built a DoS and data-exfil API with pretty UX.