
Posts by Musah Abdulai


Got my Google Cloud Professional Cloud DevOps Engineer cert last week (Jan 4).

What I’m taking into production LLM/RAG work: safer deployments, better monitoring/alerting, tighter access/tool controls, and spend limits.

www.credly.com/badges/2ceb1...

3 months ago

Designing with smaller models isn’t just cost-cutting:
• Faster feedback loops
• Easier load planning
• Less-painful mistakes

Use the big models for the 10% of flows where they materially change the outcome.

3 months ago

Don’t ask “how do we make this LLM smarter?”
First ask:
• What are we willing to be wrong about?
• How much are we willing to pay per success?
• Where must a human always stay in the loop?

Good constraints turn AI from a toy into a system.

3 months ago

An AI feature is “MVP” until:
• It has clear SLOs
• It has owners
• It has dashboards
• It has a kill switch

After that, it’s production.
Everything else is a live demo with unsuspecting users.
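The kill switch is the part teams skip most often. A minimal in-process sketch (all names hypothetical; in production the flag would live in shared config like Redis or a feature-flag service, not a local dict):

```python
class FeatureDisabled(Exception):
    pass

class KillSwitch:
    """Default-deny flag registry: unregistered features count as OFF."""
    def __init__(self):
        self._enabled = {}

    def enable(self, feature: str):
        self._enabled[feature] = True

    def disable(self, feature: str):
        self._enabled[feature] = False

    def guard(self, feature: str):
        if not self._enabled.get(feature, False):
            raise FeatureDisabled(feature)

switch = KillSwitch()
switch.enable("ai-summarizer")

def summarize(text: str) -> str:
    switch.guard("ai-summarizer")   # checked before any model call
    return text[:20] + "..."        # placeholder for the real LLM call

summarize("A long document body")
switch.disable("ai-summarizer")     # the "turn it OFF now" path
```

The point is that the guard sits in the request path, so "off" means off immediately, not after a redeploy.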

3 months ago

Your AI platform should answer 3 questions instantly:
• What’s our spend today and who drove it?
• What broke in prod in the last hour?
• Which prompts/tools caused the most failures?
If you need a meeting to answer these, you’re not ready to scale usage.

3 months ago

Before bragging about “AI agents in production”, show:
• Your rate limits
• Your circuit breakers
• Your rollback plan
• Your max monthly spend per tenant

Otherwise it’s not a system, it’s a stunt.
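A circuit breaker is a few dozen lines, not a platform. A minimal sketch (thresholds and the injectable clock are illustrative choices, not a prescribed design):

```python
import time

class CircuitOpen(Exception):
    pass

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow one probe after `cooldown_s`."""
    def __init__(self, threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpen()   # fail fast instead of hammering the model
            self.opened_at = None     # cooldown passed: half-open, try once
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

The clock parameter exists so the breaker is testable without sleeping; in prod you'd leave the default.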

3 months ago

You don’t secure an AI system by “red teaming it once”.
You secure it by:
• Defining what it must never do
• Making those rules enforceable in code
• Monitoring for violations in production
• Having a way to shut it down fast
Policy → controls → telemetry → kill switch.
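"Enforceable in code" plus "monitored" can be one small function. A sketch with hypothetical deny rules (the patterns are illustrative, not a real policy):

```python
import re
from collections import Counter

class PolicyViolation(Exception):
    pass

# Hypothetical rules: block SSN-shaped strings and wire-payment instructions.
FORBIDDEN = [
    ("ssn_like", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("payment_instruction", re.compile(r"\bwire\s+\$?\d", re.IGNORECASE)),
]

violations = Counter()  # the telemetry half: count blocks, by rule name

def enforce(output: str) -> str:
    """Run every model output through the deny rules before it leaves the system."""
    for name, pattern in FORBIDDEN:
        if pattern.search(output):
            violations[name] += 1   # feed this counter to your alerting
            raise PolicyViolation(name)
    return output
```

Policy is the rule list, controls are the raise, telemetry is the counter; the kill switch wraps the whole feature.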

3 months ago

AI agents shouldn’t be trusted by default.
Give them:
• Narrow scope
• Limited tools
• Explicit budgets
• Clear owners
If you can’t answer “who’s on call for this agent?” it has too much power.

3 months ago

“The model is cheap” is not a cost strategy.
Real levers:
• Fewer round trips
• Less useless context
• Smarter routing between models
• Caching stable answers
Every avoided call is 100% cheaper and 100% safer.
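Caching stable answers can be this small. A sketch with an exact-match key (names hypothetical; a real setup would add TTLs and per-tenant keys):

```python
import hashlib

cache = {}
stats = {"hits": 0, "misses": 0}

def cached_answer(prompt: str, model_call):
    """Serve repeated prompts from cache instead of calling the model again."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in cache:
        stats["hits"] += 1          # an avoided call: cheaper AND safer
        return cache[key]
    stats["misses"] += 1
    answer = model_call(prompt)     # placeholder for the real model call
    cache[key] = answer
    return answer

calls = []
def fake_model(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}"

cached_answer("What is our refund policy?", fake_model)
cached_answer("What is our refund policy?", fake_model)  # served from cache
```

The hit/miss counters are the point: you can't claim cache wins you aren't measuring.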

3 months ago

Before tuning prompts, ask:
• What’s the acceptable error rate?
• What’s the max we’re willing to pay per request?
• What does “graceful failure” look like?

LLM systems without these constraints are vibes, not engineering.

4 months ago

An AI agent calling tools is cool.
An AI agent calling tools with:
• Timeouts
• Retry limits
• Circuit breakers
• Spend guards

…is something you can show to your SRE and finance teams without apologizing.
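A sketch of what the guarded version could look like (all names are hypothetical; per-attempt timeouts would be configured on the HTTP client itself, which is why they don't appear in the loop):

```python
class BudgetExceeded(Exception):
    pass

class SpendGuard:
    """Hard ceiling on what one request may spend across all attempts."""
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, usd: float):
        if self.spent_usd + usd > self.cap_usd:
            raise BudgetExceeded(f"cap {self.cap_usd} USD would be exceeded")
        self.spent_usd += usd

def call_tool(fn, *, guard: SpendGuard, cost_per_call_usd: float, max_retries: int = 2):
    """Bounded retries where every attempt is charged, so retries can't loop forever."""
    last_error = None
    for _ in range(max_retries + 1):
        guard.charge(cost_per_call_usd)   # failed attempts still cost money
        try:
            return fn()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise last_error
```

Charging before the attempt (not after) is deliberate: it's the spend guard, not the retry counter, that ultimately stops a runaway loop.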

4 months ago

LLM stacks have 3 pillars:
• Quality → does it help?
• Reliability → does it work today and tomorrow?
• Cost → can we afford success?

Most teams romanticize #1 and discover #2 and #3 when finance and ops show up.

4 months ago

AI cost isn’t “our OpenAI bill is high”.

It’s:
• Engineers debugging flaky agents
• Support fixing silent failures
• RevOps dealing with bad insights

Reliability is a cost-optimization strategy.

4 months ago

“We have an AI agent that can do everything.”

Translation:
• Unbounded scope
• Unpredictable latency
• Unknown worst-case cost
• Impossible to test

Narrow agents with clear contracts > one omnipotent chaos agent.

4 months ago

A lot of “AI observability” talk is dashboards.
What you actually need:
• Can we say “turn this feature OFF now”?
• Can we cap spend per tenant?
• Can we see which prompts keep failing?

Control first, charts later.

4 months ago

LLM reliability trick: design like this 👇

1. Small, cheap model for routing & quick wins
2. Medium model for most requests
3. Big model only for high-value, audited paths

You’ll save cost and reduce how often users see “smart but wrong” answers.
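The router itself can start as a plain function. A sketch where the request fields, intents, and tier names are all illustrative:

```python
# Hypothetical tier router: swap the heuristics for whatever signals you trust.
CHEAP_INTENTS = {"route", "classify", "extract_field"}

def pick_tier(request: dict) -> str:
    if request.get("intent") in CHEAP_INTENTS:
        return "small"        # 1. cheap model for routing & quick wins
    if request.get("high_value"):
        return "large"        # 3. big model only for audited, high-value paths
    return "medium"           # 2. default for most requests
```

Starting with a dumb, readable function also means the routing decision is loggable, which you'll want when someone asks why a request cost what it did.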

4 months ago

Optimize LLM cost like an engineer, not a gambler:
• Measure cost per successful outcome, not per token
• Cache aggressively where correctness is stable
• Use smaller models for validation and guardrails

“We shaved 40% of tokens” means nothing if quality tanked.
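The metric is one division, and it's worth computing. A sketch with made-up numbers showing how a token-saving change can make the real metric worse:

```python
def cost_per_success(events):
    """events: [{'usd': float, 'success': bool}, ...], one entry per request."""
    spend = sum(e["usd"] for e in events)
    wins = sum(1 for e in events if e["success"])
    return spend / wins if wins else float("inf")

# Illustrative scenario: 40% fewer tokens per request, but quality tanked.
before = [{"usd": 0.010, "success": True}] * 90 + [{"usd": 0.010, "success": False}] * 10
after  = [{"usd": 0.006, "success": True}] * 50 + [{"usd": 0.006, "success": False}] * 50
```

Here `before` costs about $0.011 per success and `after` costs $0.012: cheaper tokens, more expensive outcomes.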

4 months ago

Your AI system is “secure” and “reliable”?
Cool. Now show me:
• How you test changes to prompts & tools
• How you roll back a bad deployment
• How you cap spend in a runaway loop

If the answer is manual heroics, you’re not there yet.

4 months ago

AI agents are just microservices that hallucinate.

You still need:
• Timeouts & retries
• Rate limits
• Idempotency
• Cost ceilings

Treat them like unreliable juniors with prod access, not like magic.

4 months ago

If your AI app has:
• No p95 latency target
• No per-query cost budget
• No clear failure modes

…you don’t have a product.
You have an expensive, occasionally helpful surprise.

4 months ago

The most expensive tokens in your RAG system aren’t the ones you send.

They’re the ones that:
• Hit sensitive docs
• Bypass weak filters
• End up screenshotted into Slack forever

Data minimization is a cost control.

4 months ago

Before you optimize RAG latency from 1.2s → 0.8s, ask:

• Do we know our top 10 expensive users?
• Do we know which indexes drive 80% of cost?
• Do we know our riskiest collections?

Performance tuning without cost & risk data is vibes-based engineering.

4 months ago

Your vector DB is now:
• A data warehouse
• A search engine
• An attack surface
• A cost center

Still treating it like a sidecar for “chat with your docs” is how you get surprise invoices and surprise incidents.

4 months ago

Hot take:
“Guardrails” are often a guilt-offload for not doing:
• Proper access control
• Per-tenant isolation
• Input/output logging

LLM wrappers won’t fix a broken security model. They just make it more expensive.

4 months ago

Hidden RAG cost center: abuse.

• No per-user rate limits
• Unlimited queries on expensive models
• Tool calls that hit paid APIs

Congrats, you just built a token-minter for attackers.
Security is also about protecting your wallet.
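Per-user rate limits don't need a platform either; a token bucket per user is the classic shape. A sketch (rate/burst numbers and the explicit `now` parameter are illustrative; in production the buckets would live in shared storage like Redis):

```python
class TokenBucket:
    """Per-user limiter: refill `rate` tokens/second, store at most `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # user_id -> TokenBucket

def allow_request(user_id: str, now: float) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(rate=1.0, burst=3))
    return bucket.allow(now)
```

Passing `now` explicitly keeps it testable; in prod you'd feed it `time.monotonic()`.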

4 months ago

Observability for RAG isn’t just “for quality”:
• Track token spend per user/tenant
• Track which collections are most queried
• Track which prompts hit sensitive docs

Same logs help with cost optimization AND security forensics. Double win.

4 months ago

Every “just in case” token you send has a cost:
• Direct $$
• Latency
• Attack surface

Prune your retrieval:
• Fewer, higher-quality chunks
• Explicit collections
• Permission-aware filters

Spend less, answer faster, leak less.
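Permission-aware filtering is the cheapest of the three to sketch. The key move: filter by ACL *before* ranking, so sensitive chunks never compete for the context window (field names here are illustrative):

```python
def retrieve(user: str, scored_chunks: list[dict], top_k: int = 3) -> list[dict]:
    """ACL filter first, then rank, then keep only the best few chunks."""
    visible = [c for c in scored_chunks if user in c["allowed_users"]]
    ranked = sorted(visible, key=lambda c: c["score"], reverse=True)
    return ranked[:top_k]

chunks = [
    {"id": "pub-1", "score": 0.91, "allowed_users": {"alice", "bob"}},
    {"id": "hr-7",  "score": 0.95, "allowed_users": {"hr-team"}},  # sensitive
    {"id": "pub-2", "score": 0.62, "allowed_users": {"alice", "bob"}},
]

hits = retrieve("bob", chunks, top_k=2)
```

Note that `hr-7` outranks everything on relevance and still never reaches the prompt: that's the "leak less" part, and the dropped tokens are the "spend less" part.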

4 months ago

Your RAG threat model should include finance:
• Prompt injection that triggers many tool calls
• Queries crafted to hit max tokens every time
• Abuse of “unlimited internal use” policies

Attackers don’t need your data if they can just drain your budget.

4 months ago

RAG tradeoff triangle:
• More context → more tokens
• Less context → more hallucinations
• No security → more incidents

Most teams only tune the first two.
Mature teams treat security as a cost dimension too.

4 months ago

“Low token cost” demos lie.

In real life RAG:
• 20–50 retrieved chunks
• Tool calls
• Follow-up questions

Now add:
• No rate limits
• No abuse detection
• No guardrails on tools

Congrats, you’ve built a DoS and data-exfil API with pretty UX.

4 months ago