📢 The State of Model Serving Communities: April Edition is out!
Our goal with this newsletter is to give a clear, community-driven view of what's happening across the model serving ecosystem, including updates from projects like vLLM, KServe, @llm-d.ai, @kubernetes.io, Llama Stack, and more.
Posts by llm-d
Check out the latest newsletter to stay up to speed on the changes happening in the model serving communities!
ICYMI: llm-d is officially a @CNCF Sandbox project! 🎉
We're evolving #Kubernetes into SOTA AI infrastructure through a powerhouse coalition including Red Hat, Google Cloud, IBM Research, NVIDIA, Mistral AI, Hugging Face, and many more.
www.cncf.io/blog/2026/03...
It's official: llm-d has joined the cncf.io! 🎉
Our mission to evolve Kubernetes into SOTA AI infrastructure just got a massive boost. This milestone belongs to our amazing community.
Thank you for building this with us. 🙏
We're just getting started!
🔗 www.cncf.io/blog/2026/03...
Deploying or scaling LLM inference? This is the room to be in.
The vLLM Inference Meetup hits Boston on March 31! Join us for an evening of deep technical sessions, live demos, and real conversations with the community.
📅 Mar 31, 5PM
📍 314 Main St, Cambridge
🔗 luma.com/4rmkrrb7
LLMInferenceService is now fully production-ready and built on the high-performance @llm-d.ai framework.
What's included?
- KV-cache aware routing and disaggregated prefill-decode to maximize throughput.
The results with Llama 3.1 8B:
✅ Lower TTFT on cache hits.
✅ Full visibility into scoring decisions.
✅ Improved throughput & GPU utilization.
Watch the full walkthrough: youtu.be/NN-1JvnMMrU
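For intuition on the disaggregated prefill-decode pattern mentioned above, here is a minimal, self-contained Python sketch. Everything in it (the worker names, the block size, the handle type) is an illustrative assumption rather than llm-d's or KServe's actual API: the compute-heavy prompt pass runs once on a prefill worker, and its KV cache is handed to a separate decode worker so long prompts don't stall the token-by-token decode batch.

```python
# Illustrative sketch of disaggregated prefill/decode (not llm-d's actual API).
from dataclasses import dataclass, field


@dataclass
class KVCacheHandle:
    """Opaque reference to KV blocks produced during prefill."""
    request_id: str
    num_blocks: int


@dataclass
class PrefillWorker:
    block_size: int = 16  # tokens per KV block (assumed)

    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVCacheHandle:
        # Run the full prompt forward pass once; in a real system this is the
        # GPU-heavy step whose KV cache is then transferred to a decode pod.
        num_blocks = -(-len(prompt_tokens) // self.block_size)  # ceil division
        return KVCacheHandle(request_id=request_id, num_blocks=num_blocks)


@dataclass
class DecodeWorker:
    received: dict = field(default_factory=dict)

    def attach(self, handle: KVCacheHandle) -> None:
        # Receive the prefilled KV blocks before token-by-token decode starts.
        self.received[handle.request_id] = handle

    def decode(self, request_id: str, max_new_tokens: int) -> list[str]:
        assert request_id in self.received, "KV cache must arrive before decode"
        return [f"token_{i}" for i in range(max_new_tokens)]  # placeholder output


if __name__ == "__main__":
    prefiller, decoder = PrefillWorker(), DecodeWorker()
    handle = prefiller.prefill("req-1", prompt_tokens=list(range(100)))
    decoder.attach(handle)
    print(decoder.decode("req-1", max_new_tokens=4))
```

In a real deployment the handoff is a network transfer of KV blocks (e.g. via NIXL), which is exactly why transfer bandwidth and routing matter so much.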
Watch this preview of distributed tracing (llm-d 0.6) and Prefix Cache-Aware Routing.
🔹 State Tracking: llm-d tracks KV-cache state via ZMQ.
🔹 Smart Scoring: EPP pods tokenize prompts and query the cache index to find cached blocks.
🔹 Optimal Routing: Requests go to the pod with the best cache hit.
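To make the scoring step concrete, here is a toy Python sketch of prefix cache-aware pod selection. It is not the EPP's actual implementation: the block size, the hashing scheme, and the pod_blocks index (which in llm-d would be kept current from the KV-cache state tracked over ZMQ, as described above) are assumptions for illustration only.

```python
# Toy sketch of prefix cache-aware scoring (assumed names and hashing scheme).
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumption)


def block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes of full prefix blocks, mirroring prefix-cache keying."""
    hashes, prev = [], ""
    for start in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = hashlib.sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def score_pods(prompt_tokens: list[int], pod_blocks: dict[str, set[str]]) -> dict[str, int]:
    """Score each pod by how many leading prompt blocks it already caches."""
    prompt_hashes = block_hashes(prompt_tokens)
    scores = {}
    for pod, cached in pod_blocks.items():
        hits = 0
        for h in prompt_hashes:  # only a contiguous prefix counts
            if h in cached:
                hits += 1
            else:
                break
        scores[pod] = hits
    return scores


if __name__ == "__main__":
    prompt = list(range(64))  # 4 full blocks
    index = {
        "pod-a": set(block_hashes(prompt[:32])),  # 2 prefix blocks cached
        "pod-b": set(block_hashes(prompt)),       # all 4 blocks cached
    }
    scores = score_pods(prompt, index)
    print(scores, "->", max(scores, key=scores.get))  # routes to pod-b
```

The key design point is that only a contiguous prefix of cached blocks counts, because decode can only reuse KV entries up to the first miss.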
The first llm-d NYC Meetup is live!
Deep dive into the open-source stack for cloud-native inference with @IBMResearch, @AMD, and @RedHat
State-aware scheduling & KV cache reuse
P/D disaggregation
Scaling MoE models
AMD ROCm & llm-d
Watch: www.youtube.com/watch?v=_ZBQ...
📢 The State of Model Serving Communities: March Edition is out!
We launched our newsletter publicly last year to share our Red Hat AI teams' contributions to upstream communities. We've gained over 1,300 subscribers!
Final Call: NYC 🗽
Registration for the llm-d Meetup closes Tuesday, March 10.
Join the community this Wednesday at the IBM 1 Madison office for a deep dive into llm-d 0.5, MoE scaling, and KV-cache offloading.
Don't miss a night of high-signal technical talks.
🎟️ Register now: luma.com/0crwqwg4
Planning to join us in NYC next week?
Registration for the llm-d Distributed Inference Meetup closes this Tuesday, March 10th.
Don't miss out on a night of technical talks and networking with the community at the IBM 1 Madison office. Grab your spot now!
🎟️ luma.com/0crwqwg4
#llmd #NYCMeetup
What's on the agenda for next Wednesday's NYC meetup?
🛠️ Intro to llm-d 0.5
⚡️ Distributed LLM serving on AMD
🧠 Lessons scaling Wide-EP and MoE
💾 KV-cache offloading & prefix scheduling
Join the engineers building the future of open-source inference.
Details: luma.com/0crwqwg4
Join us next week in NYC with the llm-d community for a deep dive into distributed inference.
We're talking llm-d 0.5, scaling MoE models, and KV-cache offloading.
If you're building LLM infra, don't miss this.
📅 March 11th
📍 1 Madison Ave
Register: luma.com/0crwqwg4
In the latest llm-d release, we're tackling high hardware costs with the new GPU Recommendation Tool!
Evaluate throughput, latency, and cost-effectiveness before requesting expensive cluster resources.
Check out the full demo: www.youtube.com/watch?v=Y26i...
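As a rough illustration of the kind of trade-off such a tool weighs (not the tool's actual inputs, metrics, or numbers), here is a back-of-the-envelope cost-per-million-tokens comparison across two hypothetical GPU profiles:

```python
# Illustrative only: hypothetical GPU profiles, not output from the llm-d tool.
from dataclasses import dataclass


@dataclass
class GpuProfile:
    name: str
    tokens_per_sec: float   # estimated decode throughput per GPU (assumed)
    usd_per_hour: float     # on-demand price (assumed)
    p50_ttft_ms: float      # median time-to-first-token (assumed)


def cost_per_million_tokens(p: GpuProfile) -> float:
    tokens_per_hour = p.tokens_per_sec * 3600
    return p.usd_per_hour / tokens_per_hour * 1_000_000


candidates = [
    GpuProfile("gpu-small", tokens_per_sec=900, usd_per_hour=1.10, p50_ttft_ms=420),
    GpuProfile("gpu-large", tokens_per_sec=2600, usd_per_hour=3.50, p50_ttft_ms=180),
]

for p in candidates:
    print(f"{p.name}: ${cost_per_million_tokens(p):.2f} per 1M tokens, "
          f"TTFT ~{p.p50_ttft_ms} ms")
```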
There are many more sessions and community meetups happening throughout the year.
Check the full calendar for session details, room numbers, and the complete list of talks from the llm-d community:
🔗 llm-d.ai/docs/communi...
📍 Stop 3: PyTorch Conference Europe
📅 April 7–8 | Paris
Deep technical tracks on chunked decoding, preemptive scheduling, and disaggregated tokenization. We'll be sharing the latest on state-aware serving with vLLM + llm-d.
Full Schedule: events.linuxfoundation.org/pytorch-conf...
📍 Stop 2: KubeCon Europe
📅 March 23–26 | Amsterdam
From Istio Day to the main stage, we're talking AI-aware routing and KV-cache scheduling. Don't miss our tutorial on building resilient LLM gateways with Kubernetes.
Details: events.linuxfoundation.org/kubecon-clou...
📍 Stop 1: NYC Distributed Inference Meetup
📅 March 11 | IBM Innovation Studio
We're diving into the weeds of llm-d 0.5, Wide-EP, and MoE model scaling. Perfect for anyone in the city looking to optimize LLM serving on AMD and beyond.
Register: luma.com/0crwqwg4
Where to find the llm-d community over the next 2 months 🧵
We have a busy Spring ahead with sessions in NYC, Amsterdam, and Paris. If you're building open-source infrastructure for distributed inference, come join the conversation. ⬇️
The agenda is still evolving, and we've got even more awesomeness in the works!
Whether you're running GenAI in production or building the platforms to support it, this is the room to be in.
📅 March 11 | 4:30 PM
📍 1 Madison Ave, NYC
🎟️ RSVP: luma.com/0crwqwg4
Hosted by Red Hat AI, IBM Research, and AMD. 🤝
If you're building or scaling models, this event is for you.
We're bringing together maintainers and engineers working on:
🔹 llm-d project roadmap
🔹 Optimizing for AMD hardware
🔹 Scaling MoE (Mixture-of-Experts)
🔹 KV-Cache & Prefix-caching performance
NYC: Ready to go deep on Distributed Inference? 🗽
The llm-d community is hitting Manhattan on March 11th!
Join us at the IBM Innovation Studio for a technical deep dive into the infra powering the next generation of LLM serving. 🧵
We'd like to announce that @kubernetes.io WG Serving has succeeded and will be disbanded! Thank you to everyone who has participated and contributed to the discussions and initiatives!
More details: groups.google.com/a/kubernetes...
In case you missed it, last week the llm-d community shipped the v0.5 release.
Check out the post from the llm-d project owners to learn more about all the features we've included in this release.
llm-d.ai/blog/llm-d-v...
Check out the February newsletter here: inferenceops.substack.com/p/state-of-the-model-ser...
Subscribe to get future issues in your inbox: https://inferenceops.substack.com/
🙏 Thanks to everyone who subscribed so far!
Kudos to all contributors to this edition!
Our goal with this newsletter is to give a clear, community-driven view of what's happening across the model serving ecosystem, including updates from vLLM, KServe, @llm-d.ai, @kubernetes.io, and Llama Stack.
This release is built on collaboration, from NIXL 0.9 merges to vLLM integrations.
We are building an open, hardware-agnostic inference control plane.
Ready to build? 🧱
GitHub: github.com/llm-d/llm-d
Website: llm-d.ai
Community Calls: Wed 12:30pm ET
In disaggregated serving, network congestion kills tail latency.
We've integrated the UCCL backend into NIXL, demonstrating 2.4x greater resilience to network contention than standard transports.
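For a feel of why that matters, here is a purely illustrative calculation (all sizes and bandwidths are made-up assumptions, not UCCL or NIXL measurements): in disaggregated serving the prefill-to-decode KV transfer sits on the time-to-first-token critical path, so bandwidth lost to contention shows up almost directly in tail latency.

```python
# Rough intuition only; assumed numbers, not a benchmark.
kv_cache_gb = 2.0             # KV cache to move per request (assumption)
link_gbps_idle = 100.0        # bandwidth with no contention (assumption)
link_gbps_congested = 20.0    # bandwidth left under heavy contention (assumption)


def transfer_ms(size_gb: float, gbps: float) -> float:
    # GB -> Gb, then seconds -> milliseconds
    return size_gb * 8 / gbps * 1000


print(f"idle link:      {transfer_ms(kv_cache_gb, link_gbps_idle):.0f} ms KV transfer")
print(f"congested link: {transfer_ms(kv_cache_gb, link_gbps_congested):.0f} ms KV transfer")
# A 5x bandwidth drop turns a ~160 ms handoff into ~800 ms, which lands directly
# in the request's TTFT unless the transport adapts to the contention.
```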