Hashtag: #VLLM
Multi-node vLLM on DGX Spark with Docker Compose Hey everyone, I recently picked up 2x DGX Spark systems and wanted to share a Docker Compose configuration I put together for running multi-node vLLM inference across them. My Setup Journey I’ve been looking for a new inferencing setup that could handle larger models. The 256GB total memory across 2x DGX Sparks was the easiest path to get there, and I finally got them from Central Computers this week. I already have an RTX 6000 system running llama.cpp, OpenWebUI, Langfuse, and Prometheus/Gra...

Running vLLM multi-node is finally reaching a practical level. Being able to stand up a cluster easily with Docker Compose is a big operational win.

Especially if you want to fully utilize resources like DGX hardware, the key is keeping inter-node communication overhead down. Engineers building distributed inference environments: if you've hit pitfalls with vLLM inter-node synchronization, let me know!

#AI #LLM #vLLM #DeepLearning #InfrastructureEngineering

forums.developer.nvidia.com/t/multi-node-vllm-on-dgx...
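The post's actual configuration is in the linked forum thread; as a rough sketch of the common pattern, multi-node vLLM usually runs on a Ray cluster, with one node starting `ray start --head` before launching `vllm serve` and the other joining via `ray start --address=<head-ip>:6379`. The image tag, model name, and parallelism values below are illustrative placeholders, not the author's setup:

```yaml
# Hypothetical head-node service for a two-node vLLM cluster (sketch only).
services:
  vllm-head:
    image: vllm/vllm-openai:latest
    network_mode: host   # Ray and NCCL need direct node-to-node reachability
    ipc: host
    command: >
      sh -c "ray start --head --port=6379 &&
             vllm serve meta-llama/Llama-3.1-70B-Instruct
               --pipeline-parallel-size 2
               --port 8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

The second machine would run a near-identical service that joins the Ray cluster instead of starting the head, and only the head node exposes the OpenAI-compatible API.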


vLLM Production Stack, Part 1: Core capabilities of vLLM. This article covers how to quickly get started with vLLM and the vLLM Production Stack...

#vLLM #vllm #production #stack #kubernetes #llm #vllm-production-stack #kv-cache



llm-d joins the CNCF llm-d has been officially accepted as a CNCF Sandbox project. This places the project under the Linux Foundation’s management and establishes an open standard for AI inferenc...

#Infrastructure #CNCF #Kubernetes #LLM #llm-d #vLLM


vLLM Inference Meetup · Boston · Luma Deep technical sessions. Live demos. Real conversations. If you're deploying or scaling LLM inference, this is the room to be in. Join Red Hat AI, IBM,…

Interested in #vLLM tech? #Boston area professionals have an event coming up at the end of the month. #workshop #meetup #LLM luma.com/4rmkrrb7


📄 The Setup
- Upload any #PDF → server converts each page to an image (PyMuPDF)
- Images are sent in parallel to #vLLM (continuous batching)
- The Vision LLM reads each page and returns clean Markdown
- Results stream back as NDJSON — no timeouts, even on 100+ page docs
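The flow above can be sketched in a few lines. The request shape follows the OpenAI-compatible chat API that vLLM serves; the model name, prompt text, and helper names here are illustrative placeholders, not the project's actual code:

```python
import base64
import json

def page_to_payload(page_png: bytes,
                    model: str = "Qwen/Qwen2-VL-7B-Instruct") -> dict:
    """Build one OpenAI-style chat request asking a vision LLM for Markdown.

    In the real pipeline, `page_png` would come from PyMuPDF, e.g.
    fitz.open(path)[i].get_pixmap().tobytes("png").
    """
    b64 = base64.b64encode(page_png).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page to clean Markdown."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

def to_ndjson_line(page_no: int, markdown: str) -> str:
    """One NDJSON record per finished page, streamed back to the client."""
    return json.dumps({"page": page_no, "markdown": markdown}) + "\n"
```

Sending one such payload per page lets vLLM's continuous batching interleave all pages, and streaming one NDJSON line per completed page is what avoids timeouts on long documents.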

Get started with consuming GPU-hosted large language models on Developer Sandbox | Red Hat Developer Learn the many ways you can interact with GPU-hosted large language models (LLMs) on Developer Sandbox, including connecting the model endpoints, interacting with the API endpoints using the hosted

Want to play with GPU-enabled LLMs? You should read this: developers.redhat.com/learn/ai/get-started-con... #redhat #ai #LLMs #kserve #vllm

Paged Attention - vLLM
```
float accs[NUM_ROWS_PER_THREAD];
for ... { // Iteration over different blocks.
    logits_vec = ...
    for ... { // Iteration over different rows.
        v_vec =
```

PagedAttention dramatically changed the memory efficiency of LLM inference. Solving memory fragmentation with the concept of virtual memory is a brilliant idea.

- Cuts KV-cache waste to near zero in theory
- Maximizes batch size via dynamic block allocation

A landmark technique that changed the design philosophy of inference engines.

#vLLM #LLM
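The dynamic block allocation the post praises boils down to bookkeeping like this toy sketch (illustrative only; vLLM's real block manager also handles freeing, sharing, and preemption):

```python
BLOCK_SIZE = 16  # tokens per physical KV block (vLLM's default)

class BlockTable:
    """Maps a sequence's logical token positions to physical block ids."""
    def __init__(self, free_blocks: list):
        self.free_blocks = free_blocks
        self.blocks = []      # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence --
        # instead of pre-reserving the whole max sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

free = list(range(1024))   # toy pool of free physical blocks
seq = BlockTable(free)
for _ in range(40):        # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
```

Because blocks are grabbed one at a time from a shared pool, the memory not yet needed by one sequence stays available to others, which is what lets the scheduler pack far more sequences into a batch.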


vLLM’s new PagedAttention slashes latency, cranks up GPU inference, and lets you batch continuously for production LLM workloads. Curious how it beats the OpenAI API? Dive in! #vLLM #PagedAttention #GPUInference

🔗 aidailypost.com/news/vllm-bo...

vLLM Triton Attention Backend Deep Dive This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend. Explore p...

Maintaining separate attention kernels for every GPU platform doesn't scale.

Hence, for the #vLLM #Triton #attention backend, we took a different approach: ~800 LoC Triton for NVIDIA and AMD GPUs, with SOTA performance on both.

📖 Deep dive: blog.vllm.ai/2026/03/04/v...

@pytorch.org #OpenSourceAI

vLLM 0.17 Ships FlashAttention 4 and Live MoE Scaling vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.


awesomeagents.ai/news/vllm-0-17-0-flashat...

#Vllm #Inference #OpenSource


Your own cloud LLM on 16 GB of VRAM, part 1: basic build, tools, and MCP. Hi, Habr! Amid all the hype around neu...

#langchain #langgraph #python #vllm #qwen3 #localai #selectel #MCP #AI-agents #API-service


It doesn't work out of the box: running the latest large LLMs. Lately an enormous number of super-large open models have appeared, and not just models but model makers. Variants of GLM, Kimi, and DeepSeek each occupy several rows in the top...


#Kimi-K2.5 #DeepSeek-v3.2 #GLM-5 #Qwen3.5 #vllm #B200



🚀 Docker Model Runner brings vLLM to macOS on Apple Silicon

vLLM, the leading inference engine, now runs on macOS thanks to vllm-metal.

www.docker.com/blog/docker-model-runner...

#vLLM #AppleSilicon #MLOps #Docker #RoxsRoss

LLM Hosting in 2026: Local, Self-Hosted & Cloud Infrastructure Compared Complete guide to LLM hosting in 2026. Compare Ollama, vLLM, Docker Model Runner, LocalAI and cloud providers. Learn cost, performance, and infrastructure trade-offs.

www.glukhov.org/llm-hosting/
#AI #LLM #hosting #Self-Hosting #SelfHosting #ollama #vllm #infrastructure


Set up an open source model with #Ollama or #vLLM, but unsure how to connect it to Claude Code?

Don't worry, we've got you covered 💪


Then run 'gpu llm run' from your terminal of choice, select whether you want to use #Ollama or #vLLM for inference and choose the model you want to use.

Here we're opting for the #Z.ai model GLM-4.7 Flash.

The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works A deep dive into PagedAttention, speculative decoding, FlashAttention, and continuous batching — the clever tricks that make modern LLMs respond in milliseconds instead of minutes.


techlife.blog/posts/llm-in...

#LLM #Inference #PagedAttention #vLLM #FlashAttention #SpeculativeDecoding #MachineLearning #GPUOptimization #KVCache

Synthetic | Run LLMs, privately Chat with open-source models privately

Where can you #Dev-test leading #OpenSource #LLM #AI models that aren't 'walled garden' and US-monitored #AmericanAI?
Synthetic.new has #PrivacyFirst runnable model choices like #KIMIK2-Thinking, #MiniMax2.1, #Qwen3, and more, with #vLLM support, usable in #OpenAI-compatible tools via #Roo, #Cline, and others.


Remote #GPU network volumes shouldn't require a config file, a cloud console, and 20 minutes of your life.

With GPU CLI, adding a volume is as simple as yes or no.

#Ollama #vLLM #ComfyUI


Today kicks off @jfokus.se in Stockholm 🇸🇪 and we just delivered our workshop on building with open source AI models using:

⚡️ #vLLM to serve local LLMs as a local API endpoint

🦜 @langchain4j.dev for adding LLM capabilities in our Java application

Was a huge hit! Slides ⬇️

"Training just burns money; the real money is made in inference": a pointed remark from the CEO of Inferact, which has secured a 200-billion-won war chest. Inferact, founded by the key people behind vLLM, has raised $150M and declared a paradigm shift in AI economics: the trillions poured into model training are ultimately just a cost, and the only moment AI delivers information to users and creates value is inference. AIPost key summary ✅

📉 "Is training just pouring water into a bottomless jar?"
Sharp words from Inferact, which raised 218.2 billion won in seed funding alone.

Inferact, built by the team behind vLLM, the pinnacle of open-source inference engines, has entered the arena. From here on, winners in the AI industry will be decided not by who has the bigger model but by who runs inference more efficiently.
www.aipostkorea.com/news/article...

#Inferact #vLLM #SimonMo #AIInfrastructure #InferenceEconomics #SeedFunding #a16z #TechTrends


Inferact raises $150M to commercialize vLLM, enhancing AI inference efficiency. Backed by Andreessen Horowitz & Lightspeed. #AI #Inference #TechFunding #vLLM #Inferact Link: thedailytechfeed.com/inferact-rai...


Andreessen Horowitz just pumped $150M into Inferact’s seed round, pushing its valuation to $800M. The startup’s open‑source vLLM engine could reshape AI model inference. Curious? Dive in. #Inferact #vLLM #SeedFunding

🔗 aidailypost.com/news/andrees...


Nice example of a production #vLLM setup on Nebius with Terraform, managed K8s, inference, and observability all in one place.

This can serve as a reference stack builders can use without reinventing the basics 💡.
👨🏻‍💻 full code on our repo.
github.com/CloudThrill/vllm-production-stack-terraform

1 0 0 0
Preview
vLLM Production Stack on Nebius K8s with Terraform🧑🏼‍🚀 - Cloudthrill This terraform stack delivers a production-ready vLLM serving environment On Nebius Cloud managed Kubernetes supporting Highly optimized GPU inference with operational best practices.

📢 New Terraform #vLLM Production Stack Across Clouds 🧑🏼‍🚀 | Part 4: Nebius Cloud 💚

🔎 What you'll deploy:
✅ Enterprise-grade GPU inference
✅ Secure vllm endpoints (LetsEncrypt)
✅ Full observability: Grafana + vLLM dashboards
✅ Lightning-fast deployment

👉 read the guide: tinyurl.com/Nebiusvllm

Open Responses: What you need to know We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Opinion: another step forward for scalable agentic workloads in 2026

#huggingface #vllm #openai #llm #ai #artificial-intelligence #langchain #llama-index #vllm #sglang


@jfokus.se is BACK for its 20th year and I’m so happy to be hosting a workshop on open source models & how to scale them up on #Kubernetes! We’ll feature projects including #vLLM + @langchain4j.dev + @promptfoo.bsky.social and more for enterprise AI deployment, app dev, and testing 🔥


[Translation] How prompt caching works: PagedAttention and automatic prefix caching, plus practic...

#prompt #caching #prefill #decoding #inference #LLM #vLLM #PagedAttention #prefix #caching #fragmentation
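The core idea behind the automatic prefix caching the article covers can be modeled in a few lines. This is a toy sketch, not vLLM's implementation: full KV blocks are keyed by a hash of all tokens up to and including that block, so sequences sharing a prompt prefix reuse the same physical blocks.

```python
BLOCK = 4  # tokens per KV block (vLLM's default is 16)

class PrefixCache:
    """Toy content-addressed KV-block cache."""
    def __init__(self):
        self.table = {}    # prefix-hash -> physical block id
        self.next_id = 0

    def blocks_for(self, tokens):
        """Return (physical block ids, number of blocks served from cache)."""
        ids, hits = [], 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = hash(tuple(tokens[:end]))   # hash covers the whole prefix
            if key in self.table:
                hits += 1                     # prefill for this block is skipped
            else:
                self.table[key] = self.next_id
                self.next_id += 1
            ids.append(self.table[key])
        return ids, hits

cache = PrefixCache()
a, hits_a = cache.blocks_for([1, 2, 3, 4, 5, 6, 7, 8])     # cold: 2 new blocks
b, hits_b = cache.blocks_for([1, 2, 3, 4, 5, 6, 7, 8, 9])  # reuses both blocks
```

Note that only full, block-aligned prefixes are cacheable; the trailing partial block (token 9 above) still has to be prefilled, which mirrors how block-granular prefix caching behaves in practice.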



🏆 Ranked #2 most-read in 2025 - #vLLM for Beginners (Key features)
2️⃣ Here's the most exhaustive list of vLLM features you wish you knew. 👇
📖 cloudthrill.ca/what-is-vllm...

Learn what makes #vllm the Rolls-Royce of inference in production ✨. #vLLM #AIForBeginners
