Posts by Embedded LLM

vLLM Blog Alert! vLLM introduces PTPC-FP8 quantization on AMD ROCm, delivering near-BF16 accuracy at FP8 speeds. Run LLMs faster on @AMD MI300X GPUs – no pre-quantization required!
Get started: pip install -U vllm, then add --quantization ptpc_fp8 when serving.
Full details: blog.vllm.ai/2025/02/24/p...
[1/2]

Why PTPC-FP8 rocks:
- Per-Token Activation Scaling: each token gets its own scaling factor
- Per-Channel Weight Scaling: each weight column (output channel) gets its own scaling factor
Delivers FP8 speed with accuracy closer to BF16 – the best FP8 option for ROCm! [2/2]
Recap 2024, we've embraced open-source, contributing to vLLM with 211 PRs, 65K+ LOC, and expanded VLM support. Launched #JamAIBase, an AI spreadsheet with 620+ stars, and on 🤗 we have 1.75M+. Collaborated with Liger Kernel & infinity for AMD GPU support. Let's make 2025 even more impactful together!
🚀 Liger-Kernel is making waves! Check out the latest LinkedIn Eng blog post on how Liger improves #LLM training efficiency with Triton kernels.
20% throughput boost & 60% memory reduction for models like Llama, Gemma & Qwen with just one line of code! Works on AMD!
www.linkedin.com/blog/enginee...
🔥 Big thanks to Michael Feil for the epic collab on supercharging embedding & reranking on AMD GPUs with Infinity♾!
Check out the guide on 🤗 Hugging Face for how to leverage this high throughput embedding inference engine!
huggingface.co/blog/michael...
vLLM now supports running GGUF models on AMD Radeon GPUs, with impressive performance on the RX 7900 XTX. It outperforms Ollama at batch size 1: 62.66 tok/s vs 58.05 tok/s.
Check it out: embeddedllm.com/blog/vllm-no...
What's your experience with vLLM on AMD? Any features you want to see next?
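For reference, serving a GGUF checkpoint with vLLM follows this pattern (the TinyLlama file below is just an example model; point --tokenizer at the original Hugging Face repo, since GGUF files don't ship a full tokenizer config):

```shell
# Download a GGUF checkpoint (example; any GGUF file works the same way)
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# Serve it with vLLM, using the source repo's tokenizer
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```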
🚨 GPUs wasting 75% of training time on communication 🤯 Not anymore!
DeepSpeed Domino, with a new tensor parallelism engine, minimizes communication overhead for faster LLM training. 🚀
✅ Near-complete communication hiding
✅ Multi-node scalable solution
Blog: github.com/microsoft/De...
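The communication-hiding idea can be illustrated with a toy pipeline: while the collective for chunk i is still in flight, start computing chunk i+1. This is only a conceptual sketch in plain Python threads — Domino's actual engine overlaps GPU kernels with NCCL/RCCL collectives — and `local_compute`/`all_reduce` are stand-in functions, not DeepSpeed APIs.

```python
import threading
import queue
import time

def local_compute(chunk):
    # Stand-in for the per-rank GEMM work on one chunk.
    return [x * 2 for x in chunk]

def all_reduce(partial):
    # Stand-in for the tensor-parallel collective (the slow part).
    time.sleep(0.01)
    return sum(partial)

def overlapped(chunks):
    """Compute chunk i+1 while chunk i's collective is in flight."""
    results, out = [], queue.Queue()
    comm = None
    for chunk in chunks:
        partial = local_compute(chunk)       # overlaps with prior comm
        if comm is not None:
            comm.join()                      # previous collective done
            results.append(out.get())
        comm = threading.Thread(
            target=lambda p=partial: out.put(all_reduce(p)))
        comm.start()
    if comm is not None:
        comm.join()
        results.append(out.get())
    return results
```

With enough chunks, every collective except the last runs behind compute — which is what "near-complete communication hiding" refers to.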
Liger Kernels v0.4.0 ROARS onto AMD GPUs! 🚀 Faster LLM training, less memory, LONGER context lengths! Check out the benchmarks! embeddedllm.com/blog/cuda-to...
[Image: Pixtral Large benchmarks, via @hotaisle.bsky.social]
🔥 Pixtral Large is now supported on vLLM! 🔥
Run Pixtral Large with multiple input images from day 0 using vLLM.
Install vLLM:
pip install -U vllm
Run Pixtral Large:
vllm serve mistralai/Pixtral-Large-Instruct-2411 --tokenizer_mode mistral --limit_mm_per_prompt 'image=10' --tensor-parallel-size 8
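Once the server is up, multi-image prompts go through the OpenAI-compatible chat endpoint: one text part plus one image_url part per image, capped by --limit_mm_per_prompt ('image=10' above). A small sketch of building such a request payload (the helper name is ours, not a vLLM API):

```python
def build_multi_image_request(prompt, image_urls,
                              model="mistralai/Pixtral-Large-Instruct-2411"):
    # One text part followed by one image_url part per image.
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}}
                for u in image_urls]
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}
```

POST this dict as JSON to /v1/chat/completions on the serving host.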
New Models:
- Idefics3 (VLM)
- H2OVL-Mississippi (VLM for OCR/docs!)
- Qwen2-Audio (Audio LLM)
- FalconMamba
- Florence-2 (VLM)
Plus new encoder-only embedding models like BERT, RoBERTa, XLM-RoBERTa.