#flashattention
Did a single Google paper crash the memory giants? If you don't understand the "GPU memory wall," how can you work as an engineer in the AI era? - Tony Bai
Permalink: https://tonybai.com/2026/03/28/ai-engineer-gpu-introduction-course
Hi everyone, I'm Tony Bai. Something highly dramatic just happened in the tech world: when US markets opened this Wednesday, the giants of the global storage industry, Micron and Western Digital...

#技术志 #AIModel #AI模型 #ArtificialIntelligence #AttentionMechanism #ComputeBound #ComputingPower #CUDA #FlashAttention #FP8 #Go


Turns out bigger CUDA tiles can actually slow down Flash Attention – TFLOPS drop 18‑43% across sequence lengths. See how kernel tweaks and compute efficiency matter for NVIDIA GPUs and transformer models. #FlashAttention #CUDATiles #GPUPerformance

🔗 aidailypost.com/news/large-c...
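For context on where such TFLOPS numbers come from: you time the kernel over repeated calls and divide an analytic FLOP count by the elapsed time. A minimal sketch of that measurement, using PyTorch's built-in scaled_dot_product_attention as a stand-in kernel (the shapes, warm-up scheme, and the 4·B·H·S²·D FLOP estimate are assumptions here, not the linked article's setup):

```python
import time
import torch
import torch.nn.functional as F

def attention_tflops(batch, heads, seq_len, head_dim, iters=20):
    """Rough TFLOPS estimate for one attention configuration."""
    q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    for _ in range(3):  # warm-up so timing excludes kernel compilation/caching
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    # Forward pass is two S x S matmuls per head (QK^T and PV), ~2*S*S*D multiply-adds each.
    flops = 4 * batch * heads * seq_len * seq_len * head_dim
    return flops / elapsed / 1e12

for s in (1024, 2048, 4096, 8192):  # sweep sequence lengths, as the linked benchmark does
    print(s, round(attention_tflops(2, 16, s, 64), 1))
```

A drop in this figure as tile sizes grow points at lost occupancy or cache pressure rather than extra algorithmic work, which is the effect the post describes.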

Together AI Celebrates Major Achievements at Its Inaugural AI Native Conference
Together AI celebrated significant advancements at the AI Native Conf, showcasing breakthroughs in AI infrastructure and research. Join the AI revolution!
#USA #San_Francisco #Together_AI #AI_Native_Conf #FlashAttention

The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works
A deep dive into PagedAttention, speculative decoding, FlashAttention, and continuous batching — the clever tricks that make modern LLMs respond in milliseconds instead of minutes.

techlife.blog/posts/llm-in...

#LLM #Inference #PagedAttention #vLLM #FlashAttention #SpeculativeDecoding #MachineLearning #GPUOptimization #KVCache
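For a feel of the first of those tricks before the deep dive: PagedAttention stores the KV cache in fixed-size blocks and maps each sequence's logical token positions to physical blocks through a per-sequence block table, much like virtual-memory page tables. A toy sketch of that bookkeeping (the block size, class, and method names are illustrative assumptions, not vLLM's actual code):

```python
BLOCK_SIZE = 16  # tokens per physical KV-cache block (a typical choice; purely illustrative)

class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of unused physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve a slot for the sequence's next token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.seq_lens.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:  # current block full (or first token): grab a fresh block
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):  # 20 tokens span two 16-token blocks
    block, offset = cache.append_token(seq_id=0)
print(cache.block_tables[0])  # e.g. [7, 6]: blocks need not be contiguous
```

Because finished sequences hand their blocks back to the pool, fragmentation and over-allocation stay low, which is what lets a server pack more concurrent sequences into the same GPU memory.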


Speeding up LLMs to the max: how I built a cross-platform Flash Attention with support for Turing+ architectures, and more ...

#машинное #обучение #transformers #трансформеры #внимание #attention #flashattention #triton #большие #языковые #модели


New update: PyTorch + NVIDIA BioNeMo now support attn_input_format for flash‑attention scaling. Faster ESM3 runs, cu_seq_lens_q tweaks, and smoother Hugging Face integration. Dive in to see how Transformer Engine gets a boost! #PyTorch #NVIDIA #flashattention

🔗 aidailypost.com/news/pytorch...
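The cu_seq_lens_q mentioned here follows the flash-attention "varlen" convention: variable-length sequences are packed back to back into one tensor and described by cumulative-length offsets instead of padding. A minimal sketch of building those offsets and calling the packed kernel (flash_attn_varlen_func is from the flash-attn package; the BioNeMo / Transformer Engine attn_input_format plumbing from the post is not shown, and the shapes are assumptions):

```python
import torch
from flash_attn import flash_attn_varlen_func  # pip install flash-attn

seq_lens = torch.tensor([5, 9, 2], device="cuda")  # three ragged sequences
total, heads, head_dim = int(seq_lens.sum()), 8, 64

# cu_seqlens = [0, 5, 14, 16]: prefix sums marking where each sequence starts and ends
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)

# Packed (total_tokens, heads, head_dim) layout: no padding tokens anywhere
q = torch.randn(total, heads, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seq_lens.max()), max_seqlen_k=int(seq_lens.max()),
    causal=True,
)
print(out.shape)  # (16, 8, 64): same packed layout as the inputs
```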

Nathan's Blog

The linked post traces the execution from the PyTorch entry function, through the launcher's setup (grid and block sizes), to the highly optimized Triton JIT kernel code.

#FlashAttention #Triton #LLMs #GPUKernel #DeepLearning
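That trace follows the standard Triton pattern: a plain PyTorch-facing function computes the launch grid, then hands off to a @triton.jit kernel that does the block-level work. A self-contained example of the same three stages, using vector addition rather than the attention kernel the post walks through:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Stage 3: the JIT kernel; each program instance handles one BLOCK_SIZE-wide tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    # Stage 1: the PyTorch-facing function the user calls.
    out = torch.empty_like(x)
    n = x.numel()
    # Stage 2: the launcher's setup; the grid sizes itself from the block size.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
print(torch.allclose(add(x, x), x + x))  # True
```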


⚡ Universal Metal #FlashAttention on 🍏 #AppleSilicon — 1.14–1.48x faster image training vs #PyTorch, 25–40% memory savings with FP32 💾
🔗 Link in first 💬⤵️

Repost 🔁 #AI #LLM #RAG #MPS

vLLM for Beginners: Key Features & Performance Optimization (Part II) - Cloudthrill
In this series, we aim to provide a solid foundation in vLLM core concepts, to help you understand how it works and why it's emerging as a de facto choice for LLM deployment.

🚀#NewBlog #vllm🔥
vLLM for Beginners Part 2: 📖 Key Features & Optimizations
💎 What makes #vLLM the Rolls-Royce of inference?
👉check it out: cloudthrill.ca/what-is-vllm...

#PagedAttention #PrefixCaching #ChunkedPrefill
#SpeculativeDecoding #FlashAttention #lmcache
✅ Tensor & #PipelineParallelism
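Of the features listed above, speculative decoding is perhaps the simplest to sketch: a cheap draft model proposes a few tokens, the large target model checks them, and proposals are kept only up to the first disagreement. A greedy-match toy version (real engines score all k positions in one target forward pass and use a probabilistic acceptance rule; the draft/target callables here are stand-ins):

```python
def speculative_step(prefix, draft, target, k=4):
    """One round of draft-k-then-verify greedy speculative decoding (toy version)."""
    proposals, ctx = [], list(prefix)
    for _ in range(k):  # cheap model proposes k tokens autoregressively
        tok = draft(ctx)
        proposals.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposals:  # expensive model verifies the proposals in order
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: substitute the target's token
            break
        accepted.append(tok)
        ctx.append(tok)
    return prefix + accepted  # always gains >= 1 token per target pass

# Toy models: draft guesses "previous + 1"; target agrees except right after a 3.
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: 0 if ctx[-1] == 3 else ctx[-1] + 1
print(speculative_step([1], draft, target))  # [1, 2, 3, 0]: accepted 2 and 3, then corrected
```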

FramePack generation too slow? One-click install Sage Attention and speed up generation by nearly 30% on average! – 萌芽綜合天地
Still enduring the pain of waiting ten-plus minutes for every second of animation FramePack generates? The problem isn't your hardware; it's that you haven't enabled the powerful acceleration module, Sage Attention! With the one-click installer created by @FlowDownTheRiver, you can easily integrate key modules such as Sage Attention, xformers, and Flash Attention without manually modifying any code, instantly unlocking your GPU's sealed-away performance,

🆕 mnya.tw/cc/word/2461...
#FramePack #生成太慢 #一鍵安裝 #一鍵 #安裝 #SageAttention #生成速度 #生成 #速度 #平均 #提升近30趴 #生成速度提升 #速度提升 #xformers #FlashAttention #軟體應用 #人工智慧 #AI影片 #AI動畫 #AI #影片 #動畫


A vaccine… against cancer? The briefest possible…

habr.com/ru/articles/883062/

#онковакцина #иммунитет #FlashAttention #дендритные #клетки #нео-антигены #CAR-T #технология

Boosting AI Performance with GPU-Aware Diagrammatic Framework 🚀📊✨
Researchers introduced a diagrammatic framework to optimize deep learning algorithms, improving GPU memory efficiency and achieving breakthroughs in performance on advanced architectures like Ampere a...
www.azoai.com/news/2024121... #DeepLearning #AIOptimization #GPUEfficiency #FlashAttention #HopperArchitecture #AmpereGPU #TensorCores #AIResearch #TechInnovation #HighPerformance @arxiv-stat-ml.bsky.social
