Posts by A.V.

Chassis looks great now! But the single cooler solution seems like a bit of a limiting factor for fatter chips now. Still, if memory/ssd prices weren't insane, this would be a cool little machine.

Alas...

2 hours ago 2 0 0 0
A table titled "Kimi K2.6 vs K2.5," sub-headed "Generational lift & position among frontier." It compares the **Kimi K2.6** model against its predecessor (**K2.5**) and other frontier models including **GPT-5.4 xhigh**, **Gemini 3.1 Pro**, **Opus 4.6**, **Opus 4.7**, and **Mythos**.
The table highlights "Generational Lift" (Δ), which is the performance increase from K2.5 to K2.6.
### Key Sections
**1. Agentic • Search • Tool Use**
 * **Top Performance:** Kimi K2.6 shows massive gains in tool use, specifically **Toolathlon** (+22.2) and **MCPMark** (+26.4).
 * **Leaders:** Kimi leads in **DeepSearchQA accuracy** (83.0) and **WideSearch** (80.8). However, **Mythos** leads the HLE-Full w/ tools benchmark (64.7).
**2. Coding**
 * **Top Performance:** Kimi K2.6 shows a significant lift in **Terminal-Bench 2.0** (+15.9).
 * **Leaders:** **Opus 4.7** leads most coding categories, including **SWE-Bench Verified** (87.6) and **Terminal-Bench** (69.4). Kimi leads in **SWE-Bench Pro** (58.6).
**3. Reasoning & Knowledge**
 * **Top Performance:** High scores across the board, but the generational lift is smaller (e.g., **AIME 2026** only moved +0.6).
 * **Leaders:** **GPT-5.4** leads in **AIME 2026** (99.2) and **HMMT 2026** (97.7). **Mythos** leads **HLE-Full (no tools)** at 56.8.
**4. Vision**
 * **Top Performance:** The largest single gain in the chart is **BabyVision w/ python**, where Kimi K2.6 improved by +28.0 points over K2.5.
 * **Leaders:** **Gemini 3.1 Pro** leads **MMMU-Pro** (83.0), while **GPT-5.4** leads **MathVision** (92.0) and **V* w/ python** (98.4).
### Biggest Generational Lifts (K2.5 → K2.6)
| Benchmark | K2.5 | K2.6 | Lift (Δ) | Category |
|---|---|---|---|---|
| **BabyVision w/ python** | 40.5 | 68.5 | **+28.0** | Vision (Python-augmented) |
| **MCPMark** | 29.5 | 55.9 | **+26.4** | Agentic (Tool orchestration) |
| **Toolathlon** | 27.8 | 50.0 | **+22.2** | Agentic (Long-horizon tools) |
| **APEX-Agents** | 11.5 | 27.9 | **+16.4** | Ag…

mythos vs opus 4.7 vs cursor composer vs K2.6 on non-cherry-picked benchmarks

result: yup, still looking good

1 day ago 7 2 0 0

benchmark scores are truly impressive. hope kimi doesn't stop.

1 day ago 0 0 0 0

Kimi 2.6 is now available on @hf.co 🔥🎉
huggingface.co/moonshotai/K...

✨ 1T MoE / 32B active / 256K context
✨ Agent Swarm: 300 sub-agents × 4,000 steps
✨ Modified MIT

1 day ago 30 6 2 0

yup, they do sound different. there's probably no objective best, unfortunately, you just gotta roll with the sound signature you like (and the featureset, if it's wireless).

2 days ago 1 0 0 0

roko's basilisk hits different in the agent era

2 days ago 11 0 0 0

more of a leaning desk, really.

3 days ago 1 1 0 0

the cthulhu claude logo, the scary name and the product being an attempt at a hive mind.

I kinda like the combo, shame about the anthropic brown™

4 days ago 4 0 0 1

Qwen3.6 35B-A3B can now be run locally! 💜

The model is the strongest mid-sized LLM on nearly all benchmarks.

Run on 23GB RAM via Unsloth Dynamic GGUFs.

GGUFs to run: huggingface.co/unsloth/Qwen...
Guide: unsloth.ai/docs/models/...

5 days ago 30 6 0 0
Introducing Claude Opus 4.7
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

OPUS 4.7 IS HERE!!!! BETTER VISION!!!!! NEW REASONING BUDGET!!! 64% ON SWE PRO!!

5 days ago 47 1 2 3

a cookie for honesty 🍪

1 week ago 0 0 0 0

MYTHOS SYSTEM CARD PREVIEW!!!

www-cdn.anthropic.com/53566bf5440a...

2 weeks ago 14 2 1 1

MYTHOS CONFIRMED!!!!!!

2 weeks ago 101 7 1 7

the new shaders are really something

2 weeks ago 4 0 1 0

Claude made me a chart comparing benchmarks for the larger Gemma 4 models against similar Qwen3.5 ones

2 weeks ago 18 3 0 0

A new Anthropic paper argues for functional emotions in LLMs, claiming a causal link between emotional representations and model behavior. transformer-circuits.pub/2026/emotion...

2 weeks ago 61 5 0 5

classifying the new composer tech report as a must read cursor.com/resources/Co...

3 weeks ago 14 2 2 0

ah, I didn't expect this to actually be about harry, my bluesky isolation must be exceptional. sorry you have to not care so hard...

3 weeks ago 1 0 1 0

what did harry do this time...

3 weeks ago 0 0 1 0

thanks, this already looks a bit more promising, but there's room for growth. the polish elevenlabs really does look good here.

3 weeks ago 0 0 1 0

thanks for the effort! it's a mark against tilde that there's no more convenient way to get at this resource...

3 weeks ago 0 0 2 0

thanks! I have to say, though, that hugo.lv is an ancient project (especially in the AI era), and without additional funding nothing there has really been updated, for sure. I'd put more hope in tilde's website: tilde.ai/lv/speech-to...

3 weeks ago 0 0 1 0

where's tilde?
they also have speech to text, it would be interesting to see a comparison with the others, cool test

3 weeks ago 0 0 1 0

Inspired by the man who built a personalized cancer vaccine for his dog, I’ve written an open-source guide to DIY mRNA vaccine production:
philfung.github.io/openvaxx

3 weeks ago 35 7 1 3

as a naive claudecel, why do you, uhh, not just use codex instead then

3 weeks ago 2 0 1 0
The cuTile Rust Book — cuTile Rust

I was impressed by the docs page as well: nvlabs.github.io/cutile-rs/in...

3 weeks ago 0 0 0 0

Btw, nvidia published cutile-rs not so many days ago, a Rust version of cuTile DSL for programming cuda kernels from inside Rust. Only a research project, but it looks very cool and has quite a few features already.
github.com/NVlabs/cutil...

3 weeks ago 0 0 1 0

thx claude for explanation, I was lost there for a second, ngl.

3 weeks ago 8 0 1 0

LongCat-Next 🐱 A multimodal foundation model released by Meituan

huggingface.co/meituan-long...

✨ 74B total - 3B active - MIT
✨ One token space for all modalities
✨ DiNA paradigm for unified learning
✨ Seeing, creating, talking all in one

3 weeks ago 12 2 0 0

but it's clear by now that if we ever get to space properly, it'll be much weirder than the mainstream space truckers.

4 weeks ago 2 0 0 0