#BullshitBenchmark hashtag - Bluesky

Jesus Castagnetto 🇵🇪 (@jmcastagnetto@mastodon.social)

@jmcastagnetto.bsky.social

3 weeks ago

BullshitBench v2: Claude and Qwen Are the Only Models That Push Back - Adam Holter

BullshitBench v2 is out. Peter Gostev tested 70+ model variants across 100 questions spanning coding, medical, legal, finance, and physics. The benchmark measures one specific thing: whether a model w...

A cool test of how much different #AI models #hallucinate: the #BullshitBenchmark

The #Claude and #Qwen models seem to push back more when confronted with nonsensical questions. #OpenAI models do poorly.

Blog post: adam.holter.com/bullshitbenc...
Results: petergpt.github.io/bullshit-ben...

#LLM