NazoNazo Benchmark Evaluates Insight Reasoning in Large Language Models
The NazoNazo benchmark tests insight reasoning with Japanese riddles; humans scored 52.9% accuracy on a set of 120 riddles, and only GPT‑5 came close to that level. getnews.me/nazonazo-benchmark-evalu... #nazonazo #llmevaluation #riddles