#nocha

We have updated #nocha (a long-context benchmark measuring how well models process book-length narratives) with #Llama4 Scout. Sadly, its performance was below random, far lower than the model's reported performance on retrieval tasks (needle in a haystack). novelchallenge.github.io

Leaderboard showing performance of language models on claim verification task over book-length input. o1-preview is the best model with 67.36% accuracy followed by Gemini 2.5 Pro with 64.17% accuracy.

We have updated #nocha, a leaderboard for reasoning over long-context narratives 📖, with some new models, including #Gemini 2.5 Pro, which shows massive improvements over the previous version! Congrats to the #Gemini team 🪄 🧙 Check 🔗 novelchallenge.github.io for details :)


We've added #o1 and #Llama 3.3 70B to the #Nocha leaderboard for long-context narrative reasoning! Surprisingly, o1 performs worse than o1-preview, and Llama 3.3 70B matches proprietary models like gpt4o-mini & gemini-Flash. Check out our website for more results! More in 🧵

NoCha leaderboard

Also, if you wanna read more about #nocha, check out our website: novelchallenge.github.io

Image showing the prompt token count according to the tokenizer (tiktoken), which is 117,609 tokens, versus what the OpenAI API claims it to be: 125,385 tokens. That is roughly 7,800 extra tokens added from an unknown source.

I really wanted to run the NEW #nocha benchmark claims on #o1, but it won't behave 😠
- 6k reasoning tokens is often not enough to get an answer, and allowing more means only short books fit in context
- OpenAI adds something to the prompt: ~8k extra tokens -> less room for book + reasoning + generation!
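The token overhead can be checked directly from the two counts in the screenshot above. A minimal sketch (the numbers are the ones from the image; the tiktoken call in the comment shows how such a local count is typically obtained, assuming the `o200k_base` encoding):

```python
# Local count from the tokenizer, e.g.:
#   import tiktoken
#   local_count = len(tiktoken.get_encoding("o200k_base").encode(prompt))
local_count = 117_609

# Count the OpenAI API reports back in response.usage.prompt_tokens
api_count = 125_385

# Hidden tokens the API adds on top of the raw prompt
overhead = api_count - local_count
print(overhead)  # 7776
```

So the gap is 7,776 tokens, close to the ~8k figure above; for a fixed context window, every hidden token is one fewer token available for the book, the reasoning, and the generated answer.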
