#nocha

We have updated #nocha (a long-context benchmark measuring how well models process book-length narratives) with #Llama4 Scout. Sadly, its performance was below random, far lower than the model's reported performance on retrieval tasks (needle in a haystack). novelchallenge.github.io

Leaderboard showing performance of language models on claim verification task over book-length input. o1-preview is the best model with 67.36% accuracy followed by Gemini 2.5 Pro with 64.17% accuracy.

We have updated #nocha, a leaderboard for reasoning over long-context narratives 📖, with some new models, including #Gemini 2.5 Pro, which shows massive improvements over the previous version! Congrats to the #Gemini team 🪄 🧙 Check 🔗 novelchallenge.github.io for details :)


We've added #o1 and #Llama 3.3 70B to the #Nocha leaderboard for long-context narrative reasoning! Surprisingly, o1 performs worse than o1-preview, and Llama 3.3 70B matches proprietary models like gpt4o-mini & gemini-Flash. Check out our website for more results! More in 🧵

NoCha leaderboard

Also, if you wanna read more about #nocha, check out our website: novelchallenge.github.io

Image showing the prompt token count according to the tokenizer (tiktoken), which is 117,609 tokens, versus what the OpenAI API claims it to be: 125,385 tokens. That is roughly 7,800 extra tokens added from an unknown source.

I really wanted to run the NEW #nocha benchmark claims on #o1, but it won't behave 😠
- 6k reasoning tokens is often not enough to get an answer, and allowing more means only short books fit in context
- OpenAI adds something to the prompt: ~8k extra tokens -> less room for book + reasoning + generation!
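The token overhead can be checked directly from the two counts in the screenshot above. A minimal sketch (the numbers are the ones from the image; the tiktoken call in the comment shows how such a local count is typically obtained, assuming the `o200k_base` encoding):

```python
# Local count from the tokenizer, e.g.:
#   import tiktoken
#   local_count = len(tiktoken.get_encoding("o200k_base").encode(prompt))
local_count = 117_609

# Count the OpenAI API reports back in response.usage.prompt_tokens
api_count = 125_385

# Hidden tokens the API adds on top of the raw prompt
overhead = api_count - local_count
print(overhead)  # 7776
```

So the gap is 7,776 tokens, close to the ~8k figure above; for a fixed context window, every hidden token is one fewer token available for the book, the reasoning, and the generated answer.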
