Whether you’re a researcher, builder, or just curious about AI’s cultural limitations, join this conversation!
Learn more: cohere.com/events/coher...
Ananya Sahu & Mehrnaz Mofakhami, research scholars at Cohere Labs, will explore:
🌏 How cultural awareness is currently tested in AI and what challenges remain
📜 What we’re learning from Tiny Aya and its support for underrepresented languages
💬 Your diverse perspectives and what you want to see next
AI is getting better at math. Better at code. But is it getting better at understanding cultural nuances? 🤔
Join us for “Cultural Awareness in AI — From Knowledge Tests to Social Norms and Beyond”, a conversation on what it means to build AI systems that work at global scale.
Does AI truly understand different cultures and languages?
We’re surveying cultural awareness in real-world AI use.
✨ When cultural awareness matters in real-world AI use
💡 Whether AI reflects diverse norms, communication styles & knowledge
🫥 Where AI falls short in cultural understanding
1) what? Cohere is here?!!!!
2) this is crazy
Woo hoo, who would have thought Canada would produce efficient massively multicultural models
🌱 Very proud of our team's latest release 😊 meet Tiny Aya, a massively multilingual model with 3.35B parameters.
Tech report here: github.com/Cohere-Labs/...
Tiny Aya is small enough to run on a phone and powerful enough to support 70+ languages. That unlocks offline translation, local education tools, community research, and real multilingual experimentation without cloud infrastructure. 📱
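To see why a 3.35B-parameter model is phone-sized, a quick back-of-the-envelope memory estimate helps (the byte-per-parameter figures below are standard precision sizes, not official Tiny Aya numbers, and they count weights only, not activations or KV cache):

```python
# Rough weight-memory footprint of a 3.35B-parameter model at common precisions.
PARAMS = 3.35e9

def weight_size_gb(bytes_per_param: float) -> float:
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return PARAMS * bytes_per_param / 1e9

for name, bpp in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name:>9}: ~{weight_size_gb(bpp):.1f} GB")
# fp16/bf16: ~6.7 GB, int8: ~3.4 GB, int4: ~1.7 GB
```

At int8 or int4 the weights fit comfortably in the RAM of a modern phone, which is what makes on-device, offline use plausible.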
Tiny Aya shows what smaller models can do. It improves on previous Aya releases and outperforms models of similar size, demonstrating that smart multilingual design can rival larger models and that focused multilingual research can beat brute-force scaling: achieving more with less.
Built for balance: most multilingual models skew toward high-resource languages, but Tiny Aya narrows that performance gap, sustaining stronger results for underrepresented languages. 📈
Despite being smaller, Tiny Aya competes with 4B models across translation, mathematical reasoning, understanding, and generation with especially strong gains for African languages. 🌍
We take a stance for language diversity. Going beyond the one-size-fits-all paradigm, we release not only an instruction-finetuned model balancing all 70 languages (Tiny Aya Global), but also three region-focused models 🌐
Introducing ✨Tiny Aya✨, a family of massively multilingual small language models built to run where people actually are.
Tiny Aya delivers strong multilingual performance in 70+ global languages in a 3.35B parameter model, efficient enough to run locally, even on a phone.
And Research Engineer @shivalika.bsky.social: The Leaderboard Illusion. 😶‍🌫️
This paper reveals systematic biases and transparency gaps in the Chatbot Arena leaderboard.
www.youtube.com/watch?v=URho...
Sr Research Scientist @juliakreutzer.bsky.social: the Treasure Hunt paper. 🗺️
This work introduces a method that improves model performance by adding markers to the pretraining data, enabling real-time targeting of the long tail at inference via these training-time markers.
www.youtube.com/watch?v=K3BU...
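The training-time-marker idea can be sketched roughly as follows (the tag format and metadata fields here are illustrative, not the paper's exact scheme):

```python
# Illustrative sketch of training-time markers: prepend metadata tags to each
# pretraining document so the model learns to associate tags with content.
# At inference, supplying a tag can steer generation toward that data slice.

def add_markers(doc: str, metadata: dict) -> str:
    """Prepend one marker per metadata field, e.g. '<lang=sw> <domain=chat> '."""
    tags = "".join(f"<{k}={v}> " for k, v in sorted(metadata.items()))
    return tags + doc

marked = add_markers("Habari za asubuhi.", {"lang": "sw", "domain": "chat"})
print(marked)  # <domain=chat> <lang=sw> Habari za asubuhi.
```

The point of sorting the fields is determinism: the same metadata always yields the same marker prefix, so the model sees a consistent signal during training.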
Excited to have two of our papers featured in
@j-novikova-nlp.bsky.social's @wiair.bsky.social podcast, as part of the NeurIPS reflection. ✨
Learn more / subscribe here women-in-ai-research.github.io and check out this thread 🧵 for our features...
What an incredible week it’s been at #NeurIPS2025! 🎉
Today is our last day at the booth. We've had a great week connecting with our community in San Diego.
Join our community to continue to connect with our research team: https://cohere.com/research/open-science/application
What's the story of your legend?
Join ML researchers building their legends with 40 cards that capture our shared journey—explore and build yours: https://lab-legends.vercel.app/ 🎯
Just 1 day left until #NeurIPS2025 kicks off! The Cohere and Cohere Labs teams are ready to dive into a packed week of research, conversations, and community at the San Diego Convention Center✨
Come visit our booth — we’d love to chat and send you home with some swag!
... @markusfreitag.bsky.social, Roman Grundkiewicz, @yupenghou.bsky.social, @phikoehn.bsky.social, @juliakreutzer.bsky.social, Saab Mansour, @sted19.bsky.social, Lorenzo Proietti, Parker Riley, Eduardo Sánchez, @patuchen.bsky.social, Mariya Shmatova, @zouharvi.bsky.social
You can find all details in our paper www2.statmt.org/wmt25/pdf/20... or discuss with us next week at the WMT Conference at #EMNLP2025.
Led by @kocmitom.bsky.social, Ekaterina Artemova, Eleftherios Avramidis, Eleftheria Briakou, @pinzhen.bsky.social, @mziizm.bsky.social...
⚖️ LLM-as-a-judge: mixed reliability.
Top systems reach ~95% pairwise accuracy on open-ended and summarization tasks.
Smaller ones barely beat a coin flip at ~55%.
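Pairwise accuracy here is simply how often the judge's preferred response matches the human-preferred one, which is why 50% is coin-flip territory. A toy computation (the labels are made up):

```python
# Pairwise accuracy of an LLM judge: fraction of response pairs where the
# judge picks the same winner as the human annotator. 0.5 ~ random guessing.

def pairwise_accuracy(judge_picks: list[str], human_picks: list[str]) -> float:
    assert len(judge_picks) == len(human_picks)
    agree = sum(j == h for j, h in zip(judge_picks, human_picks))
    return agree / len(human_picks)

# Toy example: judge agrees with humans on 3 of 4 pairs.
print(pairwise_accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.75
```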
🤖 Naturalness is still a significant challenge.
Across open-ended generation and cross-lingual summarization, the biggest weakness isn't coherence or accuracy but sounding like a native speaker. Many outputs still feel robotic or translated.
🧠 English isn’t always easiest.
Models like Gemini 2.5 Pro and Claude 4 sometimes did better in Korean, German, or Spanish than in English when solving reasoning tasks.
🧩 Linguistic reasoning remains the toughest nut. 🥥
Even top models scored below 50% on linguistic reasoning tasks, showing that structured linguistic deduction is still an open challenge.
🌐 Language coverage matters.
Models don’t support all languages equally, and this skews rankings. Smaller open models especially struggle with broad coverage, affecting their aggregate ranking ⚠️
🧩 Linguistic reasoning on unseen languages
📝 Open-ended generation testing naturalness and usefulness
📘 Cross-lingual summarization
🔁 Machine translation
🧑‍⚖️ LLM-as-a-Judge evaluating outputs of other models
All backed by human evals and public releases of data + outputs!
github.com/wmt-conferen...
How well do LLMs handle multilinguality? 🌍🤖
🔬 We brought the rigor of Machine Translation evaluation to multilingual LLM benchmarking and organized the WMT25 Multilingual Instruction Shared Task, spanning 30 languages and 5 subtasks.
River, Yinhong and I will all be in person and we look forward to the discussions!