Posts by Kaggle
An event thumbnail for a Kaggle Live Q&A session titled "Measuring Progress Toward AGI - Cognitive Abilities Hackathon." The text announces the event will take place live on April 1st at 10 AM PT (GMT-7). The Kaggle logo is in the top right corner, and the design features bright blue, green, and yellow abstract wavy shapes in the corners.
Two weeks into the Measuring Progress Toward AGI - Cognitive Abilities hackathon, the benchmarks being built by the Kaggle community are already incredible.
Nick Kango from the Kaggle team and the authors of the paper the hackathon is based on, Dr Ryan Burnell and Oran Kelly, are going LIVE to talk with you.
Paper Track: Document your approach and contribute to AI generalization research. www.kaggle.com/competitions...
ARC-AGI-3: Tackle a harder interactive benchmark that requires exploration and multi-step reasoning. www.kaggle.com/competitions...
ARC-AGI-2: Predict outputs for novel reasoning tasks your system has never seen. www.kaggle.com/competitions...
💰 $2M Prize Pool
⏰ Entry Deadline: October 26, 2026
Develop approaches that learn quickly, generalize well, and solve problems never seen before.
Compete in one or all three ARC Prize 2026 competitions to help move AI closer to systems that learn like people do: flexible, efficient, and ready for new challenges.
Real intelligence isn't about memorizing answers - it's knowing what to do when the problem changes.
Most benchmarks reward pattern recognition, not genuine problem-solving. ARC Prize 2026, in partnership with ARC Prize, challenges you to build adaptive AI through three connected competitions.
We’re opening up the Kaggle toolbox to everyone. 🛠️
Today, we’re launching Community Hackathons - a free, self-serve way for you to host your own AI challenges. Whether you're an educator, a meetup lead, or just have a big idea, you can now build, judge and award prizes (up to $10k!).
All challenge compute runs on Google Cloud G4 VMs with NVIDIA RTX 6000 Blackwell GPUs. This provides the memory and speed needed for LoRA fine-tuning and scaling inference.
The G4 VMs are available to help you iterate quickly on your reasoning models.
Participants will start with a Nemotron-3 Nano baseline and a novel reasoning benchmark from NVIDIA Research. The goal is to develop techniques that push the boundaries of reasoning accuracy using open models.
💰 $106,388 Prize Pool
⏰ Entry Deadline: June 8, 2026
Reasoning benchmarks are vital for measuring progress on structured tasks, and when we share methods openly, the entire community moves faster.
To put this into practice, we’re excited to announce the NVIDIA Nemotron Model Reasoning Challenge hosted by NVIDIA and powered by Google Cloud Partners.
Learn more about the hackathon: www.kaggle.com/competitions...
The challenge is to design Kaggle Benchmarks that test how frontier AI models reason, learn, and make decisions, going beyond pattern recognition and memorization.
💰 $200,000 Prize Pool
⏰ Final Submission Deadline: Apr 16, 2026
Earlier today, Google DeepMind released a new paper proposing a scientific framework for measuring the cognitive abilities of AI systems on the path to AGI.
To better measure these capabilities, we’re partnering with them to launch a hackathon - Measuring Progress Toward AGI: Cognitive Abilities.
📢 Exciting News!
You can now receive notifications for Benchmarks on Kaggle! 🔔
You can now follow a benchmark to stay updated with alerts for new benchmark versions, new models added on leaderboards, and notifications for benchmark owners when new models are available to run.
📣 Competition Launch Alert!
BirdCLEF+ 2026 hosted by @cornellbirds.bsky.social
🎯 Identify species from real-world audio
💰 $50,000 Prize Pool
⏰ Entry Deadline: May 27, 2026
🙏 TU Chemnitz & Google DeepMind
Learn more at www.kaggle.com/competitions...
📣 Competition Launch Alert! Our 12th annual March ML Mania competition is here!
🎯 Forecast the outcomes of the 2026 NCAA basketball tournaments by predicting the probabilities of every possible matchup
💰 $50,000 Prize Pool
⏰ Final Submission: March 19th, 2026
www.kaggle.com/competitions...
Explore the new Four-in-a-Row leaderboard 👇
www.kaggle.com/benchmarks/k...
To make this a true reasoning test:
• Models have no access to minimax solvers or precomputed game trees
• Every move must be justified in natural language before it’s executed
• The deterministic rules eliminate ambiguity
This isolates structured planning and spatial consistency.
The challenge isn’t knowing the rules.
Models must navigate a 7×6 grid, account for gravity (pieces fall vertically), anticipate diagonal and vertical threats, and plan several moves ahead, all through text alone.
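To make the board mechanics concrete, here is a minimal Python sketch of a text-only 7×6 grid with gravity and four-in-a-row detection. It is an illustration only, not the Game Arena implementation: the function names (`new_board`, `drop`, `has_four`, `render`) and the `.`/`X`/`O` text encoding are assumptions made for this example.

```python
ROWS, COLS = 6, 7  # 7 columns by 6 rows, matching the 7x6 grid in the post

def new_board():
    """Return an empty board; '.' marks an empty cell (hypothetical encoding)."""
    return [["." for _ in range(COLS)] for _ in range(ROWS)]

def drop(board, col, piece):
    """Drop a piece into a column. Gravity places it in the lowest empty row."""
    for row in range(ROWS - 1, -1, -1):  # scan from the bottom row upward
        if board[row][col] == ".":
            board[row][col] = piece
            return row
    raise ValueError(f"column {col} is full")

def has_four(board, piece):
    """Check horizontal, vertical, and both diagonal lines for four in a row."""
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in directions:
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(
                    0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == piece
                    for rr, cc in cells
                ):
                    return True
    return False

def render(board):
    """Render the board as plain text, the only view a text-only player gets."""
    return "\n".join(" ".join(row) for row in board)

if __name__ == "__main__":
    board = new_board()
    for col in (3, 3, 3, 3):
        drop(board, col, "X")
    print(render(board))
    print("X has four in a row:", has_four(board, "X"))
```

Even in this small sketch, a player has to track where each piece lands after gravity and scan four directions for threats, which is the state a model must keep consistent from text alone.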
A Kaggle "Game Arena Four in a Row Leaderboard" comparing 10 AI models. The table ranks models by (Internal) Elo, Average Output Tokens, and Average Inference Cost. Top Performer: Gemini 3 Pro Preview (477 Elo, 11.90¢ per turn). Runner Up: GPT-5.2 (450 Elo, 14.27¢ per turn). Mid-Range: o3 (313 Elo), Grok 4 (313 Elo), and Gemini 3 Flash Preview (312 Elo). Efficiency Leader: DeepSeek V3.2 ranks 10th (0 Elo) but features the lowest cost at 0.33¢ per turn. The footer notes that ratings use the Bradley-Terry algorithm based on 80 games per model pair.
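The leaderboard footer says ratings come from the Bradley-Terry algorithm. As a rough sketch of how pairwise results can become ratings, the Python below fits Bradley-Terry strengths with the standard iterative (minorization-maximization) update and maps them to an Elo-like scale. The win counts are made-up illustration data, and the 400·log10 scaling and the shift that sets the lowest rating to 0 are assumptions, not Kaggle's documented method.

```python
import math

def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a win matrix.

    wins[i][j] is the number of games model i won against model j.
    Draws are ignored in this sketch, and every model needs at least one win
    (a winless model would get strength 0, which has no finite log rating).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(updated)  # renormalize so strengths stay comparable
        p = [x * scale for x in updated]
    return p

def to_elo_scale(strengths):
    """Map strengths to an Elo-like scale (assumed: 400*log10, lowest set to 0)."""
    ratings = [400 * math.log10(s) for s in strengths]
    floor = min(ratings)
    return [r - floor for r in ratings]

if __name__ == "__main__":
    # Hypothetical results for three models, 80 games per pair.
    wins = [
        [0, 50, 60],
        [30, 0, 45],
        [20, 35, 0],
    ]
    for name, rating in zip(["A", "B", "C"], to_elo_scale(fit_bradley_terry(wins))):
        print(f"model {name}: {rating:.0f}")
```

Unlike sequential Elo updates, a Bradley-Terry fit uses all games at once, so the resulting ratings do not depend on the order in which the games were played.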
Four-in-a-Row is a “solved” game. Frontier LLMs still can’t play it reliably.
📢 We just launched a new Game Arena leaderboard to test how models reason step-by-step, maintain a mental board and plan moves - no minimax, no game-tree shortcuts.
📢 Exciting News!
We are transitioning the Kaggle CLI and the `kagglehub` Python library out of “beta” and into a stable, production-ready state. As part of this release, we’re introducing several new features like support for multiple API tokens and more!
🏆 5 tasks will be featured on our official channels.
🏅 Selected creators earn a Task Tuesday Award on Kaggle.
Got a benchmark? Drop the link in the comments on our forum post here: 👉 www.kaggle.com/discussions/...
The Game Arena event has concluded but the analysis is just beginning. 🤖
We're looking for the best community-created benchmarks that propose new games or dynamic tests for LLMs to feature for this week’s #TaskTuesday!