DeepMind AGI framework launch and Kaggle hackathon 2026
DeepMind released a cognitive framework to measure progress toward AGI and opened a Kaggle competition to crowdsource evaluation tasks and datasets.
TL;DR
- DeepMind published a cognitive framework to measure progress toward artificial general intelligence and opened a Kaggle hackathon inviting the community to develop evaluation tasks and datasets.
- The framework maps cognitive abilities to concrete test designs and proposes templates and metrics intended for shared benchmark construction.
DeepMind published a cognitive framework to measure progress toward artificial general intelligence, and opened a Kaggle hackathon inviting the community to develop evaluation tasks and datasets. The framework maps cognitive abilities to concrete test designs and proposes templates and metrics intended for shared benchmark construction.
The framework organizes capabilities into cognitive domains and recommends evaluation axes beyond single-task accuracy, such as sample efficiency, generalization, robustness, and learning speed. DeepMind positions the framework as a tool for researchers and benchmark builders, and the Kaggle hackathon is presented as an immediate mechanism to collect community-built tasks and standardized evaluation code.
What the framework covers
The framework breaks down intelligence into interpretable cognitive faculties, and pairs each faculty with suggested task types and measurement strategies. Domains identified include perception, memory, planning, abstract reasoning, social understanding, and adaptive learning. For each domain the framework offers task templates intended to probe specific abilities under controlled conditions, for example testing planning by varying horizon length and environmental stochasticity, or testing memory by varying retention intervals and interference.
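A parameterized task template of the kind described above can be sketched as follows. This is a minimal illustration, not code from the framework: `MemoryTaskConfig`, `make_memory_trial`, and `make_sweep` are hypothetical names, and the toy vocabulary stands in for whatever stimuli a real task would use. It shows how retention interval and interference can be varied independently while everything else stays fixed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryTaskConfig:
    """Hypothetical knobs for a delayed-recall task (names are illustrative)."""
    retention_interval: int  # number of delay steps between study and recall
    num_distractors: int     # interfering items shown during the delay
    seed: int = 0


def make_memory_trial(cfg: MemoryTaskConfig) -> dict:
    """Build one trial: a studied item, a delay sequence, and the expected answer."""
    rng = random.Random(cfg.seed)
    vocab = [f"item{i}" for i in range(100)]
    target = rng.choice(vocab)
    # Distractors come from the rest of the vocabulary, so they never equal the target.
    others = [v for v in vocab if v != target]
    distractors = rng.sample(others, cfg.num_distractors)
    # Pad the delay with neutral filler so its length matches the retention interval.
    filler = ["<blank>"] * max(0, cfg.retention_interval - cfg.num_distractors)
    delay = distractors + filler
    rng.shuffle(delay)
    return {"study": target, "delay": delay, "answer": target}


def make_sweep(intervals, distractor_counts, seed=0):
    """Cross retention intervals with interference levels to form a controlled grid."""
    return [
        MemoryTaskConfig(interval, count, seed)
        for interval in intervals
        for count in distractor_counts
        if count <= interval  # skip cells where distractors would not fit in the delay
    ]
```

Because each configuration is seeded, the same grid can be regenerated exactly, which is what makes comparisons across models meaningful.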
Metrics go beyond top-line accuracy. DeepMind emphasizes measures such as data efficiency, out-of-distribution generalization, robustness to perturbations, latency of learning, and the ability to compose learned skills. The framework also encourages multi-metric reporting so that a model's strengths and weaknesses are visible across dimensions rather than reduced to a single leaderboard score.
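To make multi-metric reporting concrete, here is a small sketch of a report that keeps several measurements side by side. The function names, the report keys, and the 0.8 threshold are assumptions chosen for illustration; the framework's actual metric definitions are not reproduced here.

```python
from typing import Optional, Sequence, Tuple


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match their labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and the same length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def samples_to_threshold(
    curve: Sequence[Tuple[int, float]], threshold: float
) -> Optional[int]:
    """Smallest training-set size whose accuracy reaches the threshold, else None.

    `curve` holds (num_training_samples, accuracy) pairs; a simple stand-in
    for a data-efficiency measure.
    """
    for num_samples, acc in sorted(curve):
        if acc >= threshold:
            return num_samples
    return None


def build_report(in_dist, ood, perturbed, curve, threshold=0.8) -> dict:
    """Combine several views of one model into a single multi-metric record.

    Each of `in_dist`, `ood`, and `perturbed` is a (preds, labels) pair.
    """
    base = accuracy(*in_dist)
    return {
        "in_dist_accuracy": base,
        "ood_gap": base - accuracy(*ood),              # drop on out-of-distribution data
        "robustness_gap": base - accuracy(*perturbed),  # drop under input perturbation
        "samples_to_threshold": samples_to_threshold(curve, threshold),
    }
```

Reporting gaps rather than a single averaged score keeps a model that is accurate but brittle distinguishable from one that is slightly less accurate but stable.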
The document suggests both synthetic and grounded tasks: simulated environments for reproducible stress tests, curated datasets reflecting real-world complexities, and interactive evaluations where models must learn through interaction. It calls for modular tests that can be combined to form composite assessments, and for metadata standards that make leaderboards and comparisons more interpretable.
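One way to picture modular tests with attached metadata is a list of small evaluation modules that a runner combines into a composite assessment. The sketch below is an assumption about how such modules might be structured; `EvalModule`, `run_composite`, and the metadata fields are hypothetical and do not come from the framework or the competition rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# For this sketch a "model" is any callable mapping a prompt string to an answer string.
Model = Callable[[str], str]


@dataclass
class EvalModule:
    """One self-contained test plus the metadata needed to interpret its score."""
    name: str
    domain: str                    # e.g. "memory" or "planning"
    metric: str                    # what the returned number means
    run: Callable[[Model], float]  # executes the test and returns a score
    metadata: Dict[str, str] = field(default_factory=dict)


def run_composite(
    modules: List[EvalModule], model: Model, domains: Optional[List[str]] = None
) -> List[dict]:
    """Run the selected modules and return one row per module, without averaging."""
    selected = [m for m in modules if domains is None or m.domain in domains]
    return [
        {
            "name": m.name,
            "domain": m.domain,
            "metric": m.metric,
            "score": m.run(model),
            "metadata": dict(m.metadata),
        }
        for m in selected
    ]
```

A composite assessment is then just a choice of modules, for example `run_composite(modules, model, domains=["memory", "planning"])`, and each row carries enough context to be read on its own.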
The Kaggle hackathon and evaluation goals
DeepMind launched a Kaggle hackathon to accelerate the construction of evaluation tasks that follow the framework. Participants are invited to submit task definitions, dataset processing code, baseline implementations, and evaluation harnesses compatible with recommended metrics. The competition aims to produce a reusable corpus of evaluation modules and reference implementations that others can run on different models and compute budgets.
The hackathon is framed as community-driven: submissions will be shareable, and organizers expect accepted tasks to seed public leaderboards or be integrated into larger evaluation suites. The guidelines emphasize transparent task descriptions, reproducible baselines, and clear metric computation so that independent researchers and smaller labs can run the same evaluations without specialized infrastructure.
DeepMind also highlights concerns about potential failure modes of single-metric leaderboards and asks contributors to include adversarial and stress-test variants. The goal is to encourage benchmarks that reveal brittle behaviors, scaling plateaus, and trade-offs between capabilities such as speed versus reliability.
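A stress-test variant can be as simple as re-running the same examples under increasing perturbation and recording how accuracy changes. The following sketch assumes text inputs and a character-replacement perturbation; both choices, and the names `perturb` and `stress_curve`, are illustrative rather than anything the guidelines prescribe.

```python
import random
from typing import Callable, Dict, Sequence, Tuple


def perturb(text: str, rate: float, rng: random.Random) -> str:
    """Replace each alphabetic character with a random lowercase letter with probability `rate`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if ch.isalpha() and rng.random() < rate else ch
        for ch in text
    )


def stress_curve(
    model: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    rates: Sequence[float],
    seed: int = 0,
) -> Dict[float, float]:
    """Accuracy of `model` at each perturbation rate; examples are (input, label) pairs."""
    curve = {}
    for rate in rates:
        rng = random.Random(seed)  # reseed per rate so every run is reproducible
        correct = sum(model(perturb(x, rate, rng)) == y for x, y in examples)
        curve[rate] = correct / len(examples)
    return curve
```

Plotting the resulting curve, rather than reporting a single clean-input score, is one way to surface the brittle behavior that the guidelines ask contributors to probe.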
Why it matters
A shared cognitive framework and an open hackathon direct attention toward standardized, multi-dimensional evaluations rather than single-number comparisons. If adopted broadly, the approach could shift research priorities toward sample efficiency, robustness, and compositional skills alongside raw performance. Community-built tasks may expose shortcomings of current architectures and influence funding, deployment decisions, and where safety researchers concentrate testing resources.
| Domain | Example task | Suggested metrics |
|---|---|---|
| Perception | Visual categorization under occlusion | Accuracy, robustness to noise, latency |
| Memory | Delayed recall with distractors | Retention accuracy, interference sensitivity, sample efficiency |
| Planning | Long-horizon navigation with stochastic dynamics | Success rate, planning horizon scaling, compute cost |
| Abstract reasoning | Novel puzzle composition and analogy tasks | Generalization, few-shot performance, compositionality |
| Social understanding | Theory-of-mind style prediction tasks | Predictive accuracy, robustness to deceptive signals |
| Adaptive learning | Rapid adaptation to new tasks from small data | Adaptation speed, final performance, forgetting |
Primary source
Google DeepMind
deepmind.google