Inverse Scaling in Test-time Compute
 2 •
Alexander Hägele3 • Runjin Chen4 • Andy Arditi • Kit Fraser-Taliente • Henry Sleight5 • Linda Petrini6
Anthropic Fellows Program • 1Anthropic • 2University of Edinburgh • 3EPFL • 4University of Texas at Austin • 5Constellation • 6Independent • 7Scale AI • 8MiniML.AI
* Now at Meta
Abstract
We construct evaluation tasks where extending reasoning length deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: red herring tasks with embedded distractors, spurious correlation tasks, constraint satisfaction tasks, and advanced AI risks.
Our analyses show that different Large Reasoning Models (LRMs) exhibit distinct failure modes: Claude models become increasingly distracted by irrelevant information in a given prompt as they reason longer; OpenAI o-series models resist distractors but show pronounced overfitting to problem framings; in spurious correlation tasks, extended reasoning causes models to shift from reasonable priors to plausible but incorrect features, though providing few-shot examples largely corrects this behavior; in constraint satisfaction tasks, all models show performance degradation with extended reasoning, suggesting difficulties in maintaining focus during complex deductive tasks; and in AI safety evaluation tasks, we find that extended reasoning can amplify model-specific concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation in longer reasoning traces.
These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.
Demo
Explore how different models perform on inverse scaling tasks across various reasoning budgets. Compare performance between baseline (no thinking) and maximum reasoning budget to see inverse scaling effects.
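The baseline-versus-maximum-budget comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names `accuracy_by_budget` and `scaling_gap`, the `(budget, is_correct)` record layout, the convention that budget 0 means "no thinking", and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_budget(records):
    """Map each reasoning budget (in tokens) to mean accuracy.

    `records` is an iterable of (budget, is_correct) pairs. Treating
    budget 0 as the no-thinking baseline is an assumption of this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # budget -> [num_correct, num_total]
    for budget, is_correct in records:
        totals[budget][0] += int(is_correct)
        totals[budget][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}


def scaling_gap(acc):
    """Accuracy at the largest budget minus accuracy at the smallest one."""
    budgets = sorted(acc)
    return acc[budgets[-1]] - acc[budgets[0]]


# Made-up illustrative records, not results from the paper.
records = [
    (0, True), (0, True), (0, True), (0, False),
    (1024, True), (1024, True), (1024, False), (1024, False),
    (8192, True), (8192, False), (8192, False), (8192, False),
]

acc = accuracy_by_budget(records)
gap = scaling_gap(acc)
print(acc)
print(f"gap (max budget - baseline): {gap:+.2f}")
```

A negative gap means accuracy fell as the reasoning budget grew, which is the inverse scaling pattern. Comparing only the two endpoints is a deliberate simplification; inspecting the full per-budget curve would show whether the decline is monotonic.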
Key Findings
Our evaluation reveals important insights about how test-time scaling affects model performance across different types of reasoning tasks.