Inverse Scaling in Test-time Compute

Aryo Pradipta Gema (AFP, 2), Alexander Hägele (AFP, 3), Runjin Chen (AFP, 4), Andy Arditi (AFP),
Jacob Goldman-Wetzler (AFP), Kit Fraser-Taliente (AFP), Henry Sleight (5), Linda Petrini (6),
Julian Michael (7, *), Beatrice Alex (2), Pasquale Minervini (2, 8),
Yanda Chen (1), Joe Benton (1), Ethan Perez (1)

AFP: Anthropic Fellows Program • 1: Anthropic • 2: University of Edinburgh • 3: EPFL • 4: University of Texas at Austin
5: Constellation • 6: Independent • 7: Scale AI • 8: MiniML.AI
*: Now at Meta
Corresponding authors: aryo.gema@ed.ac.uk, ethan@anthropic.com

Abstract

We construct evaluation tasks where extending the reasoning length degrades performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. These tasks span four categories: red herring tasks with embedded distractors, spurious correlation tasks, constraint satisfaction tasks, and advanced AI risk scenarios.

Our analyses show that different Large Reasoning Models (LRMs) exhibit distinct failure modes: Claude models become increasingly distracted by irrelevant information in the prompt as they reason longer; OpenAI o-series models resist distractors but show pronounced overfitting to problem framings; in spurious correlation tasks, extended reasoning causes models to shift from reasonable priors to plausible but incorrect features, though providing few-shot examples largely corrects this behavior; in constraint satisfaction tasks, all models show performance degradation with extended reasoning, suggesting difficulty maintaining focus during complex deductive tasks; and in AI safety evaluation tasks, extended reasoning can amplify model-specific concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation in longer reasoning traces.

These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.

Demo

Explore how different models perform on the inverse scaling tasks across a range of reasoning budgets. Compare performance at the baseline (no extended thinking) against the maximum reasoning budget to see the inverse scaling effect.
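
As a rough sketch of how such a comparison could be run programmatically, the snippet below queries a Claude model on the same prompt once without extended thinking and once with a large thinking budget via the Anthropic Messages API. The model identifier, budget value, and example prompt are illustrative placeholders, not the exact configuration used in the paper.

# Minimal sketch: compare a no-thinking baseline against a large reasoning budget.
# Assumes the `anthropic` Python SDK and an ANTHROPIC_API_KEY in the environment.
# The model name, budget, and prompt below are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

MODEL = "claude-sonnet-4-20250514"  # placeholder model identifier
PROMPT = (
    "You have an apple and an orange. There is a 61% chance the apple is a "
    "Red Delicious. How many fruits do you have?"  # toy distractor-style question
)

def ask(prompt: str, budget_tokens: int | None = None) -> str:
    """Send one prompt, optionally enabling extended thinking with a token budget."""
    kwargs = {}
    max_tokens = 1024
    if budget_tokens is not None:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
        max_tokens = budget_tokens + 1024  # max_tokens must exceed the thinking budget
    response = client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    # The final text block holds the visible answer; thinking blocks (if any) precede it.
    return next(b.text for b in response.content if b.type == "text")

print("baseline:", ask(PROMPT))                       # no extended thinking
print("extended:", ask(PROMPT, budget_tokens=8_000))  # large reasoning budget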

Key Findings

Our evaluation shows how test-time compute scaling affects model performance across the four task categories above: in many settings, accuracy declines as reasoning length grows, and the specific failure mode depends on the model family and the task type.
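
To make the comparison concrete, here is a minimal, self-contained sketch of how inverse scaling could be summarized from per-budget accuracies: a task is flagged when accuracy at the largest reasoning budget falls below the no-thinking baseline. All numbers and task names are made-up placeholders, not results reported in the paper.

# Sketch: flag tasks where accuracy at the largest reasoning budget falls below
# the no-thinking baseline (i.e., inverse scaling in test-time compute).
# All numbers and task names are invented placeholders for illustration.
accuracy_by_budget = {
    # task name: {reasoning budget in tokens: accuracy}
    "red_herring_counting":    {0: 0.95, 1024: 0.90, 4096: 0.80, 16384: 0.70},
    "spurious_regression":     {0: 0.85, 1024: 0.82, 4096: 0.75, 16384: 0.65},
    "constraint_satisfaction": {0: 0.60, 1024: 0.62, 4096: 0.55, 16384: 0.50},
}

for task, curve in accuracy_by_budget.items():
    baseline = curve[min(curve)]    # accuracy with no extended thinking (budget 0)
    at_max = curve[max(curve)]      # accuracy at the largest budget
    delta = at_max - baseline
    trend = "inverse scaling" if delta < 0 else "flat/positive scaling"
    print(f"{task}: baseline={baseline:.2f}, max budget={at_max:.2f}, "
          f"delta={delta:+.2f} -> {trend}")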

Citation

@article{gema2025inverse,
  title={Inverse Scaling in Test-time Compute: When More Thinking Makes Large Language Models Worse},
  author={Gema, Aryo Pradipta and Hägele, Alexander and Chen, Runjin and Arditi, Andy and Goldman-Wetzler, Jacob and Fraser-Taliente, Cristofero and Sleight, Henry and Petrini, Linda and Michael, Julian and Alex, Beatrice and Minervini, Pasquale and Chen, Yanda and Benton, Joe and Perez, Ethan},
  journal={arXiv preprint arXiv:2025.XXXXX},
  year={2025}
}