Inverse Scaling in Test-time Compute

Aryo Pradipta Gema (AFP, 2), Alexander Hägele (AFP, 3), Runjin Chen (AFP, 4), Andy Arditi (AFP),
Jacob Goldman-Wetzler (AFP), Kit Fraser-Taliente (AFP), Henry Sleight (5), Linda Petrini (6),
Julian Michael (7, *), Beatrice Alex (2), Pasquale Minervini (2, 8),
Yanda Chen (1), Joe Benton (1), Ethan Perez (1)

AFP: Anthropic Fellows Program • 1: Anthropic • 2: University of Edinburgh • 3: EPFL • 4: University of Texas at Austin
5: Constellation • 6: Independent • 7: Scale AI • 8: MiniML.AI
*: Now at Meta
Corresponding authors: aryo.gema@ed.ac.uk, ethan@anthropic.com

Abstract

We construct evaluation tasks where extending the reasoning length degrades performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. These tasks span four categories: red herring tasks with embedded distractors, spurious correlation tasks, constraint satisfaction tasks, and advanced AI risk scenarios.

Our analyses show that different Large Reasoning Models (LRMs) exhibit distinct failure modes: Claude models become increasingly distracted by irrelevant information in the prompt as they reason longer; OpenAI o-series models resist distractors but show pronounced overfitting to problem framings; in spurious correlation tasks, extended reasoning causes models to shift from reasonable priors to plausible but incorrect features, though providing few-shot examples largely corrects this behavior; in constraint satisfaction tasks, all models show performance degradation with extended reasoning, suggesting difficulty maintaining focus during complex deductive tasks; and in AI safety evaluation tasks, extended reasoning can amplify model-specific concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation in longer reasoning traces.

These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.

Demo

Explore how different models perform on the inverse scaling tasks across a range of reasoning budgets. Compare performance at the baseline (no extended thinking) against the maximum reasoning budget to see the inverse scaling effect.
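
As a rough sketch of how such a comparison could be run programmatically, the snippet below queries a Claude model on the same prompt once without extended thinking and once with a large thinking budget via the Anthropic Messages API. The model identifier, budget value, and example prompt are illustrative placeholders, not the exact configuration used in the paper.

# Minimal sketch: compare a no-thinking baseline against a large reasoning budget.
# Assumes the `anthropic` Python SDK and an ANTHROPIC_API_KEY in the environment.
# The model name, budget, and prompt below are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

MODEL = "claude-sonnet-4-20250514"  # placeholder model identifier
PROMPT = (
    "You have an apple and an orange. There is a 61% chance the apple is a "
    "Red Delicious. How many fruits do you have?"  # toy distractor-style question
)

def ask(prompt: str, budget_tokens: int | None = None) -> str:
    """Send one prompt, optionally enabling extended thinking with a token budget."""
    kwargs = {}
    max_tokens = 1024
    if budget_tokens is not None:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
        max_tokens = budget_tokens + 1024  # max_tokens must exceed the thinking budget
    response = client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    # The final text block holds the visible answer; thinking blocks (if any) precede it.
    return next(b.text for b in response.content if b.type == "text")

print("baseline:", ask(PROMPT))                       # no extended thinking
print("extended:", ask(PROMPT, budget_tokens=8_000))  # large reasoning budget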

Key Findings

Our evaluation shows how test-time compute scaling affects model performance across the four task categories above: in many settings, accuracy declines as reasoning length grows, and the specific failure mode depends on the model family and the task type.
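
To make the comparison concrete, here is a minimal, self-contained sketch of how inverse scaling could be summarized from per-budget accuracies: a task is flagged when accuracy at the largest reasoning budget falls below the no-thinking baseline. All numbers and task names are made-up placeholders, not results reported in the paper.

# Sketch: flag tasks where accuracy at the largest reasoning budget falls below
# the no-thinking baseline (i.e., inverse scaling in test-time compute).
# All numbers and task names are invented placeholders for illustration.
accuracy_by_budget = {
    # task name: {reasoning budget in tokens: accuracy}
    "red_herring_counting":    {0: 0.95, 1024: 0.90, 4096: 0.80, 16384: 0.70},
    "spurious_regression":     {0: 0.85, 1024: 0.82, 4096: 0.75, 16384: 0.65},
    "constraint_satisfaction": {0: 0.60, 1024: 0.62, 4096: 0.55, 16384: 0.50},
}

for task, curve in accuracy_by_budget.items():
    baseline = curve[min(curve)]    # accuracy with no extended thinking (budget 0)
    at_max = curve[max(curve)]      # accuracy at the largest budget
    delta = at_max - baseline
    trend = "inverse scaling" if delta < 0 else "flat/positive scaling"
    print(f"{task}: baseline={baseline:.2f}, max budget={at_max:.2f}, "
          f"delta={delta:+.2f} -> {trend}")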

Citation

@article{gema2025inverse,
  title={Inverse Scaling in Test-time Compute: When More Thinking Makes Large Language Models Worse},
  author={Gema, Aryo Pradipta and Hägele, Alexander and Chen, Runjin and Arditi, Andy and Goldman-Wetzler, Jacob and Fraser-Taliente, Cristofero and Sleight, Henry and Petrini, Linda and Michael, Julian and Alex, Beatrice and Minervini, Pasquale and Chen, Yanda and Benton, Joe and Perez, Ethan},
  journal={arXiv preprint arXiv:2025.XXXXX},
  year={2025}
}