The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Zihan Chen^*1,2, Yiming Zhang^*1,2, Wenxiang Geng^1, Zenghui Ding^†1, Yining Sun^1,2

^1HFIPS, Chinese Academy of Sciences, ^2University of Science and Technology of China

ACL 2026 Main Conference
^*Equal Contribution ^†Corresponding Author

**Figure 1: Overview.** The figure summarizes the paradox of outcome optimization, the reward-induced shortcut collapse mechanism, and the role of process supervision as the causal solution.

Abstract

Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capabilities on out-of-distribution (OOD) tasks. We term this phenomenon Reward-Induced Manifold Collapse. In this paper, we establish a rigorous theoretical framework bridging Structural Causal Models (SCMs) and the Information Bottleneck (IB) principle to explain this paradox. We formally define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. We show that under the implicit inductive bias of Stochastic Gradient Descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a "Markovian Screening" of the true causal mechanism. Furthermore, we derive a new generalization bound based on a Semantic Coverage Measure (η) rather than sample size, theoretically showing why data scaling on homogeneous distributions may fail to correct reasoning flaws. Finally, we show that Process Reward Models (PRMs) function as Topological Filters, enforcing step-wise mutual information constraints that render the low-complexity shortcut manifold inadmissible. Our theoretical contributions provide a mathematical grounding for the role of process supervision beyond simple credit assignment.

Theorem 1: Shortcut Screening

If the training reward is already explained by shortcut features and those shortcuts are easier to learn than causal features, outcome-based training is driven toward shortcut-dominated representations.
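The screening effect can be illustrated with a toy experiment. The sketch below is not the paper's construction: it assumes a two-feature logistic model in which "easier to learn" is modeled simply as a larger feature scale, and all names (`make_data`, `train`, `accuracy`, `shortcut_agrees`) are hypothetical. The shortcut feature agrees with the label on every training example, so gradient descent places most of the weight on it, and accuracy degrades once the shortcut is decorrelated from the label.

```python
import math
import random


def make_data(n, shortcut_agrees, rng):
    """Return ((causal, shortcut), y) pairs with labels y in {-1, +1}.

    causal:   always carries the label, but at a small scale (assumed harder to fit).
    shortcut: large scale; matches the label with probability `shortcut_agrees`.
    """
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        causal = 0.1 * y
        s_sign = y if rng.random() < shortcut_agrees else -y
        shortcut = 1.0 * s_sign
        data.append(((causal, shortcut), y))
    return data


def train(data, steps=200, lr=0.5):
    """Full-batch gradient descent on the logistic loss, starting from zero weights."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1])
            # Derivative of log(1 + exp(-y * z)) with respect to the logit z.
            coeff = -y / (1.0 + math.exp(margin))
            grad[0] += coeff * x[0]
            grad[1] += coeff * x[1]
        w = [w[i] - lr * grad[i] / len(data) for i in range(2)]
    return w


def accuracy(w, data):
    """Fraction of examples classified with a strictly positive margin."""
    correct = sum(1 for x, y in data if y * (w[0] * x[0] + w[1] * x[1]) > 0)
    return correct / len(data)


rng = random.Random(0)
train_set = make_data(500, shortcut_agrees=1.0, rng=rng)  # shortcut explains the reward
ood_set = make_data(500, shortcut_agrees=0.5, rng=rng)    # shortcut is uninformative

w = train(train_set)
print("weights (causal, shortcut):", w)
print("in-distribution accuracy:", accuracy(w, train_set))
print("OOD accuracy:", accuracy(w, ood_set))
```

In this toy setting the causal feature remains perfectly predictive out of distribution, yet the learned predictor follows the shortcut whenever the two disagree, which mirrors the qualitative claim of Theorem 1.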

Theorem 2: Semantic Coverage Bound

Robust OOD error is controlled by semantic coverage rather than raw sample count, so scaling data alone does not remove failure when the training distribution stays homogeneous.
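A small simulation can make the distinction between sample count and coverage concrete. This is only an illustration under strong assumptions, not the bound itself: the task is split into a fixed number of hypothetical "semantic modes", the learner memorizes the label of each mode it has seen and guesses on the rest, and the fraction of modes seen stands in for η. The function name `ood_error` and its parameters are invented for this sketch.

```python
import random


def ood_error(n_samples, covered_modes, total_modes=10, seed=0):
    """Return (coverage, OOD error) for a toy memorizing learner.

    Training samples are drawn only from the first `covered_modes` modes;
    evaluation is uniform over all `total_modes` modes.
    """
    rng = random.Random(seed)
    # Each mode has its own hidden binary label (assumed task structure).
    true_label = [rng.randint(0, 1) for _ in range(total_modes)]

    # "Training": record the label of every mode that appears in the sample.
    seen = {}
    for _ in range(n_samples):
        m = rng.randrange(covered_modes)
        seen[m] = true_label[m]

    # Evaluation: seen modes are answered correctly, unseen modes are guessed.
    trials, errors = 20000, 0
    for _ in range(trials):
        m = rng.randrange(total_modes)
        pred = seen.get(m, rng.randint(0, 1))
        errors += pred != true_label[m]

    coverage = len(seen) / total_modes  # empirical stand-in for eta
    return coverage, errors / trials


for n, k in [(100, 3), (10000, 3), (100, 9)]:
    cov, err = ood_error(n, k)
    print(f"n={n:>6} covered_modes={k} coverage={cov:.1f} ood_error={err:.3f}")
```

Under these assumptions, raising the sample count from 100 to 10000 at fixed coverage leaves the OOD error essentially unchanged, while widening coverage at the smaller sample count reduces it.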

Theorem 3: PRM as Topological Filter

Process rewards enforce valid intermediate transitions, turning shortcut solutions into high-loss regions and pushing optimization toward causally consistent reasoning paths.
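The filtering idea can be sketched on arithmetic traces. The example below is a simplified stand-in, assuming a hand-written step checker in place of a learned PRM; it does not implement the step-wise mutual information constraint from the paper, and the trace format and function names are hypothetical. Each step is a tuple `(op, operand, claimed_result)`.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def outcome_reward(trace, start, target):
    """Return 1.0 if the final claimed value equals the target, else 0.0."""
    final = trace[-1][2] if trace else start
    return 1.0 if final == target else 0.0


def process_reward(trace, start):
    """Return the fraction of steps whose claimed result follows from the previous state."""
    if not trace:
        return 0.0
    state, valid = start, 0
    for op, operand, claimed in trace:
        if OPS[op](state, operand) == claimed:
            valid += 1
        state = claimed  # continue from whatever the trace claims
    return valid / len(trace)


start, target = 2, (2 + 3) * 4

# Every transition is valid: 2 + 3 = 5, then 5 * 4 = 20.
causal_trace = [("+", 3, 5), ("*", 4, 20)]
# Reaches the right answer through invalid transitions: 2 + 3 != 7 and 7 * 4 != 20.
shortcut_trace = [("+", 3, 7), ("*", 4, 20)]

for name, trace in [("causal", causal_trace), ("shortcut", shortcut_trace)]:
    print(name, outcome_reward(trace, start, target), process_reward(trace, start))

# Keep only traces in which every intermediate transition is valid.
admissible = [t for t in (causal_trace, shortcut_trace) if process_reward(t, start) == 1.0]
```

Both traces receive the same outcome reward, so an outcome-only objective cannot tell them apart; the step-level check separates them and leaves only the causally consistent trace in `admissible`.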

Theorem Gallery

Rendered Theorem Snapshots

The rendered LaTeX snapshots below show the three central theoretical results.


Theorem 1. Reward-Induced Manifold Collapse: shortcut features that already explain training reward can dominate optimization and screen out causal features.


Theorem 2. Coverage-Dependent Generalization Error: robust OOD risk is bounded by semantic coverage rather than raw sample count alone.


Theorem 3. Topological Separation via Step-wise Mutual Information: process rewards act as filters that exclude shortcut manifolds from the optimal set.

Experimental Results

Selected Result Figures

Two representative result figures are shown below to complement the theorem gallery with empirical evidence.

**Result Figure 1.** Main experimental result.

**Result Figure 2.** Additional benchmark or ablation result.

Poster

Poster Is Coming Soon

A poster version of this work will be added here in a future update.
