CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Dachuan Shi¹ Hanlin Zhu² Xiangchi Yuan¹ Wanjia Zhao³ Kejing Xia¹ Wen Xiao⁴ Wenke Lee¹

¹Georgia Tech ²UC Berkeley ³Stanford University ⁴Microsoft

Up to +23% peak accuracy at the same or lower token usage

Up to -57% token usage at the same or higher accuracy

Across Math & Coding & Agentic reasoning benchmarks

CopT overview figure: answer-first pipeline, continuous-space verifier, and accuracy-token tradeoff. (a) Conceptual comparison between CoT thinking and CopT on-policy thinking. (b) CopT contrasts the output distributions under discrete and continuous inputs. (c) CopT improves peak accuracy, marked by *, across mathematics, coding, and agentic reasoning tasks and nearly halves token usage at matched accuracy.

TL;DR

CopT is a pipeline with continuous-space verifiers for math, coding, and agentic reasoning, enabling LLMs to start with a draft answer and perform on-policy thinking conditioned on it for reflection and correction.

CopT Workflow

CopT methodology figure: draft-answer reliability estimation and on-policy thinking with dynamic draft visibility. **Overview of CopT workflow.** CopT starts with a draft answer and performs on-policy thinking conditioned on it. It contrasts the model's support for the same chosen tokens under discrete and continuous inputs at two points: once over the draft answer, to estimate its reliability, and then chunk by chunk during on-policy thinking, to determine the visibility of the draft answer across time steps.

Reliability Estimator w/ Continuous Contrasts

\[ \kappa_a(a_{1:T_a}) := \frac{1}{T_a}\sum_{t=1}^{T_a} \big[\log p_\theta(a_t\mid q,a_{<t})-\log p_\theta(a_t\mid q,e_{<t})\big]. \]

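Contrast the model's support for the same generated tokens under discrete-token and continuous-embedding inputs. The averaged log-probability gap serves as a sequence-level KL estimate of answer reliability: a larger \(\kappa_a\) indicates a less reliable draft answer.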
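The estimator above reduces to an average of per-token log-probability gaps, so it can be sketched in a few lines once both sets of log-probabilities are available. The function name `reliability_estimate` and the numeric inputs are hypothetical, and how the two log-probability sequences are read out of a model is outside this sketch.

```python
from typing import Sequence


def reliability_estimate(
    logp_discrete: Sequence[float],
    logp_continuous: Sequence[float],
) -> float:
    """Mean per-token gap between discrete- and continuous-input log-probs.

    Both sequences score the SAME draft-answer tokens a_1..a_T:
    logp_discrete[t]   ~ log p(a_t | q, a_<t)   (discrete token prefix)
    logp_continuous[t] ~ log p(a_t | q, e_<t)   (continuous embedding prefix)
    """
    if len(logp_discrete) != len(logp_continuous):
        raise ValueError("both inputs must score the same tokens")
    if not logp_discrete:
        raise ValueError("the draft answer must contain at least one token")
    gaps = [d - c for d, c in zip(logp_discrete, logp_continuous)]
    return sum(gaps) / len(gaps)


# Hypothetical log-probs for a three-token draft answer (illustrative only).
kappa_a = reliability_estimate([-0.1, -0.2, -0.3], [-0.4, -0.2, -0.9])
print(round(kappa_a, 3))
```

Because only the probabilities of already chosen tokens are needed, the inputs here are plain lists rather than full vocabulary distributions.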

On-Policy Thinking Elicitation

\[ \kappa_a>\tau_a \Rightarrow r\sim p_\theta(r\mid a,q) \]

Perform subsequent on-policy thinking conditioned on the draft answer for reflection and correction when the draft answer is deemed insufficiently reliable.
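A minimal sketch of this gating rule follows. It assumes that a draft whose estimate does not exceed the threshold is returned directly, which is how the early-answer behavior described on this page is read here. The names `answer_or_think` and `toy_think` are hypothetical, and `toy_think` is only a stand-in for sampling thinking tokens from the model.

```python
from typing import Callable


def answer_or_think(
    draft: str,
    kappa_a: float,
    tau_a: float,
    think: Callable[[str], str],
) -> str:
    """Return the draft as-is, or the result of thinking conditioned on it."""
    if kappa_a > tau_a:
        # Draft deemed insufficiently reliable: elicit on-policy thinking.
        return think(draft)
    # Assumption: a sufficiently reliable draft is returned without thinking.
    return draft


def toy_think(draft: str) -> str:
    # Stand-in for sampling r ~ p_theta(r | a, q) and extracting a final answer.
    return f"revised({draft})"


print(answer_or_think("42", kappa_a=0.05, tau_a=0.2, think=toy_think))
print(answer_or_think("42", kappa_a=0.80, tau_a=0.2, think=toy_think))
```

The threshold values are illustrative; the page does not state how \(\tau_a\) is chosen.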

Stability Estimator w/ Continuous Contrasts

\[ \begin{aligned} \kappa_r^{(k)}(r_{s_k:s_k+C-1}) := \frac{1}{C}\sum_{t=s_k}^{s_k+C-1}\big[&\log p_\theta(r_t\mid q,a^{(m_k)},r_{T_a+1:t-1})\\ &-\log p_\theta(r_t\mid q,a^{(m_k)},r_{T_a+1:s_k-1};e_{s_k:t-1})\big]. \end{aligned} \]

Assess whether thinking chunks remain stable under continuous inputs with a second sequence-level KL.
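The chunk-level estimator has the same shape as the answer-level one, applied to windows of \(C\) thinking tokens. The sketch below assumes the per-token log-probabilities have already been collected for both input types; in the actual workflow they would be produced online, since each chunk depends on the visibility chosen for it. The function name `chunk_stability` is hypothetical, and averaging a shorter trailing chunk over its actual length is an assumption, not something the page specifies.

```python
from typing import List, Sequence


def chunk_stability(
    logp_discrete: Sequence[float],
    logp_continuous: Sequence[float],
    chunk_size: int,
) -> List[float]:
    """Per-chunk mean log-prob gap for consecutive chunks of thinking tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(logp_discrete) != len(logp_continuous):
        raise ValueError("both inputs must score the same tokens")
    estimates = []
    for start in range(0, len(logp_discrete), chunk_size):
        d = logp_discrete[start:start + chunk_size]
        c = logp_continuous[start:start + chunk_size]
        # Assumption: a shorter trailing chunk is averaged over its own length.
        estimates.append(sum(x - y for x, y in zip(d, c)) / len(d))
    return estimates


# Hypothetical log-probs for four thinking tokens split into chunks of two.
print(chunk_stability([-1.0, -1.0, -1.0, -1.0], [-2.0, -2.0, -1.0, -1.0], 2))
```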

Draft Answer Visibility Control

\[ m_{k+1}=\begin{cases} 1, & \kappa_r^{(k)}<\tau_r,\\ 0, & \text{otherwise.} \end{cases} \]

Dynamically expose the draft answer during on-policy thinking to preserve useful partial information while reducing the risk of being misled by unreliable content.
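The update rule above can be traced with a small helper that turns a sequence of chunk estimates into visibility flags. Two points are assumptions of this sketch rather than statements from the page: that \(m = 1\) means the draft answer is visible, and that the first chunk starts with the draft visible (`initial_visibility=1`). The name `visibility_schedule` is hypothetical.

```python
from typing import List, Sequence


def visibility_schedule(
    kappas: Sequence[float],
    tau_r: float,
    initial_visibility: int = 1,
) -> List[int]:
    """Return [m_1, m_2, ..., m_{K+1}] for chunk estimates kappa^(1..K)."""
    schedule = [initial_visibility]  # assumed starting state for the first chunk
    for kappa in kappas:
        # m_{k+1} = 1 when the chunk estimate is below the threshold, else 0.
        schedule.append(1 if kappa < tau_r else 0)
    return schedule


# Hypothetical chunk estimates: the second chunk crosses the threshold.
print(visibility_schedule([0.1, 0.9, 0.2], tau_r=0.5))
```

Note that the comparison is strict, so an estimate exactly equal to \(\tau_r\) sets the next flag to 0.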

Key Results

Math & STEM, up to +3.34%

Accuracy at the same or lower token usage

Coding, up to +6.67%

Accuracy at the same or lower token usage

Agentic, up to +23.03%

Accuracy at the same or lower token usage

Token Usage, up to -55% / -57% / -45%

At the same or higher Math / Coding / Agentic reasoning accuracy

Key Notes

Reverse Order: CopT reverses the usual order of thinking and answering by eliciting a draft answer and performing on-policy thinking conditioned on it. This provides earlier access to answers and avoids unnecessary token consumption.
Continuous Verifier: Unlike methods that directly use continuous embeddings as a medium for generation, CopT recasts them as inference-time verifiers. This allows CopT to use uncertainty information preserved by continuous embeddings as in latent reasoning while retaining the readability of explicit reasoning.
Overhead: The estimators are calculated from already generated sequences, and the corresponding probabilities are obtained in parallel with a single forward pass. Therefore, CopT incurs only small overhead once the chosen-token probabilities and continuous embeddings are cached online during generation.
Theoretical Analysis: The expected estimate equals the mutual information between the unresolved latent state and the emitted answer token. In other words, CopT measures answer-relevant uncertainty rather than uncertainty over latent reasoning states themselves.
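The mutual-information reading in the last note can be checked numerically on a toy example. The sketch below is not the paper's derivation: it assumes a two-state latent variable `z` and a two-token answer `a`, and it treats log p(a | z) as the resolved (discrete-like) support and log p(a) as the unresolved (continuous-like) support. Under those assumptions, the sample average of the log-ratio approaches the exact mutual information I(z; a). All names and the joint distribution are hypothetical.

```python
import math
import random

# Toy joint distribution p(z, a) over a latent state z and an answer token a.
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def marginal(axis: int) -> dict:
    """Marginalize JOINT onto z (axis=0) or a (axis=1)."""
    out = {}
    for pair, p in JOINT.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out


P_Z, P_A = marginal(0), marginal(1)


def log_ratio(z: int, a: int) -> float:
    """log p(a | z) - log p(a): resolved-state support minus marginal support."""
    return math.log(JOINT[(z, a)] / P_Z[z]) - math.log(P_A[a])


def exact_mutual_information() -> float:
    """I(z; a) in nats, computed directly from the joint distribution."""
    return sum(p * math.log(p / (P_Z[z] * P_A[a])) for (z, a), p in JOINT.items())


def sampled_estimate(num_samples: int, seed: int = 0) -> float:
    """Average the log-ratio over sampled (z, a) pairs, as an estimator would."""
    rng = random.Random(seed)
    pairs = rng.choices(list(JOINT), weights=list(JOINT.values()), k=num_samples)
    return sum(log_ratio(z, a) for z, a in pairs) / num_samples


print(exact_mutual_information(), sampled_estimate(20_000))
```

In this toy setting the quantity depends only on how much the latent state changes the answer distribution, which matches the note's point that the estimate tracks answer-relevant uncertainty.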

BibTeX

@misc{shi2026coptcontrastiveonpolicythinking,
      title={CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning}, 
      author={Dachuan Shi and Hanlin Zhu and Xiangchi Yuan and Wanjia Zhao and Kejing Xia and Wen Xiao and Wenke Lee},
      year={2026},
      eprint={2605.20075},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.20075}, 
}