CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

¹Georgia Tech, ²UC Berkeley, ³Stanford University, ⁴Microsoft
Up to +23% peak accuracy at the same or lower token usage
Up to -57% token usage at the same or higher accuracy
Across math, coding, and agentic reasoning benchmarks
CopT overview: answer-first pipeline, continuous-space verifier, and accuracy-token tradeoff
(a) Conceptual comparison between CoT thinking and CopT on-policy thinking. (b) CopT contrasts the output distributions under discrete and continuous inputs. (c) CopT improves peak accuracy, marked by *, across mathematics, coding, and agentic reasoning tasks and nearly halves token usage at matched accuracy.
TL;DR

CopT is a reasoning pipeline with continuous-space verifiers, enabling LLMs to start with a draft answer and invoke on-policy thinking conditioned on it.

CopT Workflow

CopT methodology figure showing draft-answer reliability estimation and on-policy thinking with dynamic draft visibility
CopT starts with a draft answer and performs on-policy thinking conditioned on it. It contrasts the model's support for the same chosen tokens under discrete and continuous inputs at two points: once over the draft answer, to estimate its reliability, and then chunk by chunk during on-policy thinking, to decide whether the draft answer remains visible at later time steps.
01

Reliability Estimator

\[ \kappa_a(a_{1:T_a}) := \frac{1}{T_a}\sum_{t=1}^{T_a} \left[\log p_\theta(a_t\mid q,a_{\lt t})-\log p_\theta(a_t\mid q,e_{\lt t})\right]. \]

Contrast discrete-token and continuous-embedding inputs to estimate answer reliability.
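The estimator above reduces to an average of per-token log-probability gaps once the two sets of log-probabilities are available. The following is a minimal sketch under that assumption: the function name `reliability_estimate` and the example numbers are hypothetical, and the sketch does not show how the continuous inputs \(e_{\lt t}\) are constructed or fed to the model.

```python
from typing import Sequence


def reliability_estimate(
    logp_discrete: Sequence[float],
    logp_continuous: Sequence[float],
) -> float:
    """Average gap: log p(a_t | q, a_<t) - log p(a_t | q, e_<t) over the draft answer."""
    if len(logp_discrete) != len(logp_continuous):
        raise ValueError("both sequences must cover the same answer tokens")
    if len(logp_discrete) == 0:
        raise ValueError("the draft answer must contain at least one token")
    gaps = [d - c for d, c in zip(logp_discrete, logp_continuous)]
    return sum(gaps) / len(gaps)


# Made-up log-probabilities for a three-token draft answer (not real model output).
# The first entries coincide because, in the formula as written, no earlier
# answer token precedes a_1, so both conditionings reduce to the question q.
kappa_a = reliability_estimate([-0.1, -0.2, -0.3], [-0.1, -0.9, -0.8])
```

In this toy case the gaps are 0, 0.7, and 0.5, so `kappa_a` is approximately 0.4.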

02

On-Policy Thinking

\[ \kappa_a>\tau_a \Rightarrow r\sim p_\theta(r\mid a,q) \]

When the draft answer is deemed insufficiently reliable, that is, when \(\kappa_a\) exceeds the threshold \(\tau_a\), invoke on-policy thinking conditioned on the draft answer for reflection and correction.
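The gating rule can be sketched as a small control-flow wrapper. Everything here is an illustrative assumption rather than the authors' implementation: `answer_first`, `draft_fn`, `kappa_fn`, and `think_fn` are hypothetical names, the callables are stubs standing in for model calls, and the sketch assumes that thinking ends by returning a revised answer.

```python
from typing import Callable


def answer_first(
    question: str,
    draft_fn: Callable[[str], str],
    kappa_fn: Callable[[str, str], float],
    think_fn: Callable[[str, str], str],
    tau_a: float,
) -> str:
    """Return the draft directly, or think on top of it when kappa_a > tau_a."""
    draft = draft_fn(question)
    kappa_a = kappa_fn(question, draft)
    if kappa_a > tau_a:
        # Draft judged unreliable: reflect and correct, conditioned on the draft.
        return think_fn(question, draft)
    # Draft judged reliable: skip thinking and save tokens.
    return draft


# Stub callables with made-up outputs, used only to exercise both branches.
revised = answer_first("2 + 2 = ?", lambda q: "5", lambda q, a: 0.9, lambda q, a: "4", tau_a=0.5)
kept = answer_first("2 + 2 = ?", lambda q: "4", lambda q, a: 0.1, lambda q, a: "unused", tau_a=0.5)
```

With these stubs, the first call takes the thinking branch and the second returns its draft unchanged.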

03

Stability Estimator

\[ \begin{aligned} \kappa_r^{(k)}(r_{s_k:s_k+C-1}) := \frac{1}{C}\sum_{t=s_k}^{s_k+C-1} \big[&\log p_\theta(r_t\mid q,a^{(m_k)},r_{T_a+1:t-1})\\ &-\log p_\theta(r_t\mid q,a^{(m_k)},r_{T_a+1:s_k-1};e_{s_k:t-1})\big]. \end{aligned} \]

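A second estimator applies the same contrast to each thinking chunk of length \(C\) starting at position \(s_k\), assessing whether the chunk remains stable when its own preceding tokens are replaced by continuous inputs.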
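A chunked version of the same averaging could look like the sketch below. It assumes the per-token log-probabilities have already been computed, with the continuous-input values obtained by replacing only the tokens inside the current chunk, as in the formula. The function name `stability_estimates` is hypothetical, indices are 0-based, and dropping a trailing partial chunk is our simplification, not something the page specifies.

```python
from typing import List, Sequence


def stability_estimates(
    logp_discrete: Sequence[float],
    logp_continuous: Sequence[float],
    chunk_size: int,
) -> List[float]:
    """One kappa_r per complete chunk: mean discrete-minus-continuous log-prob gap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(logp_discrete) != len(logp_continuous):
        raise ValueError("both sequences must cover the same thinking tokens")
    kappas = []
    n_full_chunks = len(logp_discrete) // chunk_size  # trailing partial chunk is ignored
    for k in range(n_full_chunks):
        start = k * chunk_size
        end = start + chunk_size
        gap = sum(d - c for d, c in zip(logp_discrete[start:end], logp_continuous[start:end]))
        kappas.append(gap / chunk_size)
    return kappas


# Made-up values for four thinking tokens split into two chunks of C = 2.
# The first token of each chunk has no in-chunk prefix, so its gap is zero.
kappas_r = stability_estimates([-0.1, -0.1, -0.2, -0.2], [-0.1, -0.3, -0.2, -1.0], chunk_size=2)
```

Here the two chunk scores come out to roughly 0.1 and 0.4.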

04

Visibility Control

\[ m_{k+1}=\begin{cases} 1, & \kappa_r^{(k)}<\tau_r,\\ 0, & \text{otherwise.} \end{cases} \]

Dynamically expose the draft answer during on-policy thinking to preserve useful partial information while reducing the risk of being misled by unreliable content.
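The visibility rule is a per-chunk threshold test. In the sketch below, `next_visibility` and `visibility_schedule` are hypothetical names, and the initial visibility `m_first` is an assumption, since the rule above only defines \(m_{k+1}\) from \(\kappa_r^{(k)}\).

```python
from typing import List, Sequence


def next_visibility(kappa_r: float, tau_r: float) -> int:
    """m_{k+1} = 1 (draft visible) if kappa_r < tau_r, else 0 (draft hidden)."""
    return 1 if kappa_r < tau_r else 0


def visibility_schedule(kappas_r: Sequence[float], tau_r: float, m_first: int = 1) -> List[int]:
    """Visibility flags for the first chunk (assumed) and every chunk after it."""
    masks = [m_first]
    for kappa_r in kappas_r:
        masks.append(next_visibility(kappa_r, tau_r))
    return masks


# Made-up stability scores for three consecutive thinking chunks.
schedule = visibility_schedule([0.1, 0.5, 0.2], tau_r=0.3)
```

With these numbers the schedule is `[1, 1, 0, 1]`: the draft answer is hidden only for the chunk that follows the unstable one.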

Key Results

Math & STEM, up to +3.34%

Accuracy at the same or lower token usage.

Coding, up to +6.67%

Accuracy at the same or lower token usage.

Agentic, up to +23.03%

Accuracy at the same or lower token usage.

Token Usage, up to -55% / -57% / -45%

At the same or higher Math / Coding / Agentic reasoning accuracy.

Key Notes

  • Reverse Order: CopT reverses the usual order of thinking and answering by eliciting a draft answer and invoking on-policy thinking conditioned on it. This provides earlier access to answers and avoids unnecessary token consumption.
  • Continuous Verifier: Unlike methods that directly use continuous embeddings as a medium for generation, CopT recasts them as inference-time verifiers. This allows CopT to use uncertainty information preserved by continuous embeddings as in latent reasoning while retaining the readability of explicit reasoning.
  • Overhead: The estimators are calculated from already generated sequences, and the corresponding probabilities are obtained in parallel with a single forward pass. CopT therefore incurs only a small overhead once the chosen-token probabilities and continuous embeddings are cached online during generation.
  • Theoretical Analysis: The expected estimate equals the mutual information between the unresolved latent state and the emitted answer token. In other words, CopT measures answer-relevant uncertainty rather than uncertainty over latent reasoning states themselves.
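
One way to read the theoretical statement is through the standard identity for mutual information. The notation here is ours and is an idealization: \(z\) is a hypothetical symbol for the unresolved latent state, conditioning on \(z\) stands in for the discrete input, and the marginal \(p(a)=\sum_z p(z)\,p(a\mid z)\) stands in for the continuous input. The dependence on the question and on earlier tokens is omitted, and this is not a reproduction of the authors' proof.

\[ \mathbb{E}_{(z,a)\sim p(z,a)}\left[\log p(a\mid z)-\log p(a)\right] = \sum_{z,a} p(z,a)\,\log\frac{p(z,a)}{p(z)\,p(a)} = I(Z;A). \]

Under this reading, the expected gap vanishes exactly when the answer token is independent of the latent state, which matches the description of the estimate as answer-relevant uncertainty.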

BibTeX

@misc{shi2026coptcontrastiveonpolicythinking,
      title={CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning}, 
      author={Dachuan Shi and Hanlin Zhu and Xiangchi Yuan and Wanjia Zhao and Kejing Xia and Wen Xiao and Wenke Lee},
      year={2026},
      eprint={2605.20075},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.20075}, 
}