Reliability Estimator
Contrast discrete-token and continuous-embedding inputs to estimate answer reliability.
CopT is a reasoning pipeline with continuous-space verifiers, enabling LLMs to start with a draft answer and invoke on-policy thinking conditioned on it.
Invoke subsequent on-policy thinking conditioned on the draft answer for reflection and correction when the answer is deemed insufficiently reliable.
Use a second estimator to assess whether thinking chunks remain stable under continuous inputs.
Dynamically expose the draft answer during on-policy thinking to preserve useful partial information while reducing the risk of being misled by unreliable content.
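The gating idea above can be illustrated with a small sketch. The page does not specify how the contrast between discrete-token and continuous-embedding inputs is computed, so everything here is an assumption made for illustration: the two input modes are represented by two hypothetical answer distributions, Jensen-Shannon divergence stands in for the contrast, and the names `reliability`, `needs_thinking`, and the `threshold` value are invented for this example rather than taken from CopT.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reliability(p_discrete, p_continuous):
    """Hypothetical score in [0, 1]; higher means the two input modes agree more.

    JS divergence is bounded by ln(2), so dividing by it normalizes the score.
    """
    return 1.0 - js_divergence(p_discrete, p_continuous) / math.log(2)


def needs_thinking(p_discrete, p_continuous, threshold=0.9):
    """Assumed gate: invoke on-policy thinking when the draft looks unreliable."""
    return reliability(p_discrete, p_continuous) < threshold


# Toy distributions over three candidate answers (made-up numbers).
agreeing = needs_thinking([0.8, 0.1, 0.1], [0.8, 0.1, 0.1])
conflicting = needs_thinking([0.9, 0.05, 0.05], [0.05, 0.9, 0.05])
```

In this toy setup, a draft answer whose distribution is unchanged between the two input modes skips further thinking, while a draft whose distribution shifts sharply triggers it. A real system would obtain both distributions from the model itself and would likely use a different contrast measure.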
Accuracy at the same or lower token usage.
At the same or higher Math / Coding / Agentic reasoning accuracy.
@misc{shi2026coptcontrastiveonpolicythinking,
title={CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning},
author={Dachuan Shi and Hanlin Zhu and Xiangchi Yuan and Wanjia Zhao and Kejing Xia and Wen Xiao and Wenke Lee},
year={2026},
eprint={2605.20075},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.20075},
}