Debating If LLM Reasoning Is "Actually Reasoning" Is Meaningless

Posted on Aug 10, 2025
tl;dr: LLM reasoning = decoding candidates with more intermediate tokens + better performance

Ever since chain-of-thought (CoT) prompting 1 was proposed, one of the most heated debates in AI has been whether large language models (LLMs) can truly reason. This debate is meaningless without a clear definition of reasoning. In our context, what we call Reasoning LLMs simply refers to models generating more intermediate tokens before reaching final answers 2. Nothing more is promised.

LLMs, as a type of probabilistic model, learn to model the distribution of words (tokens) during pre-training and post-training. At inference time, their decoder-only Transformers pre-fill the input prompt and then decode tokens autoregressively. What we primarily aim for at test time is this objective:

$$\argmax_{\text{answer}} \mathbb{P}(\text{answer}\ |\ \text{problem})$$
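
To make this concrete, here is a minimal decoding sketch, assuming the Hugging Face transformers API; the "gpt2" checkpoint, the prompt, and the decoding length are all placeholder choices, not anything specific to this post.

```python
# A minimal sketch of greedy autoregressive decoding with the Hugging Face
# transformers API. "gpt2" is a stand-in for any decoder-only causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: What is 7 * 8? A:"
inputs = tokenizer(prompt, return_tensors="pt")  # pre-fill the prompt

# Greedy decoding: at each step, append the single most likely next token.
# Note this only approximates the sequence-level argmax token by token.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```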

Both theoretical 3 and empirical 2,4,5 studies have validated that by decoding more tokens at test time, via various methods (best-of-N, CoT prompting 1, thinking 2), LLMs deliver stronger performance, demonstrating the well-known test-time scaling law 6.
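
For concreteness, a best-of-N sketch under the same assumptions: sample N candidate responses, then keep the one a scoring function prefers. The `score` function below is a hypothetical stand-in for a verifier or reward model, not a real API.

```python
# A minimal best-of-N sketch: sample N candidates, keep the one a scorer
# prefers. `score` is a hypothetical stand-in for a verifier/reward model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def score(text: str) -> float:
    # Hypothetical scorer: replace with a reward model or answer checker.
    return float(len(text))  # placeholder heuristic only

prompt = "Q: What is 7 * 8? Think step by step. A:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sample instead of greedy decoding
        temperature=0.8,
        max_new_tokens=64,
        num_return_sequences=8,  # N = 8 candidate responses
    )

candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(max(candidates, key=score))
```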

One often-overlooked fact is that LLMs, as probabilistic models, are already capable of generating CoT-format responses right after pre-training. The only requirement is moving beyond simple greedy decoding 7: such responses already exist in the pre-trained model's output distribution, just not as the single most likely continuation. A rough sketch of this branching idea follows.
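
This sketch expands the top-k first tokens and then decodes each branch greedily, loosely following the idea in reference 7; the prompt and k are arbitrary illustration choices, and the paper's confidence-based branch scoring is omitted.

```python
# A rough sketch of branching past greedy decoding: expand the top-k
# first tokens, then decode each branch greedily. Some branches surface
# CoT-format responses that pure greedy decoding hides. (The cited paper
# additionally scores branches by answer confidence, omitted here.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: I have 3 apples and buy 2 more. How many apples do I have? A:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    next_logits = model(input_ids).logits[0, -1]   # logits for the 1st new token
    top_k = torch.topk(next_logits, k=5).indices   # k alternative first tokens

for tok in top_k:
    branch = torch.cat([input_ids, tok.view(1, 1)], dim=-1)
    with torch.no_grad():
        out = model.generate(branch, max_new_tokens=48, do_sample=False)
    print(repr(tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)))
```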

Now the question becomes: how do we reshape the decoding distribution to more easily obtain (long) CoT responses? Recently emerging post-training approaches 4,5,8 provide an answer by optimizing this objective:

$$\argmax_{\text{long CoT},\,\text{answer}} \mathbb{P}(\text{long CoT}, \text{answer}\ |\ \text{problem})$$
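
As a toy illustration of how post-training reshapes the decoding distribution, here is a deliberately simplified REINFORCE-style sketch: sample a (CoT, answer) trajectory, reward it if the final answer is correct, and push up the log-probability of rewarded trajectories. This is not the actual recipe of the cited works; methods like GRPO 4,8 add group-relative advantages, clipping, and KL regularization, and the `reward` verifier here is hypothetical.

```python
# A deliberately simplified, REINFORCE-style sketch of the post-training
# objective: sample a full (CoT, answer) trajectory, reward it if the
# final answer is correct, and increase the log-probability of rewarded
# trajectories. Real systems (e.g., GRPO in refs 4 and 8) add baselines,
# group-relative advantages, and KL regularization; none of that is here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def reward(response: str) -> float:
    # Hypothetical verifier: 1.0 if the correct answer appears, else 0.0.
    return 1.0 if "56" in response else 0.0

prompt = "Q: What is 7 * 8? Think step by step. A:"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

# Sample one trajectory: intermediate tokens (the CoT) plus the answer.
with torch.no_grad():
    traj = model.generate(**inputs, do_sample=True, temperature=1.0,
                          max_new_tokens=64)

response = tokenizer.decode(traj[0, prompt_len:], skip_special_tokens=True)
r = reward(response)

# Policy gradient: maximize r * log P(long CoT, answer | problem).
logits = model(traj).logits[:, :-1]                  # position t predicts t+1
targets = traj[:, 1:]
logp = torch.log_softmax(logits, dim=-1)
token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
response_logp = token_logp[:, prompt_len - 1:].sum() # generated span only
loss = -r * response_logp

optimizer.zero_grad()
loss.backward()
optimizer.step()
```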

This brings the intermediate long CoT to the forefront, ahead of the final answer: content that was present but hidden during pre-training is now brought onstage. It is called LLM reasoning only because the CoT format resembles humans' step-by-step reasoning patterns, and because it works pretty well.

The author would gently argue that, instead of debating whether this is “actual reasoning,” it is preferable to build more useful things in practice and to understand them rigorously.

References


  1. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V. and Zhou, D., 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, pp.24824-24837.

  2. OpenAI Blog, 2024. Learning to Reason with LLMs.

  3. Li, Z., Liu, H., Zhou, D. and Ma, T., 2024. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875.

  4. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X. and Zhang, X., 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  5. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C. and Zheng, C., 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  6. Snell, C., Lee, J., Xu, K. and Kumar, A., 2024. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

  7. Wang, X. and Zhou, D., 2024. Chain-of-thought reasoning without prompting. Advances in Neural Information Processing Systems, 37, pp.66383-66409.

  8. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y. and Guo, D., 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.