Large language models (LLMs) exhibit a striking ability to adapt to new tasks on the fly, a capability commonly referred to as in-context learning. Without modifying their underlying parameters, these models can generalize from just a few examples, a trait that has drawn comparisons to implicit Bayesian updating. However, recent theoretical work points to a tension in that analogy: transformers, the foundational architecture behind most LLMs, consistently violate the martingale property, a necessary condition for Bayesian inference over exchangeable data. That gap matters most in settings where calibrated, quantifiable uncertainty is essential.
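Concretely, for an exchangeable sequence the Bayesian posterior predictive must satisfy a martingale condition: averaging the next prediction over the current predictive distribution has to give back the current prediction. The statement below is the standard form of that condition for a discrete observation space, included as a reference point rather than quoted from the analysis discussed here.

```latex
% Martingale condition for the one-step-ahead predictive of an exchangeable sequence:
% today's prediction equals the expectation of tomorrow's prediction
% taken under today's predictive distribution.
p\bigl(x_{n+1} = y \mid x_{1:n}\bigr)
  \;=\;
  \mathbb{E}_{x_{n+1} \sim p(\cdot \mid x_{1:n})}
  \Bigl[\, p\bigl(x_{n+2} = y \mid x_{1:n}, x_{n+1}\bigr) \Bigr]
```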
A recent theoretical analysis looks closely at how transformers handle uncertainty. It identifies positional encodings, the component that gives transformers their awareness of sequence order, as a primary source of deviation from Bayesian behavior, producing martingale violations of logarithmic order. Interestingly, despite these violations, the models still achieve information-theoretic optimality in expected prediction risk. This suggests that transformers are not failing to reason probabilistically, but rather are operating via a fundamentally different, yet still highly efficient, mechanism.
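Because positional encodings make the output depend on demonstration order, one simple diagnostic is to permute the in-context examples and measure how much the next-token predictive distribution moves; a Bayesian learner over exchangeable data would be order-invariant. The sketch below assumes a generic `predict_probs(prompt)` callable returning a probability vector, which is a stand-in for whatever model access is available, not an interface from the analysis itself.

```python
import itertools
import numpy as np

def order_sensitivity(predict_probs, demos, query, max_perms=24):
    """Estimate how far a model is from order-invariance by comparing
    next-token predictive distributions across permutations of the
    in-context demonstrations.

    predict_probs: callable taking a prompt string and returning a 1-D
        array of next-token probabilities (assumed interface).
    demos: list of demonstration strings, e.g. "Q: ... A: ...".
    query: the final query appended after the demonstrations.
    max_perms: cap on how many permutations to evaluate.
    """
    perms = list(itertools.islice(itertools.permutations(demos), max_perms))
    dists = np.stack([
        np.asarray(predict_probs("\n".join(p) + "\n" + query), dtype=float)
        for p in perms
    ])
    mean = dists.mean(axis=0)
    # Total variation distance of each permutation's prediction from the
    # mean prediction: exactly zero for an exchangeable (order-invariant)
    # predictor, positive when demonstration order leaks into the output.
    tv = 0.5 * np.abs(dists - mean).sum(axis=1)
    return float(tv.mean()), float(tv.max())
```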
One particularly notable contribution is a derivation of the optimal length for chain-of-thought reasoning: a concrete formula that balances computational cost against inference quality. Empirical tests on GPT-3 track the theoretical predictions closely, reaching near-perfect entropy efficiency within only 20 demonstration examples. Beyond sharpening our understanding of how LLMs process new information, these results offer practical levers for improving performance and reliability in real-world deployments, and give developers and researchers a concrete framework for rethinking model interpretability, inference costs, and uncertainty quantification.
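The closed-form expression for the optimal chain-of-thought length is not reproduced here, but the underlying tradeoff is straightforward to operationalize: measure task error at a few reasoning-length budgets, attach a per-token cost, and pick the budget that minimizes the combined objective. The sketch below is a generic illustration of that selection rule, with `lambda_cost` and the example numbers as assumed placeholders rather than values from the paper.

```python
import numpy as np

def choose_cot_length(lengths, error_rates, lambda_cost=1e-3):
    """Pick a chain-of-thought token budget by trading off measured error
    against token cost: minimize error(L) + lambda_cost * L.

    lengths: candidate reasoning lengths, in tokens.
    error_rates: measured task error at each length (same order).
    lambda_cost: assumed price of one extra reasoning token, in error units.
    """
    lengths = np.asarray(lengths, dtype=float)
    error_rates = np.asarray(error_rates, dtype=float)
    objective = error_rates + lambda_cost * lengths
    best = int(np.argmin(objective))
    return lengths[best], objective[best]

# Example: error keeps dropping with longer reasoning but with diminishing
# returns, so the cost term determines where to stop (here it picks 128 tokens).
lengths = [0, 32, 64, 128, 256, 512]
errors = [0.42, 0.25, 0.18, 0.15, 0.14, 0.135]
print(choose_cot_length(lengths, errors, lambda_cost=2e-4))
```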