Our Sudoku demo is a simple way to see the difference yourself.
We have Kona solve hard sudoku puzzles in real time alongside a group of popular LLMs, including GPT-5.2, Claude Opus 4.5 and Sonnet 4.5, Gemini 3 Pro, and DeepSeek V3.2. After about a week of public access, Kona solved 96.2% of puzzles in an average of 313 milliseconds. The frontier LLMs together had a solve rate of only 2%, taking up to 90 seconds before returning an incorrect answer or timing out.
The reason LLMs struggle with sudoku is not that the rules are especially complicated: a sudoku puzzle is 81 cells arranged in a 9x9 grid, where each row, column, and 3x3 box must contain the digits 1 through 9 exactly once. But LLM architecture isn’t well-suited for spatial tasks; solving these puzzles requires holding the whole grid in mind and reasoning about how a change in one cell affects every other cell in the same row, column, and box.
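To make that constraint structure concrete, here is a minimal sketch (illustrative only, not code from the demo) of the 27 groups a solver has to satisfy simultaneously:

```python
# Illustrative only: the 27 constraint groups of a sudoku grid -- 9 rows,
# 9 columns, and 9 boxes -- each of which must contain 1 through 9 exactly once.
def units(grid):
    rows = [list(row) for row in grid]
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
        for br in range(0, 9, 3) for bc in range(0, 9, 3)
    ]
    return rows + cols + boxes

def is_valid_solution(grid):
    # A completed grid is valid when every one of the 27 units is exactly {1, ..., 9}.
    return all(set(unit) == set(range(1, 10)) for unit in units(grid))
```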
Because LLMs generate tokens sequentially, committing to each one as they go, they cannot easily revise earlier decisions when they discover a conflict later. This is not a challenge that can be solved by adding more compute or seeing more sudoku puzzles - it is an inherent structural limitation of autoregressive language models.
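As a loose analogy - not a literal account of how an LLM computes - imagine filling the grid one cell at a time, always committing to the first digit that doesn't conflict with what is already on the board, and never going back. The hypothetical `greedy_fill` below fails on most hard puzzles for exactly that reason: some later cell ends up with no legal digit, and the only fix would be revising a choice made long before.

```python
# Loose analogy only: a "commit-as-you-go" filler in the spirit of sequential
# generation. Empty cells are marked 0. It places the first non-conflicting
# digit and never revisits an earlier choice.
def greedy_fill(grid):
    for r in range(9):
        for c in range(9):
            if grid[r][c] != 0:
                continue
            legal = [d for d in range(1, 10) if fits(grid, r, c, d)]
            if not legal:
                return None  # stuck: an earlier commitment was wrong
            grid[r][c] = legal[0]  # commit and move on, never revise
    return grid

def fits(grid, r, c, d):
    # d is legal at (r, c) if it doesn't already appear in that row, column, or box.
    if d in grid[r] or any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[i][j] != d for i in range(br, br + 3) for j in range(bc, bc + 3))
```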
See the live benchmark at sudoku.logicalintelligence.com.
We disabled code execution for all models in the demo to prevent the LLMs from writing a brute-force backtracking solver in Python and running it, which would tell us something about their coding ability but nothing about their reasoning ability. With code execution off, the models have to actually work through the puzzle using whatever reasoning capabilities they have, and those capabilities turn out to be poorly suited to constraint satisfaction.
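For reference, the solver we were ruling out is a few lines of textbook backtracking, sketched below (reusing the `fits` check from the previous snippet). It solves the puzzle precisely by doing the thing sequential generation cannot: trying a digit, hitting a dead end, and undoing the earlier choice.

```python
# Sketch of a brute-force backtracking solver (fits() as defined above).
# It fills the first empty cell, recurses, and unwinds on failure.
def solve(grid):
    for r in range(9):
        for c in range(9):
            if grid[r][c] != 0:
                continue
            for d in range(1, 10):
                if fits(grid, r, c, d):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    grid[r][c] = 0  # undo and try the next digit
            return False  # no digit fits here; backtrack to an earlier cell
    return True  # no empty cells remain: solved
```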
We also did not fine-tune these LLMs on sudoku problems. Fine-tuning would test pattern recognition over memorized solutions, not reasoning. Nor was Kona trained on solved puzzles; it learns the constraint structure and solves novel configurations from that information alone. The comparison is between reasoning architectures, not between levels of domain-specific exposure.
The LLMs receive a puzzle and start producing chains of reasoning: "Let me analyze by rows, columns and boxes," "for cell (4,1), the row needs 1, 2, 3, 5, 7, 8, 9," and so on, sometimes generating hundreds of tokens of step-by-step analysis. Then they output a grid with duplicate digits in a row, a missing digit in a column, or some other constraint violation. DeepSeek V3.2 tends to finish quickly but with many duplicates across rows, columns, and boxes - failing to solve the puzzle. The Claude models often think for a long time and then produce something close but still wrong. GPT-5.2 does best among the LLMs with a 6.9% solve rate, which is still a failure rate above 93% on a puzzle that any patient human can solve.
Kona approaches the problem differently because it's a different kind of model. Instead of generating a solution token-by-token, Kona produces a complete candidate grid and evaluates it against all the constraints simultaneously, assigning a scalar energy score where low energy means the constraints are satisfied and high energy means something is violated. If the energy is high, Kona can identify which constraints are broken and revise the relevant parts of the grid without starting over, using gradient information in continuous latent space to move toward valid configurations. This process takes about 313 milliseconds on average, compared to the 30+ seconds the LLMs spend generating reasoning traces that lead nowhere.
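Kona's actual energy function is learned and differentiable, so the snippet below is only a toy stand-in: it scores a complete candidate grid by counting duplicated digits across the 27 units from the first sketch. It captures the shape of the idea - zero energy means every constraint is satisfied, and the broken units point at exactly the cells worth revising - but none of the continuous, gradient-based machinery.

```python
# Toy stand-in, not Kona's learned energy: score a complete candidate grid by
# counting duplicated digits in every row, column, and box (units() from the
# first sketch). A valid solution scores 0; every violation raises the score.
def energy(grid):
    return sum(len(unit) - len(set(unit)) for unit in units(grid))

def violated_units(grid):
    # Indices of the broken constraint groups (0-8 rows, 9-17 columns,
    # 18-26 boxes): where a revision should focus instead of starting over.
    return [i for i, unit in enumerate(units(grid)) if len(unit) != len(set(unit))]
```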
The cost difference is also notable. Running Kona on the nearly 13,000 puzzles it has solved so far has cost about $4 in GPU time and used roughly 1.1 hours of compute. The LLM API calls for the same demo - at a 98% failure rate - have cost around $11,000, mostly from Claude's tokens.
If a model architecture can’t handle sudoku, where the constraints are simple and the state space is tiny, it is not going to handle supervising industrial control systems or billions of dollars in global transactions. The gap between 96% and 2% is not about sudoku - it’s a measure of how differently these two architectures are designed.

