POPL 2026
Sun 11 - Sat 17 January 2026 Rennes, France

We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny’s automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs—requiring no manual proof steps. For problems where empty proofs fail, we evaluate 14 off-the-shelf LLMs on providing proof hints. Our best model achieves 55.7% pass@4 success rate employed iterative error correction. Our results highlight an effective division of labor: rather than generating complete proofs, LLMs provide high-level guidance while automation handles low-level details.