
Large language models (LLMs) have demonstrated significant potential for automating formal program verification, particularly for generating inductive loop invariants. We present DafnyPro, a novel inference-time framework for generating verification annotations in Dafny. Using Claude 3.5 Sonnet, DafnyPro achieves 86% correct proofs on the challenging DafnyBench benchmark, a 16-percentage-point improvement over the previous state of the art with the same model. We also introduce a suite of smaller, cost-efficient fine-tuned models: our 14B model achieves 70% correct proofs and our 7B model reaches 68% on DafnyBench. These results demonstrate that high verification accuracy can be maintained even with compact models.
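
To illustrate the task, the sketch below is a minimal, hypothetical Dafny method (not taken from DafnyBench or from the paper). The invariant clauses are the kind of inductive loop invariants a tool such as DafnyPro must supply: without them the Dafny verifier cannot discharge the postconditions.

// Hypothetical example of the annotation task: the "invariant" clauses
// are what must be generated for the verifier to accept the method.
method MaxArray(a: array<int>) returns (m: int)
  requires a.Length > 0
  ensures forall k :: 0 <= k < a.Length ==> a[k] <= m
  ensures exists k :: 0 <= k < a.Length && a[k] == m
{
  m := a[0];
  var i := 1;
  while i < a.Length
    // Inductive loop invariants; Dafny rejects the method without them.
    invariant 1 <= i <= a.Length
    invariant forall k :: 0 <= k < i ==> a[k] <= m
    invariant exists k :: 0 <= k < i && a[k] == m
  {
    if a[i] > m {
      m := a[i];
    }
    i := i + 1;
  }
}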