Large Language Models (LLMs) are now widely applied to code generation, and recent studies have explored their ability to produce programs in verification languages, including Dafny. Although several benchmarks have been proposed to assess LLM performance in generating Dafny code, existing datasets have notable limitations: unclear verification goals, limited diversity in verification complexity, potential data contamination, and heavy reliance on manually collected examples that are publicly available online. In this work, we introduce TACoDafny, an automated pipeline for constructing Dafny verification benchmarks directly from task descriptions sourced from online programming repositories. As a preliminary result, we generated 76 Dafny programs with verifiable properties derived from the task descriptions of the first 200 LeetCode problems. Our pipeline not only enables large-scale benchmark generation but also facilitates the creation of additional verifiable Dafny code that can be used to train models and further improve their verification capabilities.
Valentina Wu (Faculdade de Engenharia, Universidade do Porto); Alexandra Mendes (Faculty of Engineering, University of Porto & INESC TEC); Alexandre Abreu (University of Porto & INESC TEC)