We present a benchmark for vericoding: the generation of formally verified code from formal specifications, in contrast to vibe coding, which generates potentially buggy code from natural-language descriptions. Rapid AI progress has popularized LLM-based generation of computer programs from natural-language descriptions. Unfortunately, the resulting code can be buggy, and traditional testing can typically demonstrate only the presence of bugs, not their absence. Formal verification, by contrast, provides rigorous correctness guarantees: a machine-checkable proof that code meets its human-written specification.
To support automation of formal verification and vericoding, we present an extensive suite of formal specifications for Lean, Rust/Verus, and Dafny. We describe how we assembled and curated these tasks, and we report vericoding results from straightforward LLM-prompting experiments.
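To make the vericoding setting concrete, the following is a minimal sketch of what such a task might look like in Lean 4 (the function name `myMax` and the theorem statements are illustrative, not drawn from the benchmark): the specification is the theorem statement, the generated artifact is the function together with its proof, and the Lean kernel machine-checks the result.

```lean
-- Hypothetical vericoding task: implement a maximum function on naturals
-- and prove it satisfies the given specification.
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- Specification: the result is an upper bound of both arguments.
theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax
  split
  next h => exact h                 -- case a ≤ b: goal is a ≤ b
  next _ => exact Nat.le_refl a     -- case ¬ a ≤ b: goal is a ≤ a

theorem myMax_ge_right (a b : Nat) : b ≤ myMax a b := by
  unfold myMax
  split
  next _ => exact Nat.le_refl b     -- case a ≤ b: goal is b ≤ b
  next h => exact Nat.le_of_lt (Nat.gt_of_not_le h)  -- ¬ a ≤ b gives b < a
```

Unlike a test suite, which samples inputs, these theorems hold for all inputs once the proof checker accepts them; a vericoding benchmark scores whether an LLM can produce such proof-carrying code from the specification alone.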
Valentina Wu (Faculty of Engineering, University of Porto), Alexandra Mendes (Faculty of Engineering, University of Porto & INESC TEC), Alexandre Abreu (University of Porto & INESC TEC)