Automated Proof Generation for Rust Code via Self-Evolution
“we introduce SAFE, a novel framework that overcomes the lack of human-written proof […] [SAFE achieves] a 70.50% accuracy rate in a benchmark crafted by human experts, [vs] GPT-4o's performance of 24.46%”
arxiv.org/abs/2410.15756
From the AI + FM papers feed
Can LLMs Enable Verification in Mainstream Programming?
“… we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus) […] we use manually curated datasets derived from the state-of-the-art Python benchmark, HumanEval”
arxiv.org/abs/2503.14183
Formal Verification is Overrated
“Zac Hatfield-Dodds [argues] that relying solely on verification methods may not provide real AI safety”
youtu.be/bs5snugP1VA?...
Proving the Coding Interview: A Benchmark for Formally Verified Code Generation
“We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples […] including 1083 curated and quality controlled samples”
arxiv.org/abs/2502.05714
Super excited: my new @darpa program on AI for pure mathematics!
Exponentiating Mathematics (expMath) aims to accelerate the rate of progress in pure math through the development of an AI collaborator and new professional-level math benchmarks.
sam.gov/opp/4def3c13...
LLM-Assisted Static Analysis for Detecting Security Vulnerabilities
"[We combine] LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. [...] IRIS leverages LLMs to infer taint specifications and perform contextual analysis"
arxiv.org/abs/2405.17238
AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement
"AlphaVerus [is] a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language"
arxiv.org/abs/2412.06176
VerifAI: AI Verification in the Wild @ ICLR 2025
"This workshop explores the intersection of scale-driven generative artificial intelligence (AI) and the correctness-focused principles of verification."
verifai-workshop.github.io
Laurel: Generating Dafny Assertions Using Large Language Models
"...we propose Laurel, a tool that uses LLMs to automatically generate helper assertions for Dafny [...] Laurel is able to generate over 50% of the required helper assertions given only a few attempts"
arxiv.org/abs/2405.16792
Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs - “we introduce […] a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems”
arxiv.org/abs/2210.12283
Sounds like the LLM essentially has a set of operations it can perform, and the policy constrains these operations to the reasonable ones, e.g. “a user can’t buy a ticket for less than <minimum price>”
Looks like the process is: (1) free-text input of the developer’s policy docs, (2) translation (by an LLM?) into a candidate set of SAT-checkable policy constraints, (3) audit of the policies by the developers, (4) automatic enforcement of the policies on incoming operations generated by the LLM from user interaction
“Today, we’re adding Automated Reasoning checks (preview) as a new safeguard in Amazon Bedrock Guardrails to help you mathematically validate the accuracy of responses generated by large language models (LLMs)”
aws.amazon.com/blogs/aws/pr...
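The enforcement step (4) might look roughly like the sketch below. This is a minimal pure-Python illustration, not the Bedrock API: the `PolicyConstraint` class, `MIN_PRICE` value, and `enforce` function are all assumed names invented for the example.

```python
# Toy sketch of step (4): checking an LLM-proposed operation against an
# audited set of policy constraints before executing it. In the real system
# the constraints are solver-checkable logic; here each one is a predicate.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PolicyConstraint:
    description: str               # the human-auditable policy text
    check: Callable[[dict], bool]  # True iff the operation satisfies the policy

MIN_PRICE = 20.0  # hypothetical minimum ticket price from the policy docs

policies = [
    PolicyConstraint(
        description=f"Ticket purchases must be priced at least {MIN_PRICE}",
        check=lambda op: op.get("type") != "buy_ticket" or op["price"] >= MIN_PRICE,
    ),
]

def enforce(operation: dict) -> tuple[bool, list[str]]:
    """Return (allowed, violated policy descriptions) for a proposed operation."""
    violations = [p.description for p in policies if not p.check(operation)]
    return (not violations, violations)

# An LLM-generated "buy a $5 ticket" operation is rejected with the policy
# that it violates, which can be surfaced back to the model or the user.
allowed, why = enforce({"type": "buy_ticket", "price": 5.0})
print(allowed, why)
```

The value of phrasing policies as checkable constraints (rather than prompt instructions) is that every rejection comes with the specific audited rule that was violated.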
Since all my Twitter content is now gone, I will start reposting some of it here. Here are the slides for my talk on the coming wave of ML-accelerated formal methods, given at the Isaac Newton Institute last month. May interest some of you.
drive.google.com/file/d/1ybQx...
Previously #2, another SQLite3 bug discovered by Team Atlanta on DARPA AIxCC: team-atlanta.github.io/blog/post-as...
They target variants of known bugs: “By providing a starting point […] we remove a lot of ambiguity from vulnerability research, and start from a concrete, well-founded theory: "This was a previous bug; there is probably another similar one somewhere"
Big Sleep: Google’s Project Zero team finds a real vulnerability in SQLite using an LLM-based agent googleprojectzero.blogspot.com/2024/10/from...
"Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming" - LLM-based proof generation for F*, and a 600K LoC dataset of F* programs and proofs, suitable for ML applications. Impressive results synthesizing real-world proofs about programs!
arxiv.org/abs/2405.01787
Recent NSF award with several of the same authors: "Aligning Code-Generating Models with Formal Specifications" www.nsf.gov/awardsearch/...
Grammar-Aligned Decoding - "[We propose] a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM's distribution conditioned on the given grammar constraint" arxiv.org/abs/2405.21047 @lorisdanto.bsky.social
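The distribution-matching guarantee is the interesting part: naively masking ungrammatical tokens at each step and renormalizing locally does not, in general, sample from the LM's distribution conditioned on the grammar. A toy two-token example (hand-built "LM" probabilities and a trivial "must end in b" grammar, both invented for illustration) shows the mismatch:

```python
# Toy illustration of the distribution mismatch that grammar-aligned
# decoding addresses. Vocabulary {a, b}, strings of length 2; the
# "grammar" accepts only strings whose second token is 'b'.
first = {"a": 0.9, "b": 0.1}                 # P(first token)
second = {"a": {"a": 0.9, "b": 0.1},         # P(second token | first)
          "b": {"a": 0.5, "b": 0.5}}

def lm_prob(s: str) -> float:
    """Joint LM probability of a two-token string."""
    return first[s[0]] * second[s[0]][s[1]]

grammatical = ["ab", "bb"]

# Target distribution (what GAD provably matches): the LM's distribution
# conditioned on grammaticality, i.e. renormalized over accepted strings.
z = sum(lm_prob(s) for s in grammatical)
gad = {s: lm_prob(s) / z for s in grammatical}

# Naive constrained decoding: mask locally-invalid tokens and renormalize
# per step. The first step is unconstrained (either prefix can still be
# completed grammatically); the second step is forced to 'b'.
masked = {"ab": first["a"], "bb": first["b"]}

print(gad)      # P(ab) = 0.09/0.14 ≈ 0.643, P(bb) ≈ 0.357
print(masked)   # P(ab) = 0.9,  P(bb) = 0.1  -- a different distribution
```

The local masking over-weights "ab" because it never accounts for how little of the mass under the "a" prefix survives the grammar constraint.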