
WG211/M25Barke


Toward Trustworthy LLM Generation: From Verifiable Proofs to Reliable Agents

Verifiable programming provides a concrete foundation for trustworthy LLM generation. This talk introduces DeepProof-32B, a 32-billion-parameter reasoning model designed for program verification and trained with reinforcement learning from verifiable rewards, achieving state-of-the-art performance among open models. The first part distills lessons from building models that can reason about their own steps and shows how these principles extend naturally to domains such as test-case generation and program optimization.
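The abstract does not say how the verifiable reward is computed, but the defining property of reinforcement learning from verifiable rewards is that a mechanical checker, not a learned judge, scores each candidate. Below is a minimal Python sketch of such a reward under assumptions of my own: `checker_cmd` is a hypothetical placeholder for whatever proof checker or verifier the training pipeline actually invokes, assumed to exit 0 exactly when the input verifies.

```python
import subprocess
import tempfile
from pathlib import Path

def verifiable_reward(candidate: str, checker_cmd: list[str]) -> float:
    """Binary reward: 1.0 iff an external verifier accepts the candidate.

    `checker_cmd` is a placeholder (the abstract names no specific
    verifier); it is assumed to take a source file path as its last
    argument and exit 0 exactly when the program verifies.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".src", delete=False) as f:
        f.write(candidate)
        path = f.name
    try:
        result = subprocess.run(
            checker_cmd + [path],
            capture_output=True,
            timeout=60,  # bound verifier runtime; treat timeouts as failure
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        Path(path).unlink(missing_ok=True)
```

Because the reward is computed by an external checker rather than a model, it cannot be gamed by fluent but incorrect output, which is what makes it a usable training signal for proof and test-case generation alike.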

The second part builds on this foundation by examining how the same requirements for traceability, verifiability, and structured reasoning apply when LLMs act as agents. Real-world deployments in retail, airline operations, incident management, and Microsoft 365 Copilot demand not only correct generation but also interpretable decision-making. I present a debugging framework that analyzes multi-step agent trajectories, introduces a taxonomy of failure modes, and describes a “library learning” technique that synthesizes reusable reasoning abstractions on the fly. Together, these two threads chart a path toward the next generation of accurate, reliable, and safe LLM systems.
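The abstract does not spell out the failure-mode taxonomy or the library-learning algorithm, so the following Python sketch only illustrates the general shape of trajectory-level debugging: the failure categories are placeholders of my own, and the pluggable `judge` callable stands in for whatever classifier (rule set or LLM grader) the actual framework uses.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional

class FailureMode(Enum):
    # Placeholder categories; the talk's actual taxonomy is not public.
    WRONG_TOOL = auto()           # tool unsuited to the current subgoal
    BAD_ARGUMENTS = auto()        # right tool, malformed or irrelevant args
    IGNORED_OBSERVATION = auto()  # next step contradicts the tool's output
    PREMATURE_ANSWER = auto()     # answered before gathering enough evidence

@dataclass
class Step:
    thought: str      # the agent's stated reasoning for this step
    tool: str         # tool or API invoked
    args: dict        # arguments passed to the tool
    observation: str  # what the environment returned

def annotate(
    trajectory: list[Step],
    judge: Callable[[Step, list[Step]], Optional[FailureMode]],
) -> list[tuple[int, FailureMode]]:
    """Label each step of a multi-step trajectory with at most one
    failure mode. The judge sees the step plus its prefix context, so
    the output is a per-step diagnosis rather than a single pass/fail
    verdict for the whole run."""
    findings = []
    for i, step in enumerate(trajectory):
        mode = judge(step, trajectory[:i])
        if mode is not None:
            findings.append((i, mode))
    return findings
```

Per-step diagnoses of this kind are also the natural input for library learning: recurring successful step patterns can be abstracted into reusable reasoning fragments, while recurring failure labels indicate where such abstractions are missing.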