WG211/M25Barke
Toward Trustworthy LLM Generation: From Verifiable Proofs to Reliable Agents
Verifiable programming provides a concrete foundation for trustworthy LLM generation. This talk introduces DeepProof-32B, a 32-billion-parameter reasoning model designed for program verification and trained with reinforcement learning from verifiable rewards, achieving state-of-the-art performance among open models. The first part distills lessons from building models that can reason about their own steps and shows how these principles extend naturally to domains such as test-case generation and program optimization.
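As a minimal sketch of the reinforcement-learning-from-verifiable-rewards idea, assuming a Lean-style checker invoked as a command-line tool (the checker command, file suffix, and timeout below are placeholders, not DeepProof-32B's actual training harness):

```python
import subprocess
import tempfile

# Hypothetical verifier command; any checker that exits with status 0 on success
# (Lean, Coq, Dafny, a unit-test harness, ...) plays the same role here.
CHECKER_CMD = ("lean",)

def verifiable_reward(candidate_proof: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the checker accepts the candidate, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(candidate_proof)
        path = f.name
    result = subprocess.run([*CHECKER_CMD, path], capture_output=True, timeout=60)
    return 1.0 if result.returncode == 0 else 0.0
```

During training, a checker-backed signal of this kind stands in for a learned reward model, which is what makes the reward "verifiable" rather than merely estimated.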
The second part builds on this foundation by examining how the same requirements for traceability, verifiability, and structured reasoning apply when LLMs act as agents. Real-world deployments in retail, airline operations, incident management, and Microsoft 365 Copilot demand not only correct generation but also interpretable decision-making. I present a debugging framework that analyzes multi-step agent trajectories, introduce a taxonomy of failure modes, and describe a “library learning” technique that synthesizes reusable reasoning abstractions on the fly. Together, these two threads chart a path toward the next generation of accurate, reliable, and safe LLM systems.
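As a purely illustrative sketch of what trajectory debugging against a failure-mode taxonomy can look like (the step schema, the taxonomy labels, and the `judge` callable below are assumptions, not the framework presented in the talk):

```python
from dataclasses import dataclass
from enum import Enum, auto

class FailureMode(Enum):
    """Illustrative taxonomy of agent failure modes (not the talk's taxonomy)."""
    WRONG_TOOL = auto()           # called a tool that cannot satisfy the subgoal
    BAD_ARGUMENTS = auto()        # right tool, malformed or irrelevant arguments
    IGNORED_OBSERVATION = auto()  # next action contradicts what the tool returned
    PREMATURE_ANSWER = auto()     # final answer emitted before the task was done
    NONE = auto()                 # step looks fine

@dataclass
class Step:
    thought: str      # the agent's reasoning at this step
    action: str       # e.g. "search_flights(date='2024-07-01')"
    observation: str  # tool output the agent saw after acting

def tag_trajectory(steps: list[Step], judge) -> list[FailureMode]:
    """Label each step of a multi-step trajectory with a failure mode.

    `judge` is any callable (an LLM prompt, a rule set, a human annotation UI)
    that maps a Step to a FailureMode.
    """
    return [judge(step) for step in steps]
```

Labels of this sort are what let reusable reasoning abstractions be mined from recurring failure patterns, in the spirit of the library-learning technique mentioned above.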