Wang, Huanting and Jacob, Dejice and Kelly, David and Elkhatib, Yehia and Singer, Jeremy and Wang, Zheng (2025) SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair. In: Proceedings of the 2025 ACM SIGPLAN International Symposium on Memory Management. ACM, New York, pp. 27-40. ISBN 9798400716102
Full text not available from this repository.
Abstract
Large language models (LLMs) hold great promise for automating software vulnerability detection and repair, but ensuring their correctness remains a challenge. While recent work has developed benchmarks for evaluating LLMs in bug detection and repair, existing studies rely on handcrafted datasets that quickly become outdated. Moreover, systematic evaluation of advanced reasoning-based LLMs using chain-of-thought prompting for software security is lacking. We introduce SecureMind, an open-source framework for evaluating LLMs in vulnerability detection and repair, focusing on memory-related vulnerabilities. SecureMind provides a user-friendly Python interface for defining test plans, which automates data retrieval, preparation, and benchmarking across a wide range of metrics. Using SecureMind, we assess 10 representative LLMs, including 7 state-of-the-art reasoning models, on 16K test samples spanning 8 Common Weakness Enumeration (CWE) types related to memory safety violations. Our findings highlight the strengths and limitations of current LLMs in handling memory-related vulnerabilities.