SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair

Wang, Huanting and Jacob, Dejice and Kelly, David and Elkhatib, Yehia and Singer, Jeremy and Wang, Zheng (2025) SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair. In: Proceedings of the 2025 ACM SIGPLAN International Symposium on Memory Management. ACM, New York, pp. 27-40. ISBN 9798400716102

Full text not available from this repository.

Abstract

Large language models (LLMs) hold great promise for automating software vulnerability detection and repair, but ensuring their correctness remains a challenge. While recent work has developed benchmarks for evaluating LLMs in bug detection and repair, existing studies rely on handcrafted datasets that quickly become outdated. Moreover, systematic evaluation of advanced reasoning-based LLMs using chain-of-thought prompting for software security is lacking. We introduce SecureMind, an open-source framework for evaluating LLMs in vulnerability detection and repair, focusing on memory-related vulnerabilities. SecureMind provides a user-friendly Python interface for defining test plans, which automates data retrieval, preparation, and benchmarking across a wide range of metrics. Using SecureMind, we assess 10 representative LLMs, including 7 state-of-the-art reasoning models, on 16K test samples spanning 8 Common Weakness Enumeration (CWE) types related to memory safety violations. Our findings highlight the strengths and limitations of current LLMs in handling memory-related vulnerabilities.
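The abstract describes a Python interface for declaring test plans that drive data retrieval, preparation, and benchmarking. As an illustration only, the sketch below shows what such a declarative test plan might look like; the names (`TestPlan`, `models`, `cwe_types`, `metrics`) and the example model/CWE identifiers are hypothetical assumptions, not taken from the actual SecureMind API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a declarative test plan; these names are
# illustrative and are NOT the real SecureMind interface.
@dataclass
class TestPlan:
    """Declares which models, CWE types, and metrics a benchmark run covers."""
    models: list[str]                  # LLMs under evaluation (example names)
    cwe_types: list[str]               # memory-safety CWE categories to test
    metrics: list[str] = field(
        default_factory=lambda: ["detection_accuracy", "repair_rate"]
    )

    def summary(self) -> str:
        # A compact description of the planned benchmark matrix.
        return (f"{len(self.models)} model(s) x {len(self.cwe_types)} "
                f"CWE type(s); metrics: {', '.join(self.metrics)}")

# Example: two illustrative models against two memory-safety CWEs
# (CWE-416 use-after-free, CWE-787 out-of-bounds write).
plan = TestPlan(
    models=["model-a", "model-b"],
    cwe_types=["CWE-416", "CWE-787"],
)
print(plan.summary())
```

A declarative plan object like this would let the framework automate the downstream steps (fetching samples, running models, scoring) from one specification, which matches the workflow the abstract outlines.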

Item Type:
Contribution in Book/Report/Proceedings
Uncontrolled Keywords:
/dk/atira/pure/subjectarea/asjc/1700/1708
Subjects:
bug repair; large language models; software bug detection; hardware and architecture; software
ID Code:
231001
Deposited By:
Deposited On:
13 Feb 2026 15:05
Refereed?:
Yes
Published?:
Published
Last Modified:
13 Feb 2026 23:30