Towards robust software vulnerability detection: exploring machine learning models, datasets, and explainability

Debeyan, Fahad and Hall, Tracy (2025) Towards robust software vulnerability detection: exploring machine learning models, datasets, and explainability. PhD thesis, Lancaster University.

2025debeyanphd.pdf - Published Version
Restricted to Repository staff only until 31 December 2025.
Available under License Creative Commons Attribution-NonCommercial-NoDerivs.

Abstract

Software vulnerability detection and prediction aim to reduce the costs associated with identifying software security vulnerabilities. Studies have shown that many vulnerability prediction models underperform in practical applications compared to the performance results reported on vulnerability prediction datasets. This performance drop can be attributed to datasets containing synthetic or biased data. Additionally, most vulnerability prediction models operate on a binary classification basis (vulnerable or non-vulnerable), leaving developers to determine the context of vulnerabilities, such as identifying vulnerable lines of code and specific vulnerability types.

Aims: This thesis aims to enhance the robustness of software vulnerability detection. To achieve this aim, I explore ways to improve vulnerability prediction models by generating fine-grained predictions and increasing model accuracy. Furthermore, I investigate methods to enhance the quality of vulnerability prediction datasets by identifying biases in existing datasets. Lastly, I examine the impact of explanations on the ability of software practitioners to validate detected software vulnerabilities (i.e., true positive vulnerabilities) and to fix them correctly.

Methods: I propose a novel approach to cluster software vulnerability types using abstract syntax tree (AST) N-grams as model features. I trained various vulnerability prediction models on training sets with different ratios of 'easy negatives' (very different from the positive data) and 'hard negatives' (closely similar to the positive data), and evaluated these models on test sets comprising entire projects. Additionally, I used eXplainable AI (XAI) to obtain line-level attributions from LineVul, a state-of-the-art model, and compared these attributions to the actual vulnerable lines. Finally, I surveyed 99 software practitioners to assess the effect of four types of vulnerability explanations on their ability to validate and fix vulnerabilities correctly.

Results: Using a random forest model with AST N-grams as features, I successfully clustered seven types of vulnerabilities, achieving a Matthews Correlation Coefficient (MCC) of up to 81%. I found that the ratio of easy to hard negatives in a vulnerability prediction dataset significantly affects model performance: when evaluated on entire projects, models trained on datasets with more easy negatives performed better, reaching a performance plateau at a ratio of 15 easy negatives per vulnerable instance. Through XAI, I enhanced the MSR dataset and the LineVul model, increasing both the F-measure (from 92% to 97%) and the MCC (from 92% to 96%). Additionally, vulnerability explanations were found to help developers validate and correctly fix vulnerabilities, with short-form text-based explanations being more effective and preferred by software practitioners. Lastly, I observed that software practitioners are willing to accept some reduction in detection accuracy in exchange for improved explainability.

Conclusions: This thesis presents an approach to generate finer-grained software vulnerability predictions by clustering vulnerability types. I introduced the concept of easy and hard negatives in vulnerability prediction datasets, offering a deeper understanding of what constitutes a high-quality dataset. Using XAI, I identified two biases in a widely used vulnerability prediction dataset and uncovered a limitation in a state-of-the-art model. Finally, I provided insights into the types of explanations that developers find useful, guiding future research and tool development towards more effective vulnerability explanations.
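To make the AST N-gram idea concrete, here is a minimal sketch of how such features could feed a random forest classifier. This is not the pipeline from the thesis: it uses Python's ast module and scikit-learn, and the toy snippets, labels, and parameter choices are hypothetical.

    # Illustrative sketch only: AST N-gram features via Python's `ast` module
    # and scikit-learn. All snippets and labels are made up for illustration.
    import ast
    from collections import Counter

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction import DictVectorizer

    def ast_ngrams(source: str, n: int = 3) -> Counter:
        """Count sliding-window N-grams over the tree's node-type sequence."""
        nodes = [type(node).__name__ for node in ast.walk(ast.parse(source))]
        return Counter(" ".join(nodes[i:i + n]) for i in range(len(nodes) - n + 1))

    # Toy corpus of (snippet, vulnerability-type) pairs.
    samples = [
        ("cursor.execute('SELECT * FROM t WHERE id=' + uid)", "sql_injection"),
        ("cursor.execute('DELETE FROM t WHERE id=' + uid)", "sql_injection"),
        ("open(base_dir + user_path).read()", "path_traversal"),
        ("open(root + requested_file)", "path_traversal"),
    ]

    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform([ast_ngrams(src) for src, _ in samples])
    y = [label for _, label in samples]

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    # An unseen snippet whose AST resembles the path_traversal samples.
    print(clf.predict(vectorizer.transform([ast_ngrams("open(p + q)")])))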
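The easy/hard negative split can be sketched in the same spirit. The similarity measure and threshold below are assumptions for illustration (the abstract does not specify how closeness to the positive data is computed); only the 15:1 easy-negative ratio comes from the results above.

    # Hedged sketch of controlling the easy:hard negative ratio in a training
    # set. Similarity to the positive data is approximated here with cosine
    # similarity over feature vectors; the thesis does not prescribe this.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def split_negatives(neg_X, pos_X, threshold=0.9):
        """Split negatives into hard (close to some positive) and easy (the rest)."""
        sim = cosine_similarity(neg_X, pos_X).max(axis=1)
        return neg_X[sim >= threshold], neg_X[sim < threshold]

    def build_training_set(pos_X, easy_X, hard_X, easy_ratio=15, hard_ratio=1, seed=0):
        """Sample `easy_ratio` easy and `hard_ratio` hard negatives per positive."""
        rng = np.random.default_rng(seed)
        n_pos = len(pos_X)
        easy = easy_X[rng.choice(len(easy_X), min(easy_ratio * n_pos, len(easy_X)), replace=False)]
        hard = hard_X[rng.choice(len(hard_X), min(hard_ratio * n_pos, len(hard_X)), replace=False)]
        X = np.vstack([pos_X, easy, hard])
        y = np.array([1] * n_pos + [0] * (len(easy) + len(hard)))
        return X, y

    # Toy demo on random feature vectors.
    rng = np.random.default_rng(0)
    pos, neg = rng.random((10, 8)), rng.random((500, 8))
    hard, easy = split_negatives(neg, pos, threshold=0.95)
    X, y = build_training_set(pos, easy, hard)
    print(X.shape, int(y.sum()))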
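Finally, comparing line-level attributions against the actual vulnerable lines reduces to a ranking check. The sketch below assumes per-line attribution scores have already been exported from an explainer (e.g. from LineVul's attention weights; the export mechanism and example values are hypothetical).

    # Minimal sketch: does any truly vulnerable line rank among the top-k
    # lines by attribution score? Scores and line numbers are made up.
    def top_k_hit(line_scores: dict[int, float], vulnerable_lines: set[int], k: int = 5) -> bool:
        """True if a ground-truth vulnerable line is among the k highest-scored lines."""
        ranked = sorted(line_scores, key=line_scores.get, reverse=True)[:k]
        return bool(set(ranked) & vulnerable_lines)

    scores = {1: 0.02, 2: 0.41, 3: 0.08, 4: 0.33, 5: 0.16}
    print(top_k_hit(scores, vulnerable_lines={4}, k=2))  # line 4 ranks second -> True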

Item Type:
Thesis (PhD)
Uncontrolled Keywords:
Research Output Funding: yes (externally funded)
ID Code:
229531
Deposited By:
Deposited On:
29 May 2025 09:00
Refereed?:
No
Published?:
Published
Last Modified:
29 May 2025 09:00