THE USE OF LARGE LANGUAGE MODELS TO AUTOMATICALLY CATEGORISE USER FEEDBACK FOR GAMES

Hemingway, Callum and Hall, Tracy and Ezzini, Saad (2025) THE USE OF LARGE LANGUAGE MODELS TO AUTOMATICALLY CATEGORISE USER FEEDBACK FOR GAMES. Masters thesis, Lancaster University.

PURE_Callum_Hemingway_Msc_By_Research_Dissertation_Submission.pdf - Published Version (4MB)

Abstract

This study investigates the use of large language models (LLMs) to automate the categorisation of unstructured and informal bug reports for video games, aiming to help developers organise their feedback more effectively. The study seeks to answer two research questions: "How useful do developers find categories produced by Large Language Models?" and "How reliable are different popular Large Language Models at categorising unstructured and informal bug reports for games?" A dataset of unstructured and informal bug reports was collected from the video game distribution platform Steam, and a random sample of posts from this dataset was then used to generate 10 primary categories and 10 sub-categories based on a selected primary category. To answer the first research question, video game developers were recruited through the survey platform Prolific and instructed to rate the generated categories on their perceived usefulness, with the option to provide additional qualitative feedback for each category. Developers tended to rate higher-level (vaguer) categories as more useful, suggesting that, for developers, the usefulness of a bug report category is tied to how frequently bug reports will be assigned to it. However, several developers noted in their optional qualitative feedback that the vagueness of the categories would reduce their practicality in real-world scenarios. To answer the second research question, three popular LLMs were used to categorise the same sample dataset. Cohen's Kappa, a measure of inter-rater reliability, was then used to compare each model's categorisations against the other models and against a human reviewer. The findings suggest that even older models can perform this task with high reliability, and that newer, potentially more expensive and computationally demanding models are not required for it.
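
The thesis's own analysis code is not reproduced here, but as a rough illustration of the inter-rater reliability comparison described in the abstract, the sketch below computes Cohen's Kappa between two raters' category assignments using scikit-learn's cohen_kappa_score. The category labels and example assignments are hypothetical placeholders, not the categories or data used in the study.

    # Minimal sketch: Cohen's Kappa between two raters' bug-report categorisations.
    # The labels and assignments below are hypothetical placeholders, not the
    # categories or data from the thesis.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical primary categories assigned to the same ten bug reports
    # by an LLM and by a human reviewer.
    llm_labels   = ["Crash", "UI", "Performance", "Crash", "Audio",
                    "UI", "Crash", "Performance", "UI", "Crash"]
    human_labels = ["Crash", "UI", "Performance", "UI", "Audio",
                    "UI", "Crash", "Performance", "UI", "Crash"]

    # Cohen's Kappa corrects raw agreement for chance agreement:
    # kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    # and p_e is the agreement expected from each rater's label frequencies.
    kappa = cohen_kappa_score(llm_labels, human_labels)
    print(f"Cohen's Kappa: {kappa:.3f}")

A kappa near 1 indicates near-perfect agreement between the two raters, while a value near 0 indicates agreement no better than chance; pairwise values of this kind are what allow the models' categorisations to be compared against each other and against the human reviewer.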

Item Type: Thesis (Masters)
ID Code: 233221
Deposited By:
Deposited On: 24 Oct 2025 11:15
Refereed?: No
Published?: Published
Last Modified: 24 Oct 2025 11:15