Developing Novel Statistical Modelling Frameworks for the Study of Early Literacy Acquisition in Game-Based Learning Environments

Ma, Yawen and Ushakova, Anastasia and Cain, Kate and Wallin, Gabriel (2026) Developing Novel Statistical Modelling Frameworks for the Study of Early Literacy Acquisition in Game-Based Learning Environments. PhD thesis, Lancaster University.

Text (2026YawenMaPhD)
2026YawenMaPhD.pdf - Published Version
Restricted to Repository staff only until 8 May 2028.
Available under License Creative Commons Attribution-NoDerivs.


Abstract

In the current era of digital technology, large amounts of data are collected as learners interact with digital educational tools designed to teach and practise new skills. These data, referred to as ‘log files’, have drawn interdisciplinary attention from researchers in psychology, education, data science, and statistics. Their value lies in offering potential insights into children’s learning processes at both the micro-level of individual sessions and the macro-level of longitudinal development across multiple sessions. At the within-session level, log files capture fine-grained behavioural metrics, such as response time, answer correctness, reattempt strategies, and early exits before session completion. These metrics enable researchers to investigate real-time learning processes, such as how students recover from mistakes, whether they persist through multiple failures, or whether they disengage, as reflected in repeated guessing followed by consecutive failures and early exits. When examined across sessions, log files record metrics such as engagement patterns, skill development trajectories, session frequency, total time spent, and aggregated performance indicators (e.g., accuracy rates, reattempt rates, sum scores, and response speed). This macro-level analysis, often spanning months, semesters, or even years, can identify individual differences in engagement levels, learning trajectories, and skill transitions. This thesis investigates how log files from a digital educational app designed to develop reading skills through various individual games can be analysed to better understand reading development. The central innovation of this work is methodological: it applies and develops new statistical methods and analytical frameworks tailored to model these rich and complex data.
The thesis is structured into six chapters, with each empirical chapter (Chapters 3-5) focusing on a distinct use and advancement of statistical techniques, developed through the lens of a digital reading app and applied to the log files. These methods are used to capture and analyse fine-grained information (such as distinct reader profiles, the impact of reattempt behaviour, transitions between proficiency groups, and the effects of covariates on these transitions) to reveal how reading skills evolve over time. Moreover, the framework’s adaptability to other digital learning settings illustrates its broader utility for investigating learning behaviours and learning outcomes across diverse learner populations, ultimately providing a data-driven approach to understanding and supporting skill acquisition through techniques such as clustering, latent variable modelling, and longitudinal analysis. Chapter 1 introduces the thesis, providing the rationale, a review of relevant literature, methodologies and theories, and an overview of the dataset used in this research. It provides the critical methodological foundation for the empirical chapters (Chapters 3-5). Chapter 2 is a general methodological chapter that outlines the statistical methods employed across the thesis, the broader rationale for choosing those methods in light of current literature, and how they help address the thesis’s research questions. The first empirical chapter, Chapter 3, employs unsupervised learning techniques, specifically cluster analysis, to identify distinct reader groups from unstructured log files. It examines the relationships between early literacy ability, in-game behaviours, and out-of-game performance. Chapter 4 adopts sequential item response theory and mixed-effects models to generate continuous proficiency scores. These scores are used to track student performance over time across four digital games designed to support reading skills.
Chapter 5 develops and applies an innovative method that extends cognitive diagnosis models into a longitudinal framework and integrates them with latent transition models incorporating individual-level covariate effects. This approach captures changes in discrete proficiency mastery status and proficiency profiles in a digital learning environment and evaluates the covariate effects on both the initial mastery status and transition probabilities over time. By adopting this dynamic framework, the chapter sheds light on profile transitions and progression during interactions with digital tools. Finally, Chapter 6 presents the general discussion, considers the limitations, outlines future research directions, and concludes the thesis. Taken together, these studies highlight the unique advantages and potential of log file data for advancing both theory and practice in early literacy education, driven by cross-disciplinary research evidence. By analysing learning processes in fine detail across a range of metrics, such approaches provide insights that standard classroom assessment and monitoring cannot offer. Beyond offering a unique perspective on skill acquisition across time, the comprehensive analyses of log file data establish a solid foundation for future research, facilitating data-driven investigations into early reading development. Moreover, the methodological contributions in this thesis may be generalisable to other digital learning contexts beyond reading.

Item Type:
Thesis (PhD)
Uncontrolled Keywords:
Research Output Funding/yes_externally_funded
ID Code:
236975
Deposited By:
Deposited On:
11 May 2026 16:15
Refereed?:
No
Published?:
Published
Last Modified:
11 May 2026 21:40