Evaluating Understanding in Cross-modal Multi-encoder Models

Mander, Stephen and Piao, Scott and Rahmani, Hossein (2025) Evaluating Understanding in Cross-modal Multi-encoder Models. PhD thesis, Computing and Communications.

Text (2025manderphd)
Stephen_Mander_PhD_Thesis_21_.pdf - Published Version
Available under License Creative Commons Attribution.


Abstract

Ensuring that AI benefits everyone, not just the anglophone world or large companies, while remaining sustainable, spans a multitude of problems in computing. Low-resource language processing is an area of NLP that studies the difficulty of training models with minimal data, annotation, or resources (which can include hardware and financial constraints). Limits on data quantity challenge many training approaches in machine learning, reducing models' ability to generalise and to understand text. This thesis discusses how multimodal models (those that can learn from both images and text) can provide much-needed grounding for the problem of understanding text. The introduction approaches this through the lens of detecting hate speech, a problem that typically disadvantages low-resource domains. After establishing the merits of autoencoder approaches, the main body of work focuses on a training paradigm that makes large-scale approaches replicable. Subsequent chapters consider alternative perspectives, examining the viability of the assignment assumptions being made to further boost gradients. The thesis addresses some of these problems by exploring the deeper rules underlying these generative methods. Learning low-resource languages is treated in this work as a challenge of scale: a paradigm is presented for training in an ecologically feasible and academically affordable manner. By re-engineering previous methods, scale is addressed and a new algorithm for efficient training is presented. By altering the calculations and demonstrating how to add an additional encoder, this work provides a stepping stone towards a greener, academically viable, and open model for low-resource domains.
By addressing novel approaches to training for low-resource languages, this thesis also explores a different way to view training: as an assignment problem that can be abstracted and approximated, yielding efficiency improvements for the more general training frameworks of both NLP and computer vision.
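As a rough illustration (not drawn from the thesis itself), the assignment-problem framing mentioned above can be sketched as follows: given a batch of items from two modalities and a pairwise cost matrix, training-time matching reduces to finding the permutation that minimises total cost. The brute-force solver and the cost values below are hypothetical; real systems would use the Hungarian algorithm or a cheaper approximation.

```python
# Illustrative sketch only: cross-modal matching as an assignment problem.
# We pair image i with text perm[i] so that the summed pairwise cost is
# minimal, solving by brute force over permutations for a tiny batch.
from itertools import permutations

def best_assignment(cost):
    """Return (assignment, total_cost) minimising sum of cost[i][perm[i]]."""
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total

# Hypothetical cost matrix: cost[i][j] = distance between image i and text j.
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.3],
]
perm, total = best_assignment(cost)
# perm pairs each image with its cheapest compatible caption.
```

An exact search like this is factorial in the batch size, which is why, as the abstract suggests, abstracting and approximating the assignment step is where efficiency gains for large-scale training would come from.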

Item Type:
Thesis (PhD)
ID Code:
229087
Deposited By:
Deposited On:
28 Apr 2025 15:50
Refereed?:
No
Published?:
Published
Last Modified:
11 May 2025 00:03