Kerim, Abdulrahman and Soriano Marcolino, Leandro and Jiang, Richard (2024) Synthetic Data for Machine Learning. PhD thesis, Lancaster University.
Abstract
Supervised machine learning methods require large-scale training datasets to converge. Collecting and annotating training data is expensive, time-consuming, error-prone, and not always practical. Synthetic data is often used as a feasible data source to increase the amount of training data. However, directly using synthetic data may actually harm the model's performance, or may not be as effective as it could be. This thesis addresses the challenges of generating large-scale synthetic data, improving domain adaptation in semantic segmentation, advancing video stabilization in adverse conditions, and rigorously assessing the usability of synthetic data in classification tasks. By contributing novel solutions to these multifaceted problems, this work bolsters the field of computer vision, offering strong foundations for a broad range of applications that utilize synthetic data for computer vision tasks. We divide the study into three main problems: (i) tackling the problem of generating diverse and photorealistic synthetic data; (ii) exploring synthetic-aware computer vision solutions for semantic segmentation and video stabilization; and (iii) assessing the usability of synthetically generated data for different computer vision tasks. We developed a new synthetic data generator called Silver. Photo-realism, diversity, scalability, and full 3D virtual world generation at run-time are its key aspects. Photo-realism was approached by utilizing the state-of-the-art High Definition Render Pipeline (HDRP) of the Unity game engine. In parallel, the Procedural Content Generation (PCG) concept was employed to create a full 3D virtual world at run-time, while the scalability (expansion and adaptability) of the system was attained by taking advantage of the modular approach followed as we built the system from scratch.
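To illustrate the PCG idea in the abstract above, the following is a minimal, hypothetical sketch (not Silver's actual implementation, which runs inside Unity): each scene is sampled from a seeded random generator, so the same seed reproduces the same virtual world, and ground truth comes for free because every object's parameters are known at generation time. The parameter names (`WEATHERS`, `buildings`, etc.) are illustrative assumptions.

```python
import random

# Hypothetical scene parameter spaces; Silver's actual space is far richer.
WEATHERS = ["clear", "rain", "fog", "snow"]
TIMES_OF_DAY = ["dawn", "noon", "dusk", "night"]

def generate_scene(seed):
    """Sample one virtual-world configuration, PCG-style.

    A fixed seed always reproduces the same scene, and because every
    object's pose is chosen here, per-object ground truth (position,
    size, class) is available automatically.
    """
    rng = random.Random(seed)
    return {
        "weather": rng.choice(WEATHERS),
        "time_of_day": rng.choice(TIMES_OF_DAY),
        "buildings": [
            {"x": rng.uniform(-50, 50), "z": rng.uniform(-50, 50),
             "height": rng.uniform(5, 30)}
            for _ in range(rng.randint(3, 10))
        ],
    }

# Scaling to a large dataset is just enumerating seeds.
dataset = [generate_scene(seed) for seed in range(1000)]
```

This seeded-sampling design is what makes such generators scalable and reproducible: dataset size is controlled by the seed range, and any scene can be regenerated exactly for debugging or re-rendering.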
Silver can be used to provide clean, unbiased, and large-scale training and testing data for various computer vision tasks. Regarding synthetic-aware computer vision models, we developed a novel architecture specifically designed to use synthetic training data for semantic segmentation domain adaptation. We propose a simple yet powerful addition to DeepLabV3+: weather and time-of-day supervisors trained with multi-task learning, making the model both weather- and nighttime-aware, which improves its mIoU under adverse conditions while maintaining adequate performance under standard conditions. Similarly, we propose a synthetic-aware adverse-weather video stabilization algorithm that dispenses with real data for training, relying solely on synthetic data. Our approach leverages specially generated synthetic data to avoid the feature extraction issues faced by current methods. To achieve this, we used our novel data generator to produce the required training data with an automatic ground-truth extraction procedure. We also propose a new dataset called VSAC105Real and compare our method to five recent video stabilization algorithms using two benchmarks. Our method generalizes well to real-world videos across all weather conditions and does not require large-scale synthetic training data. Finally, we assess the usability of the generated synthetic data. We propose a novel usability metric that disentangles photorealism from diversity; this simple yet effective metric ranks synthetic images. The quantitative results show that we can achieve similar or better results by training on 50% less synthetic data. Additionally, we qualitatively assess the impact of photorealism and evaluate many architectures on different datasets to that end.
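As an illustration of what "disentangling photorealism from diversity" can mean in practice, here is a hedged sketch, not the thesis's actual metric: assume each synthetic image carries a per-image photorealism score and a feature embedding, and rank images greedily so that diversity (distance to already-selected images) is scored independently of realism. All names (`SyntheticImage`, `alpha`, the distance heuristic) are assumptions for this sketch.

```python
from dataclasses import dataclass

@dataclass
class SyntheticImage:
    name: str
    realism: float   # hypothetical per-image photorealism score in [0, 1]
    embedding: tuple # hypothetical feature vector used to measure diversity

def diversity(img, selected):
    # Diversity as the minimum distance to already-selected images
    # (a common greedy heuristic; not necessarily the thesis formulation).
    if not selected:
        return 1.0
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(dist(img.embedding, s.embedding) for s in selected)

def rank_and_select(images, budget, alpha=0.5):
    """Greedily pick `budget` images, trading off realism vs. diversity.

    Because the two terms are separate, one can see whether an image is
    chosen for looking real or for covering new content.
    """
    selected, pool = [], list(images)
    while pool and len(selected) < budget:
        best = max(pool, key=lambda im: alpha * im.realism
                                        + (1 - alpha) * diversity(im, selected))
        selected.append(best)
        pool.remove(best)
    return selected
```

With such a ranking one could train on only the top-ranked half of a synthetic set, which is the spirit of the abstract's finding that 50% less synthetic data can match or beat full-data training.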