Peng, Hankui and Pavlidis, Nicos and Eckley, Idris (2020) Subspace Clustering and Active Learning with Constraints. PhD thesis, Lancaster University.
Abstract
Data representations are often high-dimensional, either because a large number of features are collected or recorded, or because of how the data sources (e.g. images, texts) are processed. The main structure of such data can often be summarised well in one or more lower-dimensional subspaces. Subspace clustering addresses the problem of simultaneously uncovering multiple subspace structures in the data and grouping the data according to their underlying subspaces.

The first contribution of this thesis is the development of a Subspace Clustering with Active Learning (SCAL) framework that is designed for Subspace Clustering. This framework allows clustering performance to improve effectively and efficiently over time while querying only a small amount of labelling information. It also has the potential to be applied to more general subspace clustering methods, a direction explored and developed further in our next methodological contribution.

The second contribution of this thesis is a unified active learning and constrained clustering framework for spectral-based subspace clustering methods. In this work, we propose a spectral-based subspace clustering methodology named Weighted Sparse Simplex Representation (WSSR), which performs favourably against state-of-the-art spectral-based subspace clustering methods on both synthetic and real data. We also propose a flexible weighting scheme that can incorporate external information into the problem formulation, which leads to a constrained clustering extension of WSSR. We show that this extension can be applied in conjunction with our previously proposed SCAL strategy when labelling information can be queried sequentially.

The third contribution of this thesis is the development of an algebraic subspace clustering methodology, Minimum Angle Clustering (MAC). It is motivated by the application of clustering Amazon products based on their titles when represented using the TF-IDF matrix, which is both sparse and high-dimensional. The proposed methodology is composed of two stages. In the first stage, it identifies a large number of subspaces in the data through the Reduced Row Echelon Form technique. In the second stage, we propose a new subspace proximity measure to construct an affinity matrix for the formed subspaces, after which spectral clustering is applied to obtain the final cluster labels. The methodology is shown to perform competitively against a number of well-established subspace clustering and document clustering techniques on the application of clustering Amazon product names.
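To make the second stage more concrete, the following is a minimal sketch of the general pattern "subspace proximity, then affinity matrix, then spectral clustering". It is not the thesis's method. The abstract does not define the proposed proximity measure, so the sketch assumes a stand-in: the cosine of the smallest principal angle between two subspaces. The Reduced Row Echelon Form stage is skipped and four toy one-dimensional subspaces are supplied directly. All function names and the toy data are hypothetical, and the two-way split via the Fiedler vector is a simplification of spectral clustering.

```python
import numpy as np


def orthonormal_basis(points):
    """Orthonormal basis (as columns) for the span of the given row vectors."""
    q, _ = np.linalg.qr(np.asarray(points, dtype=float).T)
    return q


def min_principal_angle(basis_a, basis_b):
    """Smallest principal angle (in radians) between two subspaces."""
    # The singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.arccos(np.clip(cosines.max(), -1.0, 1.0)))


def two_way_spectral_split(affinity):
    """Split items into two groups by the sign of the Fiedler vector."""
    degree = affinity.sum(axis=1)
    laplacian = np.diag(degree) - affinity
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    _, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs[:, 1] > 0).astype(int)


# Toy input (assumption): four lines in R^3, two near the x-axis and two
# near the z-axis. In the thesis, subspaces come from the first stage instead.
spanning_sets = [
    [[1.0, 0.0, 0.0]],
    [[1.0, 0.1, 0.0]],
    [[0.0, 0.0, 1.0]],
    [[0.0, 0.1, 1.0]],
]
bases = [orthonormal_basis(s) for s in spanning_sets]

# Affinity between subspaces: cosine of the smallest principal angle
# (a stand-in for the proximity measure proposed in the thesis).
n = len(bases)
affinity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        affinity[i, j] = np.cos(min_principal_angle(bases[i], bases[j]))

labels = two_way_spectral_split(affinity)
print(labels)
```

In this toy case the two lines near the x-axis receive one label and the two lines near the z-axis receive the other; which group is labelled 0 or 1 depends on the arbitrary sign of the eigenvector. For more than two clusters, one would typically run k-means on several leading eigenvectors rather than thresholding a single one.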