Lancaster EPrints

Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data.

Yates, Katie and Pavlidis, Nicos and Sherlock, Christopher (2018) Low-Density Cluster Separators for Large, High-Dimensional, Mixed and Non-Linearly Separable Data. PhD thesis, Lancaster University.

PDF (2018katieyatesphd) - Published Version
Available under License Creative Commons Attribution-NonCommercial.

    Abstract

    Locating groups of similar observations (clusters) in data is a well-studied problem with many practical applications. There is a wide range of approaches to clustering, which rely on different definitions of similarity and are appropriate for datasets with different characteristics. Despite a rich literature, a number of open problems remain in clustering, and existing algorithms have limitations. This thesis develops methodology for clustering high-dimensional, mixed datasets with complex clustering structures. It uses low-density cluster separators, which bi-partition a dataset with a cluster boundary that passes through regions of minimal density and thereby separates the regions of high probability density associated with clusters. The bi-partitions arising from a succession of minimum density cluster separators are combined using divisive hierarchical and partitional algorithms to locate a complete clustering while estimating the number of clusters. The proposed algorithms locate cluster separators using one-dimensional, arbitrarily oriented subspaces, circumventing the challenges associated with clustering in high-dimensional spaces. This requires continuous observations; thus, to extend the applicability of the proposed algorithms to mixed datasets, methods for producing an appropriate continuous representation of datasets containing non-continuous features are investigated. The exact evaluation of the density intersected by a cluster boundary is restricted to linear separators. This limitation is lifted by a non-linear mapping of the original observations into a feature space, in which a linear separator permits the correct identification of non-linearly separable clusters in the original dataset. In large, high-dimensional datasets, searching for a one-dimensional subspace that yields a minimum density separator is computationally expensive. Therefore, a computationally efficient approach to low-density cluster separation using approximately optimal projection directions is proposed, which searches over a collection of one-dimensional random projections for an appropriate subspace for cluster identification. The proposed approaches produce high-quality partitions that are competitive with well-established and state-of-the-art algorithms.
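The random-projection idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm: it assumes a Gaussian kernel density estimate, Silverman's rule-of-thumb bandwidth, a trimmed search interval, and raw density as the selection criterion. The function names (`kde_1d`, `best_random_separator`) and all parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def kde_1d(points, grid, bandwidth):
    """Gaussian kernel density estimate of `points`, evaluated on `grid`."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    norm = len(points) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def best_random_separator(X, n_projections=200, n_grid=200, trim=0.1, seed=0):
    """Search random unit directions for the hyperplane v.x = b of lowest density.

    Assumptions (not taken from the thesis): Silverman bandwidth, split points
    restricted to the central quantiles of each projection, and projections
    compared by their raw minimum density.
    """
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_projections):
        # Draw a random direction and normalise it to unit length.
        v = rng.standard_normal(X.shape[1])
        v /= np.linalg.norm(v)
        p = X @ v  # one-dimensional projection of the data
        h = 1.06 * p.std() * len(p) ** (-0.2)  # Silverman's rule of thumb
        # Trim the tails so the separator cannot sit at the edge of the data.
        lo, hi = np.quantile(p, [trim, 1.0 - trim])
        grid = np.linspace(lo, hi, n_grid)
        dens = kde_1d(p, grid, h)
        i = int(dens.argmin())
        if best is None or dens[i] < best[0]:
            best = (dens[i], v, grid[i])
    density, v, b = best
    labels = (X @ v > b).astype(int)  # bi-partition induced by the separator
    return v, b, density, labels
```

Applying such a function recursively to each side of the split would give a simple divisive hierarchy, although the thesis's stopping rules and its method for estimating the number of clusters are not reproduced here.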

    Item Type: Thesis (PhD)
    Departments: Faculty of Science and Technology > Mathematics and Statistics
    Lancaster University Management School > Management Science
    ID Code: 89488
    Deposited By: ep_importer_pure
    Deposited On: 09 Jan 2018 11:50
    Refereed?: No
    Published?: Published
    Last Modified: 14 Oct 2018 02:09
    URI: http://eprints.lancs.ac.uk/id/eprint/89488
