Pivot-based Data Partitioning for Distributed k Nearest Neighbor Mining
This thesis addresses the need for a scalable distributed solution for k-nearest-neighbor (kNN) search, a fundamental data mining task. This unsupervised method poses particular challenges on shared-nothing distributed architectures, where global information about the dataset is not available to individual machines. The distance to search for neighbors is not known a priori, and therefore a dynamic data partitioning strategy is required to guarantee that exact kNN can be found autonomously on each machine. Pivot-based partitioning has been shown to facilitate bounding of partitions, however state-of-the-art methods suffer from prohibitive data duplication (upwards of 20x the size of the dataset). In this work an innovative method for solving exact distributed kNN search called PkNN is presented. The key idea is to perform computation over several rounds, leveraging pivot-based data partitioning at each stage. Aggressive data-driven bounds limit communication costs, and a number of optimizations are designed for efficient computation. Experimental study on large real-world data (over 1 billion points) compares PkNN to the state-of-the-art distributed solution, demonstrating that the benefits of additional stages of computation in the PkNN method heavily outweigh the added I/O overhead. PkNN achieves a data duplication rate close to 1, significant speedup over previous solutions, and scales effectively in data cardinality and dimension. PkNN can facilitate distributed solutions to other unsupervised learning methods which rely on kNN search as a critical building block. As one example, a distributed framework for the Local Outlier Factor (LOF) algorithm is given. Testing on large real-world and synthetic data with varying characteristics measures the scalability of PkNN and the distributed LOF framework in data size and dimensionality.