Faculty Advisor or Committee Member

Elke A. Rundensteiner, Advisor

Identifier

etd-121218-215328

Abstract

Similarity search is fundamental to many machine learning and data analytics applications, and distance metric learning plays an important role in it. However, modern online applications continuously produce objects whose characteristics change over time, so state-of-the-art similarity search methods based on distance metric learning tend to fail when deployed in such applications without taking this change into account.

In this work, we propose a Distance Metric Learning-based Continuous Similarity Search approach (CSS for short) to account for the dynamic nature of such data. The CSS system adopts an online metric learning model so that the distance metric evolves with the continuous data without incurring large latency. To improve the accuracy of the online metric learning model, CSS dynamically maintains a compact labeled dataset that is representative of the updated data. To accelerate similarity search, CSS also includes a Locality Sensitive Hashing (LSH) index that is maintained online.

First, our labeled data update strategy progressively enriches the labeled data to ensure continued representativeness, without letting its size grow excessively, so that the computation costs of metric learning remain bounded. Second, our continuous distance metric learning strategy ensures that each update requires only one linear-time k-NN search, in contrast to the cubic time complexity of relearning the distance metric from scratch. Third, our LSH update mechanism leverages our theoretical insight that an LSH index built on the original distance metric is equally effective in supporting similarity search under the new distance metric, as long as the transform matrix learned for the new distance metric is invertible. This observation allows CSS to avoid modifying the LSH index in most cases. Our experimental study using real-world public datasets and large synthetic datasets confirms the effectiveness of CSS in improving the accuracy of classification and information retrieval tasks. In addition, our incremental distance metric learning strategy (and its three underlying components) achieves a speedup of three orders of magnitude over state-of-the-art methods.
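As background for the third point, the following is a minimal sketch of two standard linear-algebra identities that relate a learned metric to its transform matrix. It is not the thesis's proof or implementation. It assumes the learned metric is a Mahalanobis-style distance with M = L^T L and that the hash is a random-hyperplane (sign) hash; the abstract states neither. The matrix L, the points, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    # Matrix-vector product A x, with A given as a list of rows.
    return [dot(row, x) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[dot(row, col) for col in Bt] for row in A]

def metric_distance(x, y, M):
    # sqrt((x - y)^T M (x - y)): distance under a learned matrix M.
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(dot(d, matvec(M, d)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hyperplane_bit(r, x):
    # One bit of a random-hyperplane hash: which side of r the point lies on.
    return 1 if dot(r, x) >= 0 else 0

# Hypothetical invertible transform (determinant 2) and its metric M = L^T L.
L = [[2.0, 1.0], [0.0, 1.0]]
M = matmul(transpose(L), L)
x, y = [1.0, 2.0], [3.0, -1.0]

# Identity 1: the learned distance equals the Euclidean distance
# between the transformed points L x and L y.
d_metric = metric_distance(x, y, M)
d_transformed = euclidean(matvec(L, x), matvec(L, y))

# Identity 2: hashing the transformed point L x with hyperplane r gives the
# same bit as hashing the original point x with hyperplane L^T r.
hyperplanes = [[1.0, -3.0], [0.5, 2.0], [-1.0, 0.25]]
bits_agree = all(
    hyperplane_bit(r, matvec(L, x)) == hyperplane_bit(matvec(transpose(L), r), x)
    for r in hyperplanes
)
```

Under these assumptions, a change of metric amounts to a change of linear transform, and a hyperplane hash over transformed points can be expressed as a hash over the original points. How CSS actually exploits invertibility to keep its index unchanged is described in the thesis itself.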

Publisher

Worcester Polytechnic Institute

Degree Name

MS

Department

Data Science

Project Type

Thesis

Date Accepted

2018-12-12

Accessibility

Restricted-WPI community only

Subjects

distance metric learning, LSH
