Similarity Search in Continuous Data with Evolving Distance Metric

Similarity search is fundamental to many machine learning and data analytics applications, and distance metric learning plays an important role in it. However, modern online applications continuously produce objects with new characteristics that tend to change over time, so state-of-the-art similarity search based on distance metric learning tends to fail when deployed in such applications without taking this change into consideration. In this work, we propose a Distance Metric Learning-based Continuous Similarity Search approach (CSS for short) that accounts for the dynamic nature of such data. The CSS system adopts an online metric learning model so that the distance metric evolves with the continuous data without incurring large latency. To improve the accuracy of the online metric learning model, CSS dynamically maintains a compact labeled dataset that is representative of the updated data. To accelerate similarity search, CSS also includes an online-maintained Locality Sensitive Hashing (LSH) index. CSS rests on three strategies. One, our labeled data update strategy progressively enriches the labeled data to ensure continued representativeness without excessively growing its size, so that the computation costs of metric learning remain bounded. Two, our continuous distance metric learning strategy ensures that each update requires only one linear-time k-NN search, in contrast to the cubic time complexity of relearning the distance metric from scratch. Three, our LSH update mechanism leverages our theoretical insight that an LSH index built on the original distance metric is equally effective in supporting similarity search under the new distance metric, as long as the transform matrix learned for the new distance metric is invertible. This important observation allows CSS to avoid modifying the LSH index in most cases.
Our experimental study using real-world public datasets and large synthetic datasets confirms the effectiveness of CSS in improving the accuracy of classification and information retrieval tasks. In addition, the incremental distance metric learning strategy of CSS (and its three underlying components) achieves a speedup of three orders of magnitude over state-of-the-art methods.
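
The role of the transform matrix mentioned in the abstract can be illustrated with a small sketch. It is not the CSS implementation. It assumes a Mahalanobis-style metric defined by a transform matrix L (with M = L^T L), which the abstract does not spell out, and it shows only the standard fact that distance under M equals Euclidean distance after mapping points through L. The function names, the 2x2 matrix, and the sample points are hypothetical.

```python
import heapq
import math

def matvec(L, x):
    """Apply the transform matrix L (list of rows) to the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in L]

def euclidean(a, b):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def metric_matrix(L):
    """Build M = L^T L, the matrix form of the (assumed) learned metric."""
    n = len(L[0])
    return [[sum(row[i] * row[j] for row in L) for j in range(n)]
            for i in range(n)]

def mahalanobis(M, x, y):
    """Distance sqrt((x - y)^T M (x - y)) under the metric matrix M."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return math.sqrt(sum(di * mi for di, mi in zip(d, matvec(M, d))))

def det2(L):
    """Determinant of a 2x2 matrix; nonzero means L is invertible."""
    return L[0][0] * L[1][1] - L[0][1] * L[1][0]

def knn(query, points, k, L):
    """Single-pass k-NN scan: indices of the k nearest points under L."""
    q = matvec(L, query)
    return heapq.nsmallest(
        k, range(len(points)),
        key=lambda i: euclidean(q, matvec(L, points[i])))

# Hypothetical transform matrix and data, for illustration only.
L = [[2.0, 0.0], [1.0, 1.0]]
M = metric_matrix(L)
x, y = [1.0, 2.0], [3.0, 1.0]
points = [[0.0, 0.0], [3.0, 1.0], [1.0, 2.5]]

same = abs(mahalanobis(M, x, y) - euclidean(matvec(L, x), matvec(L, y))) < 1e-9
nearest = knn(x, points, 2, L)
```

Here `same` checks that both ways of computing the distance agree, and `det2(L) != 0` is the invertibility condition for this 2x2 case. How CSS reuses its LSH index under an updated metric is described in the thesis itself and is not reproduced here.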

Creator
Contributors
Degree
Unit
Publisher
Language
  • English
Identifier
  • etd-121218-215328
Keyword
Advisor
Defense date
Year
  • 2018
Date created
  • 2018-12-12
Resource type
Rights statement

Permanent link to this page: https://digital.wpi.edu/show/3j3332367