Faculty Advisor or Committee Member
Elke A. Rundensteiner, Advisor
Similarity search is fundamental to many machine learning and data analytics applications, and distance metric learning plays an important role in it. However, modern online applications continuously produce objects whose characteristics change over time, so state-of-the-art similarity search based on distance metric learning tends to fail in such applications unless it takes this change into account.
In this work, we propose a Distance Metric Learning-based Continuous Similarity Search approach (CSS for short) to account for the dynamic nature of such data. The CSS system adopts an online metric learning model so that the distance metric evolves to adapt to the dynamic nature of continuous data without large latency. To improve the accuracy of the online metric learning model, CSS dynamically maintains a compact labeled dataset that is representative of the updated data. Also, to accelerate similarity search, CSS includes a Locality Sensitive Hashing (LSH) index that is maintained online.
CSS rests on three key strategies. One, our labeled data update strategy progressively enriches the labeled data to assure continued representativeness, yet without excessively growing its size, so that the computation costs of metric learning remain bounded. Two, our continuous distance metric learning strategy ensures that each update requires only one linear-time k-NN search, in contrast to the cubic time complexity of relearning the distance metric from scratch. Three, our LSH update mechanism leverages our theoretical insight that the LSH built on the original distance metric is equally effective in supporting similarity search under the new distance metric, as long as the transform matrix learned for the new distance metric is reversible (invertible). This important observation allows CSS to avoid modifying the LSH index in most cases. Our experimental study using real-world public datasets and large synthetic datasets confirms the effectiveness of CSS in improving the accuracy of classification and information retrieval tasks. Also, our incremental distance metric learning strategy (and its three underlying components) achieves a speedup of 3 orders of magnitude over state-of-the-art methods.
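The abstract does not give the argument behind the LSH observation, so the following is only a minimal sketch of one standard linear-algebra fact that makes the role of invertibility plausible; it is not the thesis's proof or implementation. It assumes the learned metric has the usual form d_L(x, y) = ||Lx - Ly|| for a transform matrix L. When L is invertible its smallest singular value is strictly positive, so every learned distance lies between s_min and s_max times the original Euclidean distance. The matrix `L`, the sample points, and all function names below are hypothetical choices made for illustration.

```python
import math
import random


def matvec(L, x):
    """Multiply matrix L (a list of rows) by vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def euclidean(x, y):
    """Plain Euclidean distance, standing in for the original metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def learned_distance(L, x, y):
    """Distance induced by the transform L: ||L x - L y||."""
    return euclidean(matvec(L, x), matvec(L, y))


def singular_values_2x2(L):
    """Singular values of a 2x2 matrix, via the eigenvalues of L^T L."""
    (a, b), (c, d) = L
    # Entries of the symmetric matrix M = L^T L.
    m11 = a * a + c * c
    m12 = a * b + c * d
    m22 = b * b + d * d
    mean = (m11 + m22) / 2
    radius = math.sqrt(((m11 - m22) / 2) ** 2 + m12 ** 2)
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))


def distortion_is_bounded(points, L, s_min, s_max, tol=1e-9):
    """Check s_min * d(x, y) <= d_L(x, y) <= s_max * d(x, y) for all pairs."""
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            d_old = euclidean(x, y)
            d_new = learned_distance(L, x, y)
            if not (s_min * d_old - tol <= d_new <= s_max * d_old + tol):
                return False
    return True


# Hypothetical invertible transform (determinant 2), chosen for illustration.
L = [[2.0, 0.5], [0.0, 1.0]]
s_max, s_min = singular_values_2x2(L)

rng = random.Random(0)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
bounds_hold = distortion_is_bounded(points, L, s_min, s_max)
```

Because `s_min` is positive exactly when `L` is invertible, points that are close under the original metric stay close under the learned one, up to a bounded factor. If `L` were singular, `s_min` would be zero and distinct points could collapse to distance zero, which is one intuition for why the invertibility condition appears in the abstract.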
Worcester Polytechnic Institute
All authors have granted to WPI a nonexclusive royalty-free license to distribute copies of the work, subject to other agreements. Copyright is held by the author or authors, with all rights reserved, unless otherwise noted.
Restricted-WPI community only
Zhang, Hauyi, "Similarity Search in Continuous Data with Evolving Distance Metric" (2018). Masters Theses (All Theses, All Years). 1253.
distance metric learning lsh