Faculty Advisor

Randy C. Paffenroth

Faculty Advisor

Elke A. Rundensteiner

Identifier

etd-042716-231225

Abstract

Current real-world applications are generating a large volume of datasets that are often continuously updated over time. Detecting outliers on such evolving datasets requires us to continuously update the result. Furthermore, the response time is very important for these time critical applications. This is challenging. First, the algorithm is complex; even mining outliers from a static dataset once is already very expensive. Second, users need to specify input parameters to approach the true outliers. While the number of parameters is large, using a trial and error approach online would be not only impractical and expensive but also tedious for the analysts. Worst yet, since the dataset is changing, the best parameter will need to be updated to respond to user exploration requests. Overall, the large number of parameter settings and evolving datasets make the problem of efficiently mining outliers from dynamic datasets very challenging. Thus, in this thesis, we design an exploration framework for detecting outliers in data streams, called EFO, which enables analysts to continuously explore anomalies in dynamic datasets. EFO is a continuous lightweight preprocessing framework. EFO embraces two optimization principles namely "best life expectancy" and "minimal trial," to compress evolving datasets into a knowledge-rich abstraction of important interrelationships among data. An incremental sorting technique is also used to leverage the almost ordered lists in this framework. Thereafter, the knowledge abstraction generated by EFO not only supports traditional outlier detection requests but also novel outlier exploration operations on evolving datasets. Our experimental study conducted on two real datasets demonstrates that EFO outperforms state-of-the-art technique in terms of CPU processing costs when varying stream volume, velocity and outlier rate.

Publisher

Worcester Polytechnic Institute

Degree Name

MS

Department

Data Science

Project Type

Thesis

Date Accepted

2016-04-27

Accessibility

Unrestricted

Subjects

outliers, data streams, streaming window, outlier detection, distance-based outlier

Share

COinS