Faculty Advisor

Stanley Selkow

Faculty Advisor

Micha Hofri

Faculty Advisor

Carolina Ruiz

Faculty Advisor

Elizabeth Ryder

Abstract

The main goal of this thesis work was to develop, implement and evaluate an algorithm that enables mining association rules from datasets that contain quantified distance information among the items. This was accomplished by extending and enhancing the Apriori Algorithm, which is the standard algorithm to mine association rules. The Apriori algorithm is not able to mine association rules that contain distance information among the items that construct the rules. This thesis enhances the main Apriori property by requiring itemsets forming rules to“deviate properly" in addition to satisfying the minimal support threshold. We say that an itemset deviates properly if all combinations of pair-wise distances among the items are highly conserved in the dataset instances where these items occur. This thesis introduces the notion of proper deviation and provides the precise procedure and measures that characterize it. Integrating the notion of distance preserving frequent itemset and proper deviation into the standard Apriori algorithm leads to the construction of our Distance-Based Association Rule Mining (DARM) algorithm. DARM can be applied in data mining and knowledge discovery from genetic, financial, retail, time sequence data, or any domain where the distance information between items is of importance. This thesis chose the area of gene expression and regulation in eukaryotic organisms as the application domain. The data from the domain was used to produce DARM rules. Sets of those rules were used for building predictive models. The accuracy of those models was tested. In addition, predictive accuracies of the models built with and without distance information were compared.

Publisher

Worcester Polytechnic Institute

Degree Name

MS

Department

Computer Science

Project Type

Thesis

Date Accepted

2003-05-06

Accessibility

Unrestricted

Subjects

spatial data mining, distance-based association rules, distance-based Apriori algorithm, Data mining, Gene expression, Data processing, Eukaryotic cells

Share

COinS