The main goal of this thesis was to develop, implement, and evaluate an algorithm for mining association rules from datasets that contain quantified distance information among the items. This was accomplished by extending the Apriori algorithm, the standard algorithm for mining association rules. Apriori by itself cannot mine association rules that carry distance information among the items that make up the rules. This thesis strengthens the main Apriori property by requiring that itemsets forming rules "deviate properly" in addition to satisfying the minimum support threshold. We say that an itemset deviates properly if all pairwise distances among its items are highly conserved across the dataset instances in which those items occur. This thesis introduces the notion of proper deviation and provides the precise procedure and measures that characterize it. Integrating the notions of distance-preserving frequent itemsets and proper deviation into the standard Apriori algorithm yields our Distance-Based Association Rule Mining (DARM) algorithm. DARM can be applied to data mining and knowledge discovery in genetic, financial, retail, and time-sequence data, or in any domain where the distance between items is important. This thesis uses gene expression and regulation in eukaryotic organisms as its application domain. Data from this domain was used to produce DARM rules, sets of those rules were used to build predictive models, and the accuracy of those models was tested. In addition, the predictive accuracies of models built with and without distance information were compared.
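The extended candidate test described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state the exact measure of proper deviation, so this sketch assumes that an itemset deviates properly when the population standard deviation of every pairwise distance, taken over the instances containing the itemset, stays at or below a threshold. The names `supporting_instances`, `deviates_properly`, `is_candidate`, and `max_dev`, the item-to-position encoding of instances, and the sample data are all hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def supporting_instances(itemset, instances):
    """Return the instances that contain every item of the itemset."""
    return [inst for inst in instances if all(item in inst for item in itemset)]


def deviates_properly(itemset, instances, max_dev):
    """Assumed criterion: every pairwise distance is conserved, meaning its
    population standard deviation over the supporting instances is at most
    max_dev. The thesis defines its own measures; this is a stand-in."""
    rows = supporting_instances(itemset, instances)
    if not rows:
        return False
    for a, b in combinations(sorted(itemset), 2):
        distances = [abs(inst[a] - inst[b]) for inst in rows]
        if pstdev(distances) > max_dev:
            return False
    return True


def is_candidate(itemset, instances, min_support, max_dev):
    """Extended Apriori test: minimum support AND proper deviation."""
    rows = supporting_instances(itemset, instances)
    support = len(rows) / len(instances)
    return support >= min_support and deviates_properly(itemset, instances, max_dev)


# Hypothetical data: each instance maps an item to its position.
instances = [
    {"A": 10, "B": 25, "C": 90},
    {"A": 40, "B": 56, "C": 60},
    {"A": 5, "B": 20, "C": 200},
    {"A": 70, "B": 84},
]

ab_ok = is_candidate({"A", "B"}, instances, min_support=0.5, max_dev=2.0)
ac_ok = is_candidate({"A", "C"}, instances, min_support=0.5, max_dev=2.0)
```

In this made-up data, items A and B always lie 14 to 16 positions apart, so the pair passes both tests. A and C co-occur in three of the four instances, which satisfies the support threshold, but their distance varies widely, so the pair is rejected on the distance criterion alone.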
Worcester Polytechnic Institute
All authors have granted to WPI a nonexclusive royalty-free license to distribute copies of the work. Copyright is held by the author or authors, with all rights reserved, unless otherwise noted. If you have any questions, please contact firstname.lastname@example.org.
Icev, Aleksandar, "DARM: Distance-Based Association Rule Mining" (2003). Masters Theses (All Theses, All Years). 773.
spatial data mining, distance-based association rules, distance-based Apriori algorithm, Data mining, Gene expression, Data processing, Eukaryotic cells