We introduce a novel data mining technique for the analysis of gene expression. Gene expression is the effective production of the protein that a gene encodes. We focus on characterizing the expression patterns of genes based on their promoter regions. The promoter region of a gene contains short sequences, called motifs, to which gene regulatory proteins may bind, thereby controlling when and in which cell types the gene is expressed. Our approach addresses two important aspects of gene expression analysis: (1) binding of proteins at more than one motif is usually required, and several different types of proteins may need to bind several different types of motifs in order to confer transcriptional specificity; (2) since proteins controlling transcription may need to interact physically, the order and spacing in which motifs occur can affect expression. To address the first, combinatorial aspect, we use association rules, which can involve multiple motifs and predict expression in multiple cell types. To address the second aspect, we enhance association rules with information about the distances among the motifs, or items, that are present in the rule. Rules of interest are those whose set of motifs deviates properly, i.e., sets of motifs whose pairwise distances are highly conserved in the promoter regions where these motifs occur. We describe the design, implementation, and evaluation of our Distance-based Association Rule Mining (DARM) algorithm for mining those rules. We show that these distance-based rules achieve higher classification performance than standard association rules on two real datasets.
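To make the idea of conserved pairwise distances concrete, the following is a minimal Python sketch and not the DARM implementation itself. It assumes each promoter is represented as a mapping from motif name to a single start position, and it assumes that conservation can be measured by the coefficient of variation of each pairwise distance against a cutoff. The toy data, the function names, and the `max_cv` threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical toy data: each promoter maps a motif name to its start position (bp).
promoters = [
    {"M1": 10, "M2": 50, "M3": 120},
    {"M1": 15, "M2": 56, "M3": 200},
    {"M1": 30, "M2": 69},
    {"M2": 40, "M3": 90},
]

def containing(motifs, promoters):
    """Return the promoters that contain every motif in `motifs`."""
    return [p for p in promoters if all(m in p for m in motifs)]

def support(motifs, promoters):
    """Fraction of promoters that contain every motif in `motifs`."""
    return len(containing(motifs, promoters)) / len(promoters)

def distance_variation(motifs, promoters):
    """Coefficient of variation of each pairwise motif distance,
    computed over the promoters that contain all the motifs."""
    hits = containing(motifs, promoters)
    variation = {}
    if not hits:
        return variation
    for a, b in combinations(sorted(motifs), 2):
        dists = [abs(p[a] - p[b]) for p in hits]
        mu = mean(dists)
        variation[(a, b)] = pstdev(dists) / mu if mu else float("inf")
    return variation

def distances_conserved(motifs, promoters, max_cv=0.1):
    """True if the motif set occurs at least once and every pairwise
    distance varies by no more than `max_cv` (an assumed cutoff)."""
    if not containing(motifs, promoters):
        return False
    cvs = distance_variation(motifs, promoters)
    return all(cv <= max_cv for cv in cvs.values())

print(support({"M1", "M2"}, promoters))
print(distances_conserved({"M1", "M2"}, promoters))
print(distances_conserved({"M2", "M3"}, promoters))
```

In this toy data, M1 and M2 co-occur in three of the four promoters at distances of 40, 41, and 39 bp, so the pair passes the conservation check. M2 and M3 co-occur just as often but at distances of 70, 144, and 50 bp, so the pair fails it. A standard association rule would treat both motif sets alike because their support is the same; the distance check is what separates them.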