Faculty Advisor or Committee Member

Elke A. Rundensteiner, Advisor

Xiangnan Kong, Committee Member

Mohamed Y. Eltabakh, Committee Member

Fei Wang, Committee Member




With the phenomenal growth of digital devices, coupled with their ever-increasing capabilities for data generation and storage, sequential data is becoming ubiquitous across a wide spectrum of application scenarios. Sequential data takes various forms, such as temporal databases, time series and text (word sequences), where the first is synchronous over time and the latter two are often generated asynchronously. To derive valuable insights, it is critical to learn and understand the behavior dynamics as well as the causal relationships across sequences.

Pharmacovigilance is defined as the science and activities relating to the detection, assessment, understanding and prevention of adverse drug reactions (ADRs) or other drug-related problems. In the post-marketing phase, regulatory agencies monitor the effectiveness and safety of drugs, a process known as post-marketing surveillance. A Spontaneous Reporting System (SRS), such as the U.S. Food and Drug Administration Adverse Event Reporting System (FAERS), collects drug safety complaints over time, providing the key evidence to support regulatory actions on the reported products. With the rapid growth of reporting volume and velocity, data mining techniques promise to help drug safety reviewers perform supervision tasks in a timely fashion. My dissertation studies the problem of exploring, analyzing and modeling various types of sequential data within a typical SRS.

Temporal Correlations Discovery and Exploration. An SRS can be seen as a temporal database where each transaction encodes the co-occurrence of some reported drugs and observed ADRs in a time frame. Temporal association rule learning (TARL) has proven to be a prime candidate for deriving associations among the objects in such a temporal database. However, TARL is parameterized and computationally expensive, making it difficult to use for discovering interesting associations among drugs and ADRs in a timely fashion. Worse yet, existing interestingness measures fail to capture the significance of certain types of association in the context of pharmacovigilance, e.g., ADRs related to drug-drug interactions (DDIs). To discover DDI-related ADRs using TARL, we propose an interestingness measure that aligns with the DDI semantics. We also propose an interactive temporal association analytics framework that supports real-time temporal association derivation and exploration.

Anomaly Detection in Time Series. Abnormal reports may reveal meaningful ADR cases that are overlooked by frequency-based data mining approaches such as association rule learning, where patterns are derived from frequently occurring events. In addition, the sense of abnormality or rareness may vary across contexts. For example, an ADR that normally occurs in the adult population may rarely occur in the youth population, but with life-threatening outcomes. Local outlier factor (LOF) is identified as a suitable approach to capture such local abnormal phenomena. However, the existing LOF algorithm and its variations fail to cope with high-velocity data streams due to their high algorithmic complexity. We propose new local outlier semantics that leverage kernel density estimation (KDE) to effectively detect local outliers in streaming data. We also design a strategy, called KELOS, to continuously detect the top-N KDE-based local outliers over streams; it is the first streaming local outlier detection approach with linear time complexity.

Text Modeling. Language modeling (LM) is a fundamental problem in many natural language processing (NLP) tasks. LM is the development of probabilistic models that can predict the next word in a sequence given the words that precede it. Recently, LM has been advanced by the success of recurrent neural networks (RNNs), which overcome the Markov assumption made in traditional statistical language models. In theory, RNNs such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks can “remember” an arbitrarily long span of history if provided with enough capacity. In practice, however, they do not perform well on very long sequences, because the gradient computation for RNNs becomes increasingly ill-behaved as the expected dependency becomes longer. One way of tackling this problem is to feed succinct information that encodes the semantic structure of the entire document, such as latent topics, as context to guide the modeling process. Clinical narratives that describe complex medical events are often accompanied by meta-information such as a patient's demographics, diagnoses and medications. This structured information implicitly relates to the logical and semantic structure of the entire narrative, and thus affects vocabulary choices in composing the narrative. To leverage this meta-information, we propose a supervised topic compositional neural language model, called MeTRNN, that integrates the strength of supervised topic modeling in capturing global semantics with the capacity of contextual recurrent neural networks (RNNs) in modeling local word dependencies.
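The idea of scoring local outliers with kernel density estimation, mentioned in the abstract, can be illustrated with a small sketch. This is not the KELOS algorithm: it is a batch, one-dimensional toy with neither streaming support nor linear running time. All function names, the Gaussian kernel, and the parameters k and bandwidth are assumptions made for illustration. Each point gets a density estimated from its k nearest neighbors, and its score is the ratio of its neighbors' average density to its own, so a point in a sparser region than its neighbors scores high.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumed choice for this sketch)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nearest_neighbors(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: abs(points[j] - points[i]))
    return others[:k]


def local_density(points, i, k, bandwidth):
    """Kernel density at points[i], estimated from its k nearest neighbors only."""
    nbrs = nearest_neighbors(points, i, k)
    total = sum(gaussian_kernel((points[i] - points[j]) / bandwidth) for j in nbrs)
    return total / (len(nbrs) * bandwidth)


def outlier_scores(points, k=3, bandwidth=1.0):
    """Score = mean density of a point's neighbors / the point's own density."""
    dens = [local_density(points, i, k, bandwidth) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        nbrs = nearest_neighbors(points, i, k)
        mean_nbr = sum(dens[j] for j in nbrs) / len(nbrs)
        scores.append(mean_nbr / max(dens[i], 1e-12))  # guard against zero density
    return scores


def top_n_outliers(points, n, k=3, bandwidth=1.0):
    """Indices of the n points with the highest outlier scores."""
    scores = outlier_scores(points, k, bandwidth)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


# Hypothetical data: five clustered values and one isolated value.
values = [1.0, 1.1, 0.9, 1.2, 1.05, 5.0]
print(top_n_outliers(values, n=1))
```

In this toy input the isolated value at index 5 receives by far the highest score, since its own density is tiny while its neighbors sit in a dense cluster. The sketch recomputes neighbors for every point, so its cost grows much faster than linearly and it is only meant to show the scoring semantics.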


Worcester Polytechnic Institute

Degree Name



Computer Science

Project Type


Date Accepted



Restricted-WPI community only


data mining, information management, pharmacovigilance, sequential data

Available for download on Saturday, April 23, 2022