D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture

Sutherland, Timothy Michael

Etd

D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture

Public

The study of systems for querying data streams, coined Data Stream Management Systems (DSMS), has gained in popularity over the last several years. This new area of research for the database community includes studies in areas such as Sensor Networks, Network Intrusion, and monitoring data such as Medicine, Stock, or Weather feeds. With this new popularity comes increased performance expectations, with increased data sizes and speed and larger more complex query plans as well as high volumes of possibly small queries. Due to the finite resources on a single query processor, future Data Stream Management Systems must distribute their workload to multiple query processors, working together in a synchronized manner. This thesis discusses a new Distributed Continuous Query System (D-CAPE) developed here at WPI that has the ability to distribute query plans over a large cluster of machines. We describe the architecture of the new system, policies for query plan distribution to improve overall performance, as well as techniques for self-tuning query plan re-distribution. D-CAPE is designed to be as flexible as possible for future research. We include a multi-tiered architecture that scales to a large number of query processors. D-CAPE has also been designed to minimize the cost of the communications network by bundling synchronization messages, thus minimizing packets sent between query processors. These messages are also incremental at run-time to aid in minimizing the communication cost of D-CAPE. The architecture allows for the flexible incorporation of different distribution algorithms and operator reallocation policies.. D-CAPE provides an operator reallocation algorithm that is able to seamlessly move an operator(s) across any query processors in our computing cluster. We do so by creating ``pipes"" between query processors to allow the data streams to flow, and then filling these pipes with data streams once execution begins. Operator redistribution is accomplished by systematically reconnecting these pipes as to not interrupt the data flow. Experimental evaluation using our real prototype system (not just simulation) shows that executing a query plan distributed over multiple machines causes no more overhead than processing it on a single centralized query processor; even for rather lightly loaded machines. Further, we find that distributing a query plan among a cluster of query processors can boost performance up to twice that of a centralized DSMS. We conclude that the limitation of each query processor within the distributed network of cooperating processors is not primarily in the volume of the data nor the number of query operators, but rather the number of data connections per processor and the allocation of the stateful and thus most costly operators. We also find that the overhead of distributing query operators is very low, allowing for a potentially frequent dynamic redistribution of query plans during execution.

Creator