Data warehousing is becoming an increasingly important technology for information integration and data analysis. Given the dynamic nature of modern distributed environments, both source data updates and schema changes are likely to occur autonomously and even concurrently in different data sources. Current approaches to maintaining a data warehouse in such dynamic environments schedule maintenance processes sequentially, so that each runs in isolation. Furthermore, each maintenance process handles only a single source update. This limits the performance of current data warehouse maintenance systems in a distributed environment, where every maintenance query incurs network delay as well as IO costs. In this thesis, we propose two optimization strategies that can greatly improve data warehouse maintenance performance for a set of source updates in such dynamic environments. Both strategies support source data updates as well as schema changes. The first strategy, the parallel data warehouse maintainer, schedules multiple maintenance processes concurrently. Based on the DWMS_Transaction model, we formalize the constraints that exist in maintaining data and schema changes concurrently and propose several parallel maintenance process schedulers. The second strategy, the batch data warehouse maintainer, groups multiple source updates and then maintains them within one maintenance process. We propose a technique for compacting the initial sequence of updates and then generating delta changes for each source. We also propose an algorithm that adapts and maintains the data warehouse extent using these delta changes. As a further optimization, the algorithm uses shared queries in the maintenance process. We have designed and implemented both optimization strategies and incorporated them into the existing DyDa/TxnWrap system.
We have conducted extensive experiments on both the parallel and the batch processing of a set of source updates to study the performance achievable under various system settings. Our findings show that parallel maintenance achieves a performance improvement of around 40 to 50% over sequential processing in environments that use single-CPU machines and have little network delay, i.e., without requiring any additional hardware resources. Batch processing achieves an improvement of 400 to 500% over sequential maintenance, at the cost of less frequent refreshes of the data warehouse content.
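The batch idea described above, grouping a sequence of source updates and reducing it to per-source delta changes, can be illustrated with a minimal sketch. This is not the algorithm from the thesis: the function name `compact_updates`, the `(source, op, row)` tuple layout, and the rule that an insert and a later delete of the same row cancel out are all assumptions made for illustration, and schema changes are not handled.

```python
from collections import Counter, defaultdict


def compact_updates(updates):
    """Reduce an ordered list of data updates to net deltas per source.

    Hypothetical input format: (source, op, row) tuples, where op is
    "insert" or "delete" and row is a hashable tuple.
    Returns {source: (inserted_rows, deleted_rows)}.
    """
    net = defaultdict(Counter)
    for source, op, row in updates:
        if op == "insert":
            net[source][row] += 1
        elif op == "delete":
            net[source][row] -= 1
        else:
            raise ValueError(f"unsupported operation: {op}")

    deltas = {}
    for source, counts in net.items():
        # A positive count is a net insert, a negative count a net delete;
        # rows whose inserts and deletes cancel out appear in neither list.
        inserted = [row for row, n in counts.items() for _ in range(n)]
        deleted = [row for row, n in counts.items() for _ in range(-n)]
        deltas[source] = (inserted, deleted)
    return deltas


updates = [
    ("S1", "insert", (1, "a")),
    ("S1", "delete", (1, "a")),  # cancels the insert above
    ("S1", "insert", (2, "b")),
    ("S2", "delete", (7, "x")),
]
result = compact_updates(updates)
```

In this sketch, four source updates shrink to one net insert for `S1` and one net delete for `S2`, so a single maintenance pass would issue fewer maintenance queries than handling each update separately.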
Worcester Polytechnic Institute
All authors have granted to WPI a nonexclusive royalty-free license to distribute copies of the work. Copyright is held by the author or authors, with all rights reserved, unless otherwise noted. If you have any questions, please contact firstname.lastname@example.org.
Liu, Bin, "Optimization Strategies for Data Warehouse Maintenance in Distributed Environments" (2002). Masters Theses (All Theses, All Years). 539.
Schema Change, Batch, Concurrent, Data Warehouse Maintenance, Parallel, Data Update, Data warehousing, Multitasking (Computer science), Electronic data processing, Distributed processing