Detail buku

No BukuT.LN.10.25
UniversitasUniversitas of Tsukuba
PenulisImam Machdi
Judul BukuA STUDY ON PARALLEL HOLISTIC TWIG JOINS FOR XML QUERY PROCESSING
PembimbingProf. Koichi Wada
AbstrakABSTRAK INGGRIS : Along with the growing popularity of the World Wide Web,XML query processing systems play such important roles to process query patterns against such large XML database. One of the core operations in XML query processing system is to find in XML database all matches that satisfy a specified query pattern. Among other algorithms for processing query patterns, the family of holistic twig join algorithms is prominent due to the performance advantage and the capability of handling complex query patterns. Several crucial issues have risen on XML query processing systems that typically manage large collections of heterogeneous XML documents and processes numerous concurrent query patterns. By the time, the tremendous size of XML documents and query patterns definitely overwhelms the processing capabilities to maintain good performance. This challenging issue to maintain and improve the system performance has attracted many researches in the area of parallel XML query processing on shared-nothing cluster system and multi-core system. Another challenging issue-XML data partitioning, which is fundamental to provide the basis of data parallelism, is essentially complex to perform since XML data and its structure have to be preserved in partitions. In addition to data parallelism, other forms of parallelism such as task parallelism, intra-query parallelism and inter-query parallelism are considered as an important issue to potentially improve the parallel XML query processing system performance. Hence, we see the challenging issues as the problem addressed in the parallel holistic twig join for XML query processing. This dissertation compiles the study on parallel holistic twig joins for XML query processing and presents three main contributions, First, we propose a novel static XML data partitioning scheme on a cluster system, the Grid Metadata for XML (GMX). GMX is a conceptual model that is composed of XML metadata derived from static information of XML documents and queries in the past. The model specifies relationship between XML documents and queries in the form of grids as logical partitions. Based on the relationship, a grid is associated with a cost of processing queries for XML documents. The basic notion of partitioning is to decompose a grid with a high cost to finer sub-grids with lower costs. To partition XML data logically in grids, GMX is facilitated with a set of hierarchical partitioning steps: XML document clustering, query clustering, document-based refinement, query-based refinement, and subquery-based refinement that yield partitions from coarse granularity to relatively fine granularity, respectively. Different partition granularities will assist the allocation approach to allocate partitions for achieving workload balance in a cluster system. The second contribution is to propose the Stream-based Partitioning for XML (SPX), which is an on-the-fly XML data partitioning scheme, on a cluster system. The SPX is aimed for handling workload imbalance during query execution due to changing patterns of query execution in the system. It repartitions XML data, which is previously partitioned and allocated by GMX scheme, and reallocates partitions from highly loaded processing nodes to lightly loaded processing nodes for achieving dynamic workload balance. The target of partitioning is streams of XML nodes, rather than XML documents. The SPX scheme characterizes itself as straightforward computation and providing finer partitions, which are certainly appropriate for on-the-fly partitioning. Besides, a partition generated by the SPX scheme contains complete solutions for a query without data dependency on other partitions and eliminates unnecessary XML nodes for faster processing. Lastly, we propose a framework of the parallel holistic twig join algorithm on a multi-core system as the third contribution. The framework aims at developing general parallelism techniques for data parallelism and task parallelism to speed up a single long running query. In data parallelism, the SPX scheme is adopted to partition streams of XML on-the-fly. An important technique is to estimate the partition size for providing finer parallelism and reducing memory access contention. In task parallelism, the holistic twig join algorithm is decomposed into two tasks that are parallelized using the pipelining technique. Another important technique is to estimate the size of data transfer from task one to task two for achieving optimal parallelism. As the result, the performance of task parallelism is able to enhance the performance of data parallelism
JenisThesis