Author: D.D. Janke
Publisher: IOS Press
Year: 2020
Pages: 312
Language: English
Format: pdf (true)
Size: 10.1 MB
The distributed setting of RDF stores in the cloud poses many challenges, including how to optimize data placement on the compute nodes to improve query performance. In this book, a novel benchmarking methodology is developed for data placement strategies, one that overcomes the limitations of existing evaluations by using a distributed RDF store that is independent of the data placement strategy to analyze the effect of each strategy on query performance. Frequently used data placement strategies have been evaluated, and this evaluation challenges the commonly held belief that data placement strategies which emphasize local computation lead to faster query execution. Indeed, results indicate that queries with a high workload can be executed faster on hash-based data placement strategies than on, for example, minimal edge-cut covers. The analysis of additional measurements indicates that vertical parallelization (i.e., a well-distributed workload) may be more important than horizontal containment (i.e., minimal data transport) for efficient query processing.
The book also tests the hypothesis that collocating small connected triple sets on the same compute node, while balancing the number of triples stored on the different compute nodes, leads to a high vertical parallelization. Two data placement strategies of this kind are proposed: the first, found in the literature, is the overpartitioned minimal edge-cut cover, and the second is the newly developed molecule hash cover. The evaluation revealed a balanced query workload and a high horizontal containment, which lead to a high vertical parallelization. As a result, these strategies demonstrated better query performance than other frequently used data placement strategies.
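As a rough illustration of the trade-off described above, the following sketch assigns triples to compute nodes by hashing the subject and then reports two simple proxies: how evenly the triples are spread, and how many triples connect vertices placed on different nodes. This is not the book's benchmark or the molecule hash cover. The function names, the choice of MD5 as a stable hash, and both metrics are assumptions made for illustration only.

```python
import hashlib


def node_for(term: str, num_nodes: int) -> int:
    """Map an RDF term to a compute node using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per process.
    """
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def hash_cover(triples, num_nodes):
    """Place every triple on the node selected by hashing its subject."""
    chunks = [[] for _ in range(num_nodes)]
    for s, p, o in triples:
        chunks[node_for(s, num_nodes)].append((s, p, o))
    return chunks


def imbalance(chunks):
    """Largest chunk size divided by the mean chunk size (1.0 = perfectly even)."""
    sizes = [len(c) for c in chunks]
    return max(sizes) / (sum(sizes) / len(sizes))


def cut_fraction(triples, num_nodes):
    """Fraction of triples whose subject and object hash to different nodes.

    Simplification: every object is treated as a vertex, including literals.
    """
    cut = sum(1 for s, _, o in triples
              if node_for(s, num_nodes) != node_for(o, num_nodes))
    return cut / len(triples)


if __name__ == "__main__":
    # Hypothetical toy data: a ring of ten resources linked by ex:knows.
    data = [(f"ex:r{i}", "ex:knows", f"ex:r{(i + 1) % 10}") for i in range(10)]
    chunks = hash_cover(data, 4)
    print("chunk sizes:", [len(c) for c in chunks])
    print("imbalance:", imbalance(chunks))
    print("cut fraction:", cut_fraction(data, 4))
```

A hash cover tends to keep the imbalance low but cuts many edges, whereas a minimal edge-cut cover aims for the opposite. The blurb's claim is that, for queries with a high workload, the even spread can matter more.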
To speed up query execution, researchers are investigating how the data placement strategies that RDF stores in the cloud use to distribute data items across compute nodes affect query execution time. For this purpose, some researchers evaluate the query execution times of several RDF stores in the cloud that use different data placement strategies. Evaluating and comparing the query performance of different RDF stores in the cloud helps to judge the overall performance of RDF stores.
Another way to identify data placement strategies that may lead to low query execution times is to use a single cloud processing framework such as Apache Spark. The framework is used to create different graph covers of the same data set and then to execute queries on them.
To approximate the query execution times of RDF stores in the cloud that do not rely on cloud processing frameworks, some evaluations propose a hybrid RDF store that can use different data placement strategies to process queries. To efficiently process queries that do not need to exchange intermediate results between compute nodes, each compute node stores its graph chunk in a local centralized RDF store. Queries that need to combine intermediate results from different compute nodes are executed within the cloud processing framework Hadoop MapReduce. The evaluations performed with this hybrid RDF store indicated that data placement strategies emphasizing local computation reduce the query execution time. However, since query processing on top of Hadoop MapReduce can incur a large overhead from possibly several Hadoop jobs, these results may differ from those of RDF stores in the cloud that do not rely on cloud processing frameworks at all. Nevertheless, for queries that do not need to exchange intermediate results, the observations made reflect the behavior of RDF stores in the cloud that do not use cloud processing frameworks.
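To make the distinction between the two kinds of queries concrete, here is a minimal sketch under one specific assumption: triples are placed by hashing their subject, so all triples with the same subject sit on the same compute node. Under that assumption, a query whose triple patterns all share one subject (a star-shaped query) can be answered by each node on its own chunk, while any other shape may require combining intermediate results across nodes. The function name is hypothetical, and this is not the decision procedure of the hybrid store described above.

```python
def is_star_query(patterns):
    """Return True if all triple patterns share the same subject term.

    Each pattern is a (subject, predicate, object) tuple. Variables are
    written with a leading '?', constants as plain strings.
    """
    subjects = {s for s, _, _ in patterns}
    return len(subjects) == 1


if __name__ == "__main__":
    # Both patterns share the subject ?x: local evaluation suffices
    # under subject-based hash placement.
    star = [("?x", "rdf:type", "ex:Person"), ("?x", "ex:name", "?n")]
    # The second pattern's subject is ?y, which may live on another node.
    path = [("?x", "ex:knows", "?y"), ("?y", "ex:name", "?n")]
    print(is_star_query(star))
    print(is_star_query(path))
```

With other data placement strategies the set of locally answerable queries differs, which is one reason results obtained with one strategy do not transfer directly to another.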
Download Study on Data Placement Strategies in Distributed RDF Stores