TY - JOUR
T1 - Lusail: A System for Querying Linked Data at Scale
AU - Abdelaziz, Ibrahim
AU - Mansour, Essam
AU - Ouzzani, Mourad
AU - Aboulnaga, Ashraf
AU - Kalnis, Panos
N1 - KAUST Repository Item: Exported on 2021-07-08
PY - 2017/12/1
Y1 - 2017/12/1
AB - The RDF data model allows publishing interlinked RDF datasets, where each dataset is independently maintained and is queryable via a SPARQL endpoint. Many applications would benefit from querying the resulting large, decentralized, geo-distributed graph through a federated SPARQL query processor. A crucial factor for good performance in federated query processing is pushing as much computation as possible to the local endpoints. Surprisingly, existing federated SPARQL engines are not effective at this task since they rely only on schema information. Consequently, they cause unnecessary data retrieval and communication, leading to poor scalability and response time. This paper addresses these limitations and presents Lusail, a scalable and efficient federated SPARQL system for querying large RDF graphs that are geo-distributed on different endpoints. Lusail uses a novel query rewriting algorithm to push computation to the local endpoints by relying on information about the RDF instances and not only the schema. The query rewriting algorithm has the additional advantage of exposing parallelism in query processing, which Lusail exploits through advanced scheduling at query run time. Our experiments on billions of triples of real and synthetic data show that Lusail outperforms state-of-the-art systems by orders of magnitude in terms of scalability and response time.
UR - http://hdl.handle.net/10754/670044
UR - http://dl.acm.org/doi/10.1145/3186728.3164144
U2 - 10.1145/3186728.3164144
DO - 10.1145/3186728.3164144
M3 - Article
SN - 2150-8097
VL - 11
SP - 485
EP - 498
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 4
ER -