编辑: You—灰機 | 2017-09-24 |
修改时间: 2018-09-20;
采用时间: 2018-11-01 王鑫 等:大规模 RDF 图数据上高效率分布式查询处理
499 Abstract: Knowledge graphs are the main representation form of intelligent data. With the development of knowledge graphs, more and more intelligent data has been released in the form of the resource description framework (RDF). It is known that the semantics of SPARQL correspond to graph homomorphism which is an NP-complete problem. Therefore, how to efficiently answer SPARQL queries in parallel over big RDF graphs has been widely recognized as a challenging problem. There are some research works using the MapReduce computational model to process big RDF graph. However, SPARQL queries in these works are decomposed into the set of query clauses without considering any semantics and graph structure embedded in RDF graph, which leads to overmuch MapReduce iterations. This study first decomposes the SPARQL query graph into a set of stars by utilizing the semantic and structural information embedded RDF graphs as heuristics, which can be matched in fewer MapReduce iterations. Meanwhile, a matching order of these stars is given to reduce intermediate results in MapReduce iterations. During the matching phase, each round of MapReduce adds one star with the join operation. The extensive experiments on both synthetic dataset WatDiv, and real-world dataset DBpedia are carried out. The experiments results demonstrate that the proposed star decomposition-based method can answer SPARQL BGP queries efficiently, which outperforms SHARD and S2X by one order of magnitude. Finally, extensive experiments show that the performance of the optimization algorithms is improved by 49.63% and 78.71% than the basic algorithm over both synthetic and real datasets. Key words: star decomposition;
distributed;
basic graph pattern matching;
large scale RDF graphs;
MapReduce 智能数据管理是人工智能发展的重要基石.知识图谱是智能数据的主要表现形式.资源描述框架(resource description framework,简称 RDF)是由国际万维网联盟(W3C)制定的表示知识图谱的推荐标准.随着知识图谱相 关技术的不断发展,RDF 三元组数据日益激增,并且被广泛地应用在多个领域,包括科学、生物信息、商业智能 和社交网络等[1] .在现实世界中,RDF 数据集往往达到数亿条三元组数据.目前,如何有效管理大规模 RDF 智能 图数据受到越来越多的关注.与传统的关系型数据库中进行的查询不同,RDF 数据上 SPARQL 的基本图模式 (BGP)操作语义等价于数学上的子图同态[2] ,这是一个著名的 NP 完全问题[3] ,即SPARQL 查询可以转化为子图 匹配问题.因此,提高 SPARQL 查询性能已成为亟待解决的技术问题. RDF 数据模型是一种特殊的语义图数据模型,表示为实体-关系或者类关系图,基本构件包括实体、属性和 值,通过属性和值描述资源之间的关系.每条 RDF 三元组数据的格式为(s,p,o),是一个陈述,其中,s 是主语,p 是谓 语,o 是宾语.(s,p,o)表示资源 s 与资源 o 之间具有联系 p,或表示资源 s 具有属性 p 且其取值为 o.RDF 数据的主 语和谓语都是以统一资源标识符(uniform resource identifier,简称 URI)来标识,宾语以 URI 或者字面值(literal) 标识.一条三元组数据可以被看做一条有向边,其中,主语和宾语为顶点.一个三元组的宾语可作为另一个三元 组中的主语,于是,RDF 数据形成一种有向标签图. RDF 智能图数据上的查询语言随着 RDF 数据的发展不断发展,早期查询语言包括 RQL,RDQL,SquishQLD 等[4] .目前被广泛使用的是 SPARQL(simple protocol and RDF query language),该查询语言是 W3C 提出的 RDF 图上的标准查询语言.基本图模式(basic graph pattern,简称 BGP)是SPARQL 中最常用、最基本的查询模式.实 际上,每个 SPARQL 查询的核心部分是一个 BGP 查询.例如:在维基百科数据集 DBpedia 中查询内陆国家及其人 口数,SPARQL BGP 查询如图