How to improve indexing of large SPARQL datasets?

Here is a very simple SPARQL query that takes a very long time (10 seconds) to run in MarkLogic (8.0-6.4). What can I do to speed it up?

The data is based on a subset of GeoNames and is on the order of 22 million triples, it seems.

PREFIX  gj:   <>
PREFIX  rdf:  <>
PREFIX  gn:   <>

SELECT ?this_0 ?name_1
WHERE
  { ?this_0  rdf:type  gj:LocalCounty ;
             gn:name   ?name_1 .
  }
ORDER BY ASC(?name_1)
LIMIT   100



At MarkLogic's suggestion, I ran an update that added a new property to the database, specific to local counties:

INSERT {
  GRAPH <> {
    ?this gj:localCountyName ?name .
  }
} WHERE {
    ?this a gj:LocalCounty .
    ?this gn:name ?name .
}


I also applied some suggested changes to the query:

PREFIX  gj:   <>
PREFIX  rdf:  <>
PREFIX  gn:   <>

SELECT ?this_0 ?name_1
WHERE
  { ?this_0  rdf:type  gj:LocalCounty ;
             gj:localCountyName   ?name_1 .
  }
ORDER BY ?name_1
LIMIT   20


This reduces the total request time to about 4 seconds, which is better but still slow.
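
To illustrate why a dedicated property can help, here is a toy in-memory sketch in Python. It is not MarkLogic code: the data, the sizes, and the helper names (build_triples, materialize_county_names, candidates) are all hypothetical. It only shows that a scan on the county-specific predicate has far fewer triples to consider than a scan on the general gn:name predicate.

```python
# Toy triple store: a list of (subject, predicate, object) tuples.
# All identifiers and data below are hypothetical stand-ins, not the real dataset.
RDF_TYPE = "rdf:type"
GN_NAME = "gn:name"
LOCAL_COUNTY = "gj:LocalCounty"
LOCAL_COUNTY_NAME = "gj:localCountyName"


def build_triples(n_places, n_counties):
    """Create n_places named features; the first n_counties are local counties."""
    triples = []
    for i in range(n_places):
        subject = f"place:{i}"
        kind = LOCAL_COUNTY if i < n_counties else "gj:Other"
        triples.append((subject, RDF_TYPE, kind))
        triples.append((subject, GN_NAME, f"Name {i:05d}"))
    return triples


def materialize_county_names(triples):
    """Mimic the update: copy gn:name to gj:localCountyName for counties only."""
    counties = {s for s, p, o in triples if p == RDF_TYPE and o == LOCAL_COUNTY}
    extra = [(s, LOCAL_COUNTY_NAME, o) for s, p, o in triples
             if p == GN_NAME and s in counties]
    return triples + extra


def candidates(triples, predicate):
    """Number of triples a scan on a single predicate has to consider."""
    return sum(1 for _, p, _ in triples if p == predicate)


triples = materialize_county_names(build_triples(n_places=10_000, n_counties=30))
print(candidates(triples, GN_NAME))            # 10000
print(candidates(triples, LOCAL_COUNTY_NAME))  # 30
```

The cost of this approach is the extra triples that have to be kept in sync whenever a county name changes.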

Trace information from the above request:

2017-05-04 12:00:18.684 Info: <triple-value-statistics count="147540458" unique-subjects="25064012" unique-predicates="81" unique-objects="67600843" xmlns="cts:triple-value-statistics">
2017-05-04 12:00:18.684 Info:   <triple-value-entries>
2017-05-04 12:00:18.684 Info:     <triple-value-entry count="8385355">
2017-05-04 12:00:18.684 Info:       <triple-value></triple-value>
2017-05-04 12:00:18.684 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-05-04 12:00:18.684 Info:       <predicate-statistics count="8356279" unique-subjects="8341989" unique-objects="13"/>
2017-05-04 12:00:18.684 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-05-04 12:00:18.684 Info:     </triple-value-entry>
2017-05-04 12:00:18.684 Info:     <triple-value-entry count="29204">
2017-05-04 12:00:18.684 Info:       <triple-value></triple-value>
2017-05-04 12:00:18.684 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-05-04 12:00:18.684 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-05-04 12:00:18.684 Info:       <object-statistics count="29202" unique-subjects="29202" unique-predicates="3"/>
2017-05-04 12:00:18.684 Info:     </triple-value-entry>
2017-05-04 12:00:18.684 Info:     <triple-value-entry count="29201">
2017-05-04 12:00:18.684 Info:       <triple-value></triple-value>
2017-05-04 12:00:18.684 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-05-04 12:00:18.684 Info:       <predicate-statistics count="29201" unique-subjects="29201" unique-objects="26692"/>
2017-05-04 12:00:18.684 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-05-04 12:00:18.684 Info:     </triple-value-entry>
2017-05-04 12:00:18.684 Info:   </triple-value-entries>
2017-05-04 12:00:18.684 Info: </triple-value-statistics>
2017-05-04 12:00:18.684 Info: [Event:id=SPARQL AST] sessionKey=7777437449602930525
2017-05-04 12:00:18.684 Info:   initialPlan=SPARQLModule[
2017-05-04 12:00:18.684 Info:   Prolog[]
2017-05-04 12:00:18.684 Info:   SPARQLSelect[SPARQLLimit[
2017-05-04 12:00:18.684 Info:       LIMIT GraphNode[Literal "20"^^<>]
2017-05-04 12:00:18.684 Info:       SPARQLProject[order(1)
2017-05-04 12:00:18.684 Info:         GraphNode[Var this_0 0]
2017-05-04 12:00:18.684 Info:         GraphNode[Var name_1 1]
2017-05-04 12:00:18.684 Info:         SPARQLOrder[order(1) UNSORTED
2017-05-04 12:00:18.684 Info:           OrderSpec[
2017-05-04 12:00:18.684 Info:             Variable[QName[(Unknown) name_1] 1]
2017-05-04 12:00:18.684 Info:             ASCENDING EMPTY MIN]
2017-05-04 12:00:18.684 Info:           SPARQLMergeJoin[order(0) hash(0==0) scatter()
2017-05-04 12:00:18.684 Info:             TriplePattern[order(0,1) PSO
2017-05-04 12:00:18.684 Info:               GraphNode[Var this_0 0]
2017-05-04 12:00:18.684 Info:               GraphNode[IRI <>]
2017-05-04 12:00:18.684 Info:               GraphNode[Var name_1 1]]
2017-05-04 12:00:18.684 Info:             TriplePattern[order(0) OPS
2017-05-04 12:00:18.684 Info:               GraphNode[Var this_0 0]
2017-05-04 12:00:18.684 Info:               GraphNode[IRI <>]
2017-05-04 12:00:18.684 Info:               GraphNode[IRI <>]]]]]]]]
2017-05-04 12:00:18.684 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=9 seed=15212683942933123635
2017-05-04 12:00:18.684 Info:   initialCost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.726 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=0
2017-05-04 12:00:18.726 Info:   cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.726 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=1
2017-05-04 12:00:18.726 Info:   cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.728 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=2
2017-05-04 12:00:18.728 Info:   cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.728 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525
2017-05-04 12:00:18.728 Info:   bestCost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.729 Info: [Event:id=SPARQL AST] sessionKey=7777437449602930525
2017-05-04 12:00:18.729 Info:   plan=SPARQLModule[
2017-05-04 12:00:18.729 Info:   Prolog[]
2017-05-04 12:00:18.729 Info:   SPARQLSelect[SPARQLLimit[
2017-05-04 12:00:18.729 Info:       LIMIT GraphNode[Literal "20"^^<>]
2017-05-04 12:00:18.729 Info:       SPARQLProject[order(1)
2017-05-04 12:00:18.729 Info:         GraphNode[Var this_0 0]
2017-05-04 12:00:18.729 Info:         GraphNode[Var name_1 1]
2017-05-04 12:00:18.729 Info:         SPARQLOrder[order(1) UNSORTED
2017-05-04 12:00:18.729 Info:           OrderSpec[
2017-05-04 12:00:18.729 Info:             Variable[QName[(Unknown) name_1] 1]
2017-05-04 12:00:18.729 Info:             ASCENDING EMPTY MIN]
2017-05-04 12:00:18.729 Info:           SPARQLMergeJoin[order(0) hash(0==0) scatter()
2017-05-04 12:00:18.729 Info:             TriplePattern[order(0,1) PSO
2017-05-04 12:00:18.729 Info:               GraphNode[Var this_0 0]
2017-05-04 12:00:18.729 Info:               GraphNode[IRI <>]
2017-05-04 12:00:18.729 Info:               GraphNode[Var name_1 1]]
2017-05-04 12:00:18.729 Info:             TriplePattern[order(0) OPS
2017-05-04 12:00:18.729 Info:               GraphNode[Var this_0 0]
2017-05-04 12:00:18.729 Info:               GraphNode[IRI <>]
2017-05-04 12:00:18.729 Info:               GraphNode[IRI <>]]]]]]]]
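
One possible reading of the plan above, offered as an assumption rather than a statement about MarkLogic internals: the two triple patterns are merge-joined on ?this_0, the joined rows come out ordered by subject (SPARQLOrder is marked UNSORTED with respect to ?name_1), and so every joined row has to be sorted by name before the LIMIT can be applied. The Python sketch below mimics that shape on hypothetical data; merge_join and run_plan are illustrative names only.

```python
def merge_join(left, right):
    """Join two subject-sorted inputs on subject.

    left  : list of (subject, name) rows, sorted by subject
    right : list of unique subjects, sorted
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j]:
            i += 1
        elif left[i][0] > right[j]:
            j += 1
        else:
            out.append(left[i])
            i += 1  # keep j: the same subject may have several names
    return out


def run_plan(names_by_subject, county_subjects, limit):
    """Merge join on subject, then sort ALL joined rows by name, then cut."""
    joined = merge_join(sorted(names_by_subject), sorted(county_subjects))
    ordered = sorted(joined, key=lambda row: row[1])  # full sort before the limit
    return ordered[:limit]


# Hypothetical sample data.
names = [("s3", "Cork"), ("s1", "Kerry"), ("s2", "Antrim"), ("s4", "Not a county")]
counties = ["s1", "s2", "s3"]
print(run_plan(names, counties, 2))  # [('s2', 'Antrim'), ('s3', 'Cork')]
```

Under this reading, the limit of 20 does not reduce the join or sort work; only shrinking the inputs to the join does.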


semantic-web sparql marklogic marklogic-8


2 answers

Depending on your hardware (memory, processor, disks), you can increase performance by increasing the number of forests.



MarkLogic uses a scale-out architecture, so good performance at this scale is not guaranteed on a single machine. The best way to scale is to add more nodes, in particular e-nodes with enough memory on each.

