How can I optimize a SPARQL query that returns additional properties?

How can I optimize my SPARQL query as follows?

The purpose of this request:

  • Specify resource (country resource where countryCode = "US"

    )
  • Get additional properties defined on a resource.

Unfortunately, blocks OPTIONAL

are evaluated before the parent block, which forces the query engine to load all data for all countries.

I want some kind of behavior LEFT OUTER JOIN

, but the request engine doesn't handle it that way.

What can I do to improve query performance?

SELECT  *
WHERE
  { 
    ?type (rdfs:subClassOf)* gj:Country .
    ?this_0  rdf:type        ?type ;
             gn:countryCode  "US"
    # each of these blocks is executed as a standalone query in the engine
    OPTIONAL
      { ?this_0  gn:countryCode  ?countryCode_1}
    OPTIONAL
      { ?this_0  gn:name  ?name_2}
    OPTIONAL
      { ?this_0 gj:cscId  ?cscId_3} 
  }

      

I am using SPARQL REST endpoint in MarkLogic 8.4.

Update:

I tried an option query optimize=2

, but that didn't give me a significant performance improvement:

/v1/graphs/sparql?optimize=2

Related: How to Specify Parameters in SPARQL REST Endpoint for MarkLogic?

Update 2:

Even if I do one of the optional properties, the request is still slow:

WHERE
  {
        ?type (rdfs:subClassOf)* gj:Country .
        ?this_0  rdf:type        ?type ;
             gn:countryCode  "US"; gj:cscId ?cscId_3 ;
  }

      

Do I need to do something special to index this gj: cscId property?

Update 3:

Here is the profile information from the Query Console.

Request profile

Update 4:

Here is the diagnostic information about the trace:

2017-04-27 13:30:17.238 Info: [Event:id=SPARQL Value Frequencies] sessionKey=13846462700334370907 namedGraphs=0 values=
2017-04-27 13:30:17.238 Info: <triple-value-statistics count="154569757" unique-subjects="25445373" unique-predicates="104" unique-objects="67520361" xmlns="cts:triple-value-statistics">
2017-04-27 13:30:17.238 Info:   <triple-value-entries>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="181">
2017-04-27 13:30:17.238 Info:       <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="179" unique-subjects="179" unique-predicates="4"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="15">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.w3.org/2000/01/rdf-schema#subClassOf</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="15" unique-subjects="15" unique-objects="5"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="8739716">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="8359510" unique-subjects="8341619" unique-objects="14"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="8697064">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.geonames.org/ontology#countryCode</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="8323137" unique-subjects="8323137" unique-objects="517"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="1" unique-subjects="1" unique-predicates="1"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="2119305">
2017-04-27 13:30:17.238 Info:       <triple-value datatype="http://www.w3.org/2001/XMLSchema#string">US</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="2061783" unique-subjects="2061783" unique-predicates="3"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="13946907">
2017-04-27 13:30:17.238 Info:       <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="3" unique-predicates="3" unique-objects="3"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="11739004" unique-subjects="11739004" unique-objects="11739004"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:   </triple-value-entries>
2017-04-27 13:30:17.238 Info: </triple-value-statistics>
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.239 Info:   initialPlan=SPARQLModule[
2017-04-27 13:30:17.239 Info:   Prolog[]
2017-04-27 13:30:17.239 Info:   SPARQLSelect[SPARQLProject[order()
2017-04-27 13:30:17.239 Info:       GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:       GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:       GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.239 Info:       SPARQLLeftNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info:         SPARQLNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info:           SPARQLScatterJoin[order(0,1) hash(0==0) scatter(0 = 0)
2017-04-27 13:30:17.239 Info:             SPARQLZeroOrOne[
2017-04-27 13:30:17.239 Info:               GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:               GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info:               SPARQLScatterOneOrMore[
2017-04-27 13:30:17.239 Info:                 GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:                 GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info:                 GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.239 Info:                 GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info:                 TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.239 Info:                   GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info:                   GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.239 Info:                   GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.239 Info:             TriplePattern[order(0,1) OPS
2017-04-27 13:30:17.239 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.239 Info:               GraphNode[Var type 0]]]
2017-04-27 13:30:17.239 Info:           TriplePattern[order(1) SOP
2017-04-27 13:30:17.239 Info:             GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:             GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.239 Info:             GraphNode[Literal "US"]]]
2017-04-27 13:30:17.239 Info:         TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.239 Info:           GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:           GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.239 Info:           GraphNode[Var cscId_3 2]]]]]]
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=30 seed=7088858925989728751
2017-04-27 13:30:17.239 Info:   initialCost=(m:5.99223e+11,r:0,io:(52.9404/167736/1.17487e+09),cpu(1):(0/1.77017e+08/1.18652e+12),mem:8185,c:1.03266e+07,crd:[14,2.06178e+06,1.03266e+07])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=0
2017-04-27 13:30:17.320 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=1
2017-04-27 13:30:17.320 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=2
2017-04-27 13:30:17.326 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info:   bestCost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info:   plan=SPARQLModule[
2017-04-27 13:30:17.326 Info:   Prolog[]
2017-04-27 13:30:17.326 Info:   SPARQLSelect[SPARQLProject[order(1,0)
2017-04-27 13:30:17.326 Info:       GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:       GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:       GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.326 Info:       SPARQLRightMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info:         TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.326 Info:           GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:           GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.326 Info:           GraphNode[Var cscId_3 2]]
2017-04-27 13:30:17.326 Info:         SPARQLHashJoin[order(1,0) hash(0==0) scatter()
2017-04-27 13:30:17.326 Info:           SPARQLZeroOrOne[
2017-04-27 13:30:17.326 Info:             GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:             GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info:             SPARQLBloomOneOrMore[
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info:               GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.326 Info:               GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info:               GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:               TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.326 Info:                 GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info:                 GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.326 Info:                 GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.326 Info:           SPARQLMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info:             TriplePattern[order(1) OPS
2017-04-27 13:30:17.326 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.326 Info:               GraphNode[Literal "US"]]
2017-04-27 13:30:17.326 Info:             TriplePattern[order(1,0) PSO
2017-04-27 13:30:17.326 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.326 Info:               GraphNode[Var type 0]]]]]]]]

      

Update 5:

In some cases, I've found that I can exclude the property path expression ?type

from the request. In one such case, performance improved by two orders of magnitude:

WHERE
  { 
    ?this_0  rdf:type        gj:Country ;
             gn:countryCode  "US"
    # each of these blocks is executed as a standalone query in the engine
    OPTIONAL
      { ?this_0  gn:countryCode  ?countryCode_1}
    OPTIONAL
      { ?this_0  gn:name  ?name_2}
    OPTIONAL
      { ?this_0 gj:cscId  ?cscId_3} 
  }

      

Since this solution changes the result of the query, it doesn't solve all of our use cases.

It looks like the problem is not with the FACTS themselves, but with something to do with the property path expression obfuscated by the query planner so that the properties in the ADVANCED view blocks are looked up independently (which is not executable).

+3


source to share


2 answers


The query optimizer relies on statistics to determine the best order of operations. There is often a bounding triple pattern that can be used to restrict further operations using scatter.

In your case, the statistics do not provide such an obvious restrictive triple pattern. You can see by looking at the triple value statistics output that the string "US" occurs 2,061,783 times as an object - so it's not terribly restrictive.

gj: Country IRI is restrictive (179 times in object position), but unfortunately you need to use that on the right side of the transitive closure operator. It is very difficult to predict how many results the transitive closure operator will return as it is highly dependent on the actual data.

You will find that using a property path like the one below will allow MarkLogic to avoid the null-one operator, which can be a small performance boost:



?this_0 a/rdfs:subClassOf* gj:Country .

      

Going further, if you know that (for example) there is only one gj: country with country code "USA", you can add a constraint on this part of the query to give the optimizer hints on how to handle the query, that is:

select * {
  {
    select * {
      ?this_0 a/rdfs:subClassOf* gj:Country .
      ?this_0  gn:countryCode  'US' .
    } limit 1
  }
  OPTIONAL { ?this_0 gj:cscId  ?cscId_3 } 
}

      

+5


source


Marklogic 8 has a performance issue with property paths with *

. Try to replace

?type (rdfs:subClassOf)* gj:Country .

      



from

{
   BIND(gj:Country AS ?type)
} UNION {
   ?type (rdfs:subClassOf)+ gj:Country .
}

      

0


source







All Articles