Optimizing Marklogic Search Query for Profiling Results

Question

Hi MarkLoggers,

I have a question for you again! I have a collection of 400,000 documents containing zip code (postcode) information. There is one document per zip code, and each document contains 400 features, organized by category and variable as follows:

<postcode id="9728" xmlns="http://www.nvsp.nl/p4">
<meta-data>
<!--
Generated by DIKW for NetwerkVSP ST!P
-->
<version>0.3</version>
<dateCreated>2014-06-28+02:00</dateCreated>
</meta-data>
<category name="Oplages">
<variable name="Oplage" updated="2014-08-12+02:00">
  <segment name="Bruto">1234</segment>
  <segment name="Stickers">234</segment>
  <segment name="Netto">1000</segment>
  <segment name="Aktief">J</segment>
</variable>
</category>
<category name="Automotive">
<variable name="Leaseauto">
<segment name="Leaseauto">2.68822210725987</segment>
</variable>
<variable name="Autotype">
<segment name="De Oudere Stadsrijder">4.61734781858941</segment>
<segment name="De Dure Tweedehandsrijder">6.02534919813761</segment>
<segment name="De Autoloze">41.187790998448</segment>
<segment name="De Leasende Veelrijder">0.608035868253147</segment>
<segment name="De Modale Middenklasser">13.1996896016555</segment>
<segment name="De Vermogende Autoliefhebber">4.45283669598206</segment>
<segment name="De Vermogende Kilometervreter">2.07690981203656</segment>
<segment name="De Doelmatige Budgetrijder">17.2048629073978</segment>
<segment name="De Doorsnee Nieuw Kopende Automob">10.1595102603897</segment>
</variable>
...
400 more cat/var/segment element
...
</postcode>

I need to find a subset of documents based on the id attribute of the postcode element, and return only certain elements from them.

The items to return are in category Oplages, variable Oplage; I need the Bruto and Netto segments.

We currently have a REST API extension that does this, but it is not fast enough.

Example query:

xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace p4ns       = "http://www.nvsp.nl/p4";
declare namespace wijkns     = "http://www.nvsp.nl/wijk";

let $segment := "Bruto"

let $zoeker0 := cts:search(fn:doc(), cts:element-attribute-range-query(xs:QName("p4ns:postcode"), xs:QName("id"), "=", ("2311","2312","2313"))) 
let $zoeker1 := cts:search(/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:postcode"), xs:QName("id"), "=", ("2311","2312","2313"))) 
let $zoeker2 := cts:search(/p4ns:postcode, cts:element-attribute-value-query(xs:QName("p4ns:postcode"), xs:QName("id"), ("2311","2312","2313"))) 

let $inhoud1 := $zoeker0//p4ns:segment[@name=$segment]
let $inhoud2 := $zoeker1//p4ns:segment[@name=$segment]/text()

let $r1 := cts:search(/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:segment"), xs:QName("name"), "=", $segment))

return $inhoud2

Now if I profile this test query, the slow part is looking up the "Bruto" segment in the documents returned by cts:search. I know that I should avoid finding elements in documents via XPath, but I don't know how to combine the two parts so that only the indexes are used ...

Profiling result:

Module:Line:Col    Count    Shallow %    Shallow µs    Deep %    Deep µs    Expression
.main:13:44    1446    27    7127    30    7938    @name = "Bruto"
.main:12:44    1446    27    6956    30    7793    @name = "Bruto"
.main:17:11    1    9.3    2431    9.4    2458    cts:search(fn:collection()/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:segment"), fn:QName("", "name"), "=", $segment))
.main:10:16    1    7.2    1874    7.2    1885    cts:search(fn:collection()/p4ns:postcode, cts:element-attribute-value-query(xs:QName("p4ns:postcode"), fn:QName("", "id"), ("2311", "2312", "2313")))

Query result:

1234
4567
3456

Now my question(s):

1) What does @name = "Bruto" mean in the profile, and why is it slow?

2) Ideally, I would combine the document search and the XPath lookup of the segment element into one expression, but when I put $zoeker inside cts:search it was not recognized ... What is the best approach to get my result back in one go?

Thanks in advance!

Hugo

performance xquery xpath marklogic

Hugo koopmans 13 Aug 14 at 13:29

1 answer

mblakele · Accepted Answer · 2014-08-13T17:05:55+0000

I see two main problems: there are too many trips to the database, and those trips return much more data than you actually need. The goal is to keep the number of database lookups to a minimum and to make each lookup as precise as possible.

In this case, the main way to perform database lookups is cts:search. There are several calls: perhaps too many, and the results of some are never used. I think some of them are leftover experiments. When you profile, it is important to clean up the code first.

Next, most of the profiler's time is spent in the @name=$segment XPath predicate. It is evaluated twice for no good reason. Get rid of the repetition and the query will go faster.
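As a rough sketch of that cleanup, the question's query could be reduced to one search and one predicate. This only removes the duplicated work; it does not change how the predicate itself is evaluated. The variable names $ids and $docs are introduced here for illustration.

```xquery
xquery version "1.0-ml";
declare namespace p4ns = "http://www.nvsp.nl/p4";

let $segment := "Bruto"
let $ids := ("2311", "2312", "2313")
(: One search only: the index lookup selects the matching postcode documents. :)
let $docs := cts:search(
  /p4ns:postcode,
  cts:element-attribute-value-query(
    xs:QName("p4ns:postcode"), xs:QName("id"), $ids))
(: One predicate only: evaluated against each document returned above. :)
return $docs//p4ns:segment[@name = $segment]/text()
```

Profiling this version next to the original should show whether the duplicated predicate accounted for roughly half of the time, as the two near-identical profiler rows suggest.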

However, another reason @name=$segment is slow is that MarkLogic indexes documents, not nodes. It indexes the names and values of the nodes, but each index entry points to a document (strictly speaking, to a fragment), not to a node inside it. So when one document has tens or hundreds of index entries for segment/@name values, all of those index entries point to the document root. When you query only for segments that match a specific name, the index lookup matches the entire document. Therefore each document tree must be walked to find the matching nodes. This can be costly in CPU cycles, and that's what the profiler shows.

There is no cure for this short of restructuring the documents, or perhaps doing something clever with co-occurrences. However, we can clean up your query and convert it to a single XPath expression using full paths. See if this is enough for your use case.

declare namespace p4ns="http://www.nvsp.nl/p4" ;

(: These might be external parameters. :)
let $segment := "Bruto"
let $ids := ("2311","2312","2313")
return collection()/p4ns:postcode[
  @id = $ids]/p4ns:category/p4ns:variable/p4ns:segment[
  @name = $segment]/string()

If I insert your XML sample and change its id to 2313, this returns a single value, 1234. Profiling shows 33 expressions in less than 1 ms, with 66% of the time spent on the database lookup through XPath. However, it still has to look at all the segment/@name values: in this case there are 14 of them, and they take 10% of the time.

Note that I have not used cts:search or any of your range indexes. MarkLogic automatically indexes node values, so XPath value comparisons can be resolved from the indexes. You only need range indexes for special operations: for example facets, sorting, and inequality lookups.

You could improve this a bit:

(collection()/p4ns:postcode[
  @id = $ids]/p4ns:category/p4ns:variable/p4ns:segment[
  @name = $segment])[1]/string()

We now tell the evaluator that only one match is expected. This way it will stop as soon as it finds Bruto, which here is near the beginning of the document. In this case it is the very first segment, but on average (...)[1] should cut the number of expressions evaluated in half. Other tree-pruning techniques should also help: for example, you could pass in the category and variable names as well and express them as XPath predicates.
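A minimal sketch of that pruning, assuming the category and variable names are known up front. The variables $category and $variable are introduced here for illustration, and the [1] is applied per postcode document so that a list of several ids still yields one value per document. Whether this is actually faster on the real data would need to be confirmed by profiling.

```xquery
xquery version "1.0-ml";
declare namespace p4ns = "http://www.nvsp.nl/p4";

(: These might be external parameters. :)
let $ids := ("2311", "2312", "2313")
let $category := "Oplages"
let $variable := "Oplage"
let $segment := "Bruto"
for $postcode in collection()/p4ns:postcode[@id = $ids]
return
  (: Predicates on category and variable skip the other subtrees,
     and [1] stops at the first matching segment in this document. :)
  ($postcode
    /p4ns:category[@name = $category]
    /p4ns:variable[@name = $variable]
    /p4ns:segment[@name = $segment])[1]/string()
```

To fetch both Bruto and Netto in one pass, $segment could be bound to the sequence ("Bruto", "Netto") and the [1] dropped, since the general comparison = matches any item in the sequence.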

This may be a good time to back up and see the big picture. What are you trying to accomplish with this query? There may be a much more efficient way to achieve your goal.

If this is your most common use case, then ideally you would restructure your documents so that every lookup by id and segment becomes a doc($uri) call with a computable URI. I'm not sure that is a good idea in this particular case, but I don't have complete knowledge of your application.
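To make the idea concrete, here is a sketch under a purely hypothetical layout: one small document per postcode id and segment, stored at a URI built from the id, category, variable, and segment names. Neither the URI scheme nor the helper function local:segment-uri comes from the question; both are assumptions for illustration.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI scheme, e.g. /postcode/2313/Oplages/Oplage/Bruto.xml :)
declare function local:segment-uri(
  $id as xs:string,
  $category as xs:string,
  $variable as xs:string,
  $segment as xs:string
) as xs:string
{
  concat(
    string-join(("/postcode", $id, $category, $variable, $segment), "/"),
    ".xml")
};

(: Each value is fetched directly by URI; no XPath predicate over a
   large document tree is involved. :)
for $id in ("2311", "2312", "2313")
return doc(local:segment-uri($id, "Oplages", "Oplage", "Bruto"))/string()
```

Segment names that contain spaces, such as "De Autoloze", would need some encoding convention in a scheme like this, and 400,000 postcodes times several hundred segments is a very large number of documents, which is one reason to treat this as a thought experiment rather than a recommendation.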

Another approach is to use range indexes and cts:value-co-occurrences (https://docs.marklogic.com/cts:value-co-occurrences) so that the XML is never examined at all. However, this is a complex approach, and I am not going to explore it here.
