XQuery Full Text Search with Word1 and NOT Word2
Following is the XML structure -
<Docs>
<Doc>
<Name>Doc 1</Name>
<Notes>
<specialNote>
This is a special note section.
<B>This B Tag is used for highlighting any text and is optional</B>
<U>This U Tag will underline any text and is optional</U>
<I>This I Tag is used for highlighting any text and is optional</I>
</specialNote>
<generalNote>
<P>
This will store the general notes and might have number of paragraphs. This is para no 1. NO Child Tags here
</P>
<P>
This is para no 2
</P>
</generalNote>
</Notes>
<Desc>
<P>
This is used for Description and might have number of paragraphs. Here too, there will be B, U and I Tags for highlighting the description text and are optional
<B>Bold</B>
<I>Italic</I>
<U>Underline</U>
</P>
<P>
This is description para no 2 with I and U Tags
<I>Italic</I>
<U>Underline</U>
</P>
</Desc>
</Doc>
There will be 1000 tags Doc
. I want to give the user search criteria where he can search WORD1
and NOT WORD2
. Following is the request -
for $x in doc('Documents')/Docs/Doc[Notes/specialNote/text() contains text 'Tom'
ftand ftnot 'jerry' or
Notes/specialNote/text() contains text 'Tom' ftand ftnot 'jerry' or
Notes/specialNote/B/text() contains text 'Tom' ftand ftnot 'jerry' or
Notes/specialNote/I/text() contains text 'Tom' ftand ftnot 'jerry' or
Notes/specialNote/U/text() contains text 'Tom' ftand ftnot 'jerry' or
Notes/generalNote/P/text() contains text 'Tom' ftand ftnot 'jerry' or
Desc/P/text() contains text 'Tom' ftand ftnot 'jerry' or
Desc/P/B/text() contains text 'Tom' ftand ftnot 'jerry' or
Desc/P/I/text() contains text 'Tom' ftand ftnot 'jerry' or
Desc/P/U/text() contains text 'Tom' ftand ftnot 'jerry']
return $x/Name
The result of this query is incorrect. I mean the result contains some document with Tom
and jerry
. So I changed the request to -
for $x in doc('Documents')/Docs/Doc[. contains text 'Tom' ftand ftnot 'jerry']
return $x/Name
This query gives me the exact result, i.e. only those documents with Tom
and Not jerry
, BUT GOING ON A HUGE TIME ... Approx. 45 seconds, while the previous one took 10 seconds!
I am using XML BaseX 7.5 database.
We need expert comments on this :)
source to share
The first request checks each text node in the document separately, so it <P><B>Tom</B> and <I>Jerry</I></P>
will match because the first text node contains Tom, but not Jerry.
The second query performs a full-text search on the entire text content of the elements Doc
as if they were concatenated into one string. There is no answer to this ( BaseX full-text index which indexes each text node separately.
The solution would be to do a full text search for each term separately and merge the results at the end. This can be done for each text node separately, so an index can be used:
for $x in (doc('Documents')/Docs/Doc[.//text() contains text 'Tom']
except doc('Documents')/Docs/Doc[.//text() contains text 'Jerry'])
return $x/Name
The above query is rewritten by the query optimizer to this equivalent using two index calls:
for $x in (db:fulltext("Documents", "Tom")/ancestor::*:Doc
except db:fulltext("Documents", "Jerry")/ancestor::*:Doc)
return $x/Name
You can even customize the order in which you combine the results so that the intermediate results are small if you want.
source to share