XQuery Full Text Search with Word1 and NOT Word2

Following is the XML structure -

<Docs>
  <Doc>
    <Name>Doc 1</Name>
    <Notes>
        <specialNote>
          This is a special note section. 
           <B>This B Tag is used for highlighting any text and is optional</B>        
           <U>This U Tag will underline any text and is optional</U>        
           <I>This I Tag is used for highlighting any text and is optional</I>        
        </specialNote>      
        <generalNote>
           <P>
            This will store the general notes and might have number of paragraphs. This is para no 1. NO Child Tags here         
           </P>
           <P>
            This is para no 2            
           </P>  
        </generalNote>      
    </Notes>  
    <Desc>
        <P>
          This is used for Description and might have number of paragraphs. Here too, there will be B, U and I Tags for highlighting the description text and are optional
          <B>Bold</B>
          <I>Italic</I>
          <U>Underline</U>
        </P>
        <P>
          This is description para no 2 with I and U Tags
          <I>Italic</I>
          <U>Underline</U>
        </P>      
    </Desc>
</Doc>

      

There will be 1000 tags Doc

. I want to give the user search criteria where he can search WORD1

and NOT WORD2

. Following is the request -

for $x in doc('Documents')/Docs/Doc[Notes/specialNote/text() contains text 'Tom' 
ftand  ftnot 'jerry' or 
Notes/specialNote/text() contains text 'Tom' ftand ftnot 'jerry' or 
Notes/specialNote/B/text() contains text 'Tom' ftand ftnot 'jerry' or 
Notes/specialNote/I/text() contains text 'Tom' ftand ftnot 'jerry' or 
Notes/specialNote/U/text() contains text 'Tom' ftand ftnot 'jerry' or
Notes/generalNote/P/text() contains text 'Tom' ftand ftnot 'jerry' or 
Desc/P/text() contains text 'Tom' ftand ftnot 'jerry' or 
Desc/P/B/text() contains text 'Tom' ftand ftnot 'jerry' or 
Desc/P/I/text() contains text 'Tom' ftand ftnot 'jerry' or 
Desc/P/U/text() contains text 'Tom' ftand ftnot 'jerry']
return $x/Name

      

The result of this query is incorrect. I mean the result contains some document with Tom

and jerry

. So I changed the request to -

for $x in doc('Documents')/Docs/Doc[. contains text 'Tom' ftand ftnot 'jerry'] 
return $x/Name

      

This query gives me the exact result, i.e. only those documents with Tom

and Not jerry

, BUT GOING ON A HUGE TIME ... Approx. 45 seconds, while the previous one took 10 seconds!

I am using XML BaseX 7.5 database.

We need expert comments on this :)

+3


source to share


1 answer


The first request checks each text node in the document separately, so it <P><B>Tom</B> and <I>Jerry</I></P>

will match because the first text node contains Tom, but not Jerry.

The second query performs a full-text search on the entire text content of the elements Doc

as if they were concatenated into one string. There is no answer to this ( BaseX full-text index which indexes each text node separately.

The solution would be to do a full text search for each term separately and merge the results at the end. This can be done for each text node separately, so an index can be used:

for $x in (doc('Documents')/Docs/Doc[.//text() contains text 'Tom']
            except doc('Documents')/Docs/Doc[.//text() contains text 'Jerry'])
return $x/Name

      



The above query is rewritten by the query optimizer to this equivalent using two index calls:

for $x in (db:fulltext("Documents", "Tom")/ancestor::*:Doc
            except db:fulltext("Documents", "Jerry")/ancestor::*:Doc)
return $x/Name

      

You can even customize the order in which you combine the results so that the intermediate results are small if you want.

+4


source







All Articles