Counting all unique words in an unstructured document using index data

I have uploaded unstructured HTML documents to MarkLogic and, for any given document URI, I need a way to use indexes/lexicons to produce word counts for all unique words.

For example, let's say I have the file below, stored in the URI "/html/example.html":

<html>
<head><title>EXAMPLE</title></head>
<body>
<h1>This is a header</h1>
<div class="highlight">This word is highlighted</div>
<p> And these words are inside a paragraph tag</p>
</body>
</html>


In XQuery, I would call my function, passing in a URI, and get the following results:

EXAMPLE 1
This 2
is 2
a 2
header 1
word 1
highlighted 1
And 1
these 1
words 1
are 1
inside 1
paragraph 1
tag 1


Note that I only need the word count for the words inside the tags, not the tags themselves.

Is there a way to do this efficiently (using index or lexicon data)?

Thanks,

grifster



2 answers


You are asking for word counts "for any given document URI". But you are assuming the solution involves indexes or lexicons, and that is not necessarily a good assumption. If you want something document-specific from a document-oriented database, it is often best to work with the document itself.

So let's focus on an efficient word-count solution for a single document, and go from there. OK?

Here is how we can get the word counts for one element, including any of its children. That element could be the root of your document: doc($uri)/*.

declare function local:word-count($root as element())
as map:map
{
  (: Map from each word to its count. :)
  let $m := map:map()
  (: Tokenize every text node, keep only the word tokens,
   : and increment the count for each occurrence. :)
  let $_ := cts:tokenize(
    $root//text())[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};


This builds a map, which I find more flexible than flat text. Each key is a word, and its value is the count. The variable $doc below already contains your sample XML.

let $m := local:word-count($doc)
for $k in map:keys($m)
return text { $k, map:get($m, $k) }

inside 1
This 2
is 2
paragraph 1
highlighted 1
EXAMPLE 1
header 1
are 1
word 1
words 1
these 1
tag 1
And 1
a 2


Note that the order of the map keys is not defined. Add an order by clause if you like.

let $m := local:word-count($doc)
for $k in map:keys($m)
let $v := map:get($m, $k)
order by $v descending
return text { $k, $v }


If you want to query the entire database, Geert's solution using cts:words (the other answer) might look pretty good. It uses a lexicon for the word list and index probes for the matching. But it will still end up retrieving XML from every matching document for every matching word: O(n·m). To do that, the code has to do essentially the same work as local:word-count, just one word at a time. Many words will match the same documents: "the" may be in documents A and B, and "then" may also be in A and B. Despite using lexicons and indexes, this approach will usually be slower than simply applying local:word-count to the whole database.



If you want to query the entire database and are willing to change the XML, you could wrap every word in a word element (or whatever element name you prefer). Then create a string element range index on word. Now you can use cts:values and cts:frequency to pull the answer straight out of the range index. That would be O(n), at much lower cost than the cts:words approach, and probably faster than local:word-count too, because it would not have to visit any documents at all. But the resulting XML is pretty clunky.
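
For illustration only (my sketch, not from the original answer), the range-index lookup might then be as simple as this, assuming a string element range index has been configured on a word element in no namespace:

(: A sketch, not part of the original answer: assumes a string
 : element range index on "word" (no namespace) already exists. :)
for $w in cts:element-values(xs:QName("word"))
order by cts:frequency($w) descending
return text { $w, cts:frequency($w) }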

Back to applying local:word-count to the whole database. Start by refactoring the code so that the caller supplies the map. That way we can build a single map of word counts for the entire database, and we only look at each document once.

declare function local:word-count(
  $m as map:map,
  $root as element()*)
as map:map
{
  (: Same tokenize-and-count logic as before, but accumulate
   : into the caller-supplied map. :)
  let $_ := cts:tokenize(
    $root//text())[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};

let $m := map:map()
let $_ := local:word-count($m, collection()/*)
for $k in map:keys($m)
let $v := map:get($m, $k)
order by $v descending
return text { $k, $v }


On my laptop this processed 151 documents in under 100 ms. There were about 8,100 words, 925 of them distinct. Getting the same results from cts:words and cts:search took just under 1 sec, so local:word-count is more efficient, and probably efficient enough for the job.

Now that you can efficiently build a map of word counts, what if you could save it? Essentially you would be building your own "index" of the words. This is easy, because maps have an XML serialization.

(: Construct a map. :)
map:map()
(: The document constructor creates a document-node with XML inside. :)
! document { . }
(: Construct a map from the XML root element. :)
! map:map(*)


That way you can call local:word-count on every new XML document as it is inserted or updated, then save the map of word counts in the document's properties. Do that using a CPF pipeline, with your own code via RecordLoader, in a REST upload endpoint, etc.
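
As a rough sketch (my illustration, not part of the original answer), storing the serialized map as a property could look like this; the property name word-counts and the URI are placeholders:

(: Serialize the word-count map and save it as a document property.
 : "word-counts" is a hypothetical property name. Note that
 : xdmp:document-set-properties replaces any existing properties;
 : xdmp:document-add-properties is the additive alternative. :)
let $uri := "/html/example.html"
let $m := local:word-count(map:map(), doc($uri)/*)
return xdmp:document-set-properties(
  $uri,
  element word-counts { document { $m }/* })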

If you need the word counts for a single document, it is a simple call to xdmp:document-properties or xdmp:document-get-properties, followed by a call to the map:map constructor on the right XML. If you need word counts for multiple documents, it is easy to write XQuery that merges those maps into a single result.
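
Again as a sketch under the same assumed word-counts property, rebuilding the saved maps and merging them could look like this; local:merge is a hypothetical helper that sums the counts for keys present in both maps:

(: "word-counts" and local:merge are assumptions for illustration,
 : not part of the original answer. :)
declare function local:merge($a as map:map, $b as map:map)
as map:map
{
  let $_ :=
    for $k in map:keys($b)
    return map:put($a, $k,
      sum((map:get($a, $k), map:get($b, $k))))
  return $a
};

let $combined := map:map()
let $_ :=
  for $uri in ("/html/example.html", "/html/other.html")
  let $saved := xdmp:document-get-properties(
    $uri, xs:QName("word-counts"))/map:map
  where exists($saved)
  return local:merge($combined, map:map($saved))
for $k in map:keys($combined)
order by map:get($combined, $k) descending
return text { $k, map:get($combined, $k) }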



Normally you would use cts:frequency for this purpose. Unfortunately, it can only be applied to values retrieved from value lexicons, not to values from word lexicons. So I am afraid you will have to do the counting by hand, unless you can tokenize all the words upfront into an element you can put a range index on. The closest I could get is:

(: For every word in the database-wide word lexicon, count the
 : elements that contain it. :)
for $word in cts:words()
let $freq := count(cts:search(doc()//*, $word))
order by $freq descending
return concat($word, ' - ', $freq)




Note: doc() searches across all documents, so this does not scale well. But if you are only interested in the counts within one document, performance might be good enough for you.
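
For instance (my sketch, not part of this answer), both the lexicon and the search can be restricted to a single document; the URI is the one from the question:

(: A sketch limiting the approach above to one document.
 : Assumes the word lexicon is enabled. :)
let $uri := "/html/example.html"
for $word in cts:words((), (), cts:document-query($uri))
let $freq := count(cts:search(doc($uri)//*, $word))
where $freq gt 0
order by $freq descending
return concat($word, ' - ', $freq)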
