Optimal way to store XML data in MarkLogic

Question

I am new to the MarkLogic world. My input comes from a custom Java application that sends a request to moreover.com and receives an XML data feed every 30 seconds. The results are returned in XML format. The Java application uses the XCC API (the MarkLogic API) to insert the extracted data into MarkLogic as a single XML file. The data arrives at about 6 MB per minute, so if the application runs for a day or so, the total will grow into gigabytes. I am not aware of any admin configuration I need in order to store this much data in a single XML file in MarkLogic. Can anyone confirm my approach, or suggest whether I need to make any configuration changes at the admin level? The XML structure looks like this:

<?xml version="1.0" encoding="UTF-8"?>      
<moreovercontentdump>        
<article id="_6232903453">           
<description></description>
<author></author>       
<source_category>Local</source_category>    
<genre>General</genre>  
<publisher></publisher> 
<media_type>text</media_type>   
<docurl>http://www.ilrestodelcarlino.it</docurl>    
<harvest_time>Apr  4 2012  4:28PM</harvest_time>    
<valid_time>May 14 2012  4:27PM</valid_time>    
</article>
<article id="_6232903453">           
<description></description>
<author></author>       
<source_category>Local</source_category>    
<genre>General</genre>  
<publisher></publisher> 
<media_type>text</media_type>   
<docurl>http://www.ilrestodelcarlino.it</docurl>    
<harvest_time>Apr  4 2012  4:28PM</harvest_time>    
<valid_time>May 14 2012  4:27PM</valid_time>    
</article>
<article id="_6232903453">           
<description></description>
<author></author>       
<source_category>Local</source_category>    
<genre>General</genre>  
<publisher></publisher> 
<media_type>text</media_type>   
<docurl>http://www.ilrestodelcarlino.it</docurl>    
<harvest_time>Apr  4 2012  4:28PM</harvest_time>    
<valid_time>May 14 2012  4:27PM</valid_time>    
</article>
</moreovercontentdump>

marklogic

Pankaj 04 Apr 12 at 20:27

2 answers

mblakele · Answer 1 · 2012-04-05T02:39:00+0000

Looking at the XML sample, I think you probably want to store each article as its own document. You can write a FLWOR expression that calls xdmp:document-insert, or one that calls xdmp:spawn if you prefer to insert each document in an asynchronous task.

The simplest code might look like this:

for $article in xdmp:http-get($some-url, $options)/moreovercontentdump/article
let $uri := concat('moreover/', $article/@id)
return xdmp:document-insert($uri, $article)
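
Since the asker's loader is a Java application, the same split could also be done on the client before anything is sent to MarkLogic. The sketch below is one possible way to do that using only the JDK's XML APIs. The class name ArticleSplitter and the method split are hypothetical, and the actual insert through XCC is deliberately left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: splits one feed response into per-article documents.
final class ArticleSplitter {

    /** Returns a map of document URI to the serialized <article> element. */
    static Map<String, String> split(String feedXml) {
        try {
            Document feed = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));

            // Serialize each article without an XML declaration.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            Map<String, String> docs = new LinkedHashMap<>();
            NodeList articles = feed.getElementsByTagName("article");
            for (int i = 0; i < articles.getLength(); i++) {
                Element article = (Element) articles.item(i);
                // Same URI scheme as the XQuery above: 'moreover/' + @id.
                String uri = "moreover/" + article.getAttribute("id");
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(article), new StreamResult(out));
                docs.put(uri, out.toString());
            }
            return docs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not split feed XML", e);
        }
    }
}
```

Each map entry can then be inserted as a separate document. Articles that share an id map to the same URI, so only the last one is kept, much as inserting at an existing URI replaces the document already stored there.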

You can improve this code by rewriting some of the original XML. For example, you can reformat the harvest_time and valid_time elements as xs:dateTime values. This way you can create a range index on these values.
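
As a rough illustration of that reformatting, the following Java sketch converts the feed's timestamp layout (as in "Apr  4 2012  4:28PM") to the lexical form of xs:dateTime. It assumes English month abbreviations and that the feed gives no time zone, so none is emitted. The class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical helper for rewriting harvest_time / valid_time values.
final class MoreoverDates {

    // Feed layout after whitespace is collapsed, e.g. "Apr 4 2012 4:28PM".
    private static final DateTimeFormatter FEED_FORMAT = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("MMM d uuuu h:mma")
            .toFormatter(Locale.ENGLISH);

    // xs:dateTime lexical form without a time zone; seconds are always written.
    private static final DateTimeFormatter XS_DATE_TIME =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss");

    static String toXsDateTime(String feedValue) {
        // The feed pads single-digit days and hours with extra spaces.
        String normalized = feedValue.trim().replaceAll("\\s+", " ");
        return LocalDateTime.parse(normalized, FEED_FORMAT).format(XS_DATE_TIME);
    }
}
```

For example, MoreoverDates.toXsDateTime("Apr  4 2012  4:28PM") yields "2012-04-04T16:28:00". If the feed's time zone is known, it would be better to append it so that range queries compare like with like.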

Eric Bloch · Answer 2 · 2012-04-05T01:01:22+0000

In general, you will be much better off storing each response from moreover.com in MarkLogic as its own document. In a sense, documents in MarkLogic are like rows in an RDBMS.

Also, if you do one insert every 30 seconds, I have trouble seeing how that adds up to 6 MB per minute of ingestion. Are there any details you left out?
