Best way to extract large xml block from large xml file

Question

I am extracting large blocks from XML files using XPath. My XML files are big; they come from PubMed. An example of the kind of file I am working with:

ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0001.xml.gz

So using

 Node result = (Node)xPath.evaluate("PubmedArticleSet/PubmedArticle[MedlineCitation/PMID = "+PMIDtoSearch+"]", doc, XPathConstants.NODE);

I get the article whose PMID is PMIDtoSearch, which is exactly what I want. But it takes a long time. I have to do this about 800,000 times, so it would take over two months with this solution. Some blocks are over 400 lines, and each XML file is over 4 million lines.

I've also tried a solution using the getElementsByTagName function, but it takes almost the same time.

Do you know how to improve the solution?

Thanks.

+3

java xml xquery xpath sax

César 31 Jul 17 at 8:48

3 answers

As @KevinBrown points out, a database may very well be the right answer. But if this is a one-off process, there are probably solutions that are much faster than yours yet don't require the hard work of learning to set up an XML database.

There are two main costs in the approach you are using: parsing XML documents to create an in-memory tree, and then searching the document in memory to find a specific identifier value. I would guess that the cost of parsing is probably an order of magnitude more than the cost of searching.

So there are two components to this:

First, you need to make sure that you parse each source document only once (and not once per query). You haven't told us enough for me to tell whether you are already doing this.
Second, if you are retrieving many pieces of data from a single document, you want to do so without a serial search for each one. The best way to accomplish this is to use a query processor that builds an index to optimize the query (such as Saxon-EE). Alternatively, you can build indexes "by hand", for example using XQuery 3.1 maps or xsl:key in XSLT.
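
Both points can be sketched in plain Java. This is only an illustration under stated assumptions, not what Saxon-EE or xsl:key do internally: it swaps in a hand-built HashMap over a DOM tree. The class ArticleIndex and its methods build and find are hypothetical names; the element names are taken from the XPath in the question.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: parse one PubMed file once, then answer PMID lookups from a map.
class ArticleIndex {
    private final Map<String, Element> byPmid = new HashMap<>();

    // Parse the document a single time and index every PubmedArticle by its PMID text.
    static ArticleIndex build(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        ArticleIndex index = new ArticleIndex();
        // Walk only the root's direct children, mirroring PubmedArticleSet/PubmedArticle.
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && "PubmedArticle".equals(n.getNodeName())) {
                Element article = (Element) n;
                Element citation = child(article, "MedlineCitation");
                Element pmid = citation == null ? null : child(citation, "PMID");
                if (pmid != null) {
                    index.byPmid.put(pmid.getTextContent().trim(), article);
                }
            }
        }
        return index;
    }

    // Hash lookup instead of a scan; returns null when the PMID is not in this file.
    Element find(String pmid) {
        return byPmid.get(pmid);
    }

    // First direct child element with the given name, or null if there is none.
    private static Element child(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }
}
```

With something like this, the 800,000 lookups would cost one parse per file plus one map lookup per PMID. Real PubMed files may declare a DOCTYPE that the parser tries to resolve, and the DOM for a file this size needs a generous heap; neither is handled in this sketch.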

+2

Michael Kay 01 Aug 17 at 21:56

This is the code that runs the XPath query. On my laptop the results look decent: it took about 1 second regardless of the PMID value. How do you want to extract the text? I can update the code to target that.

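import com.ximpleware.AutoPilot;
import com.ximpleware.VTDException;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

// The class name is arbitrary; it only makes the snippet compilable.
public class PmidLookup {
    public static void main(String[] args) throws VTDException {
        // Parse the file once into VTD-XML's in-memory form (namespace awareness off).
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("d:\\xml\\medline17n0001.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        System.out.println("nesting level" + vn.getNestingLevel());
        String PMIDtoSearch = "30000";
        ap.selectXPath("/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID = " + PMIDtoSearch + "]");
        System.out.println("====>" + ap.getExprString());
        int i = 0, count = 0;
        System.out.println(" token count ====> " + vn.getTokenCount());
        // evalXPath() returns the token index of each match, or -1 when there are no more.
        while ((i = ap.evalXPath()) != -1) {
            count++;
            // For an element token, toString(i) gives the element name, not its content.
            System.out.println("string ====>" + vn.toString(i));
        }
        System.out.println(" count ===> " + count);
    }
}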

0

vtd-xml-author 01 Aug 17 at 12:17 am

Kevin Brown · Accepted Answer · 2017-08-01T01:55:11+0000

I took your document and loaded it into eXist-db, then ran your query, essentially this:

xquery version "3.0";
let $medline := '/db/Medline/Data'
let $doc := 'medline17n0001.xml'
let $PMID := request:get-parameter("PMID", "")
let $article := doc(concat($medline,'/',$doc))/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID=$PMID]
return
$article

The article is returned in 400 milliseconds from the remote server. If I beefed up this server, I would expect that time to drop, and it could handle multiple concurrent requests. If you had everything local, it would be faster still.

Try it yourself. I left the data on a test server (and remember, this is a remote request to an Amazon micro server in California):

http://54.241.15.166/get-article2.xq?PMID=8

http://54.241.15.166/get-article2.xq?PMID=6

http://54.241.15.166/get-article2.xq?PMID=1

And, of course, the entire document is there. You can simply change the PMID in the request to 667 or 999 or whatever and get back the matching document fragment.
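
From the Java side, calling such an endpoint could look like the sketch below. It assumes Java 11 or later and a server that still answers at a URL of the form shown above. The class ArticleClient and its methods uriFor and fetch are hypothetical names, not part of eXist-db or of the original answer.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical client for an XQuery endpoint that takes a PMID query parameter.
class ArticleClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String endpoint;

    // endpoint is the script URL without a query string, e.g. ".../get-article2.xq".
    ArticleClient(String endpoint) {
        this.endpoint = endpoint;
    }

    // Build the request URI, URL-encoding the PMID value.
    URI uriFor(String pmid) {
        return URI.create(endpoint + "?PMID=" + URLEncoder.encode(pmid, StandardCharsets.UTF_8));
    }

    // Fetch the serialized PubmedArticle fragment; fails on any non-200 response.
    String fetch(String pmid) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uriFor(pmid)).GET().build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " for PMID " + pmid);
        }
        return response.body();
    }
}
```

One client instance can be reused for all 800,000 requests, so connections are not set up from scratch each time.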
