Extract full content from xml using Tika

I want to extract the complete content of an XML file using Tika. This means that Tika should not just extract the text from the elements and discard the tags.

The content output should look like this:

content:
<?xml version="1.0" encoding="UTF-8" ?>
<xml>
    <tag1>text</tag1>
    <tag2>text</tag2>
</xml>

      

But the result is always like this:

content: 





     text
     text

      

Program code:

public static void main(String[] args) {
    try {
        InputStream input;

        input = new FileInputStream(new File("D:/SolrTestFileSystem/Test_Files/test.xml"));

        ContentHandler textHandler = new WriteOutContentHandler();
        Metadata metadata = new Metadata();
        XMLParser parser = new XMLParser();
        ParseContext context = new ParseContext();
        parser.parse(input, textHandler, metadata, context);
        input.close();
        System.out.println("content: " + textHandler.toString());
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

      

Xml file:

<?xml version="1.0" encoding="UTF-8" ?>
<xml>
    <tag1>text</tag1>
    <tag2>text</tag2>
</xml>

      

1 answer


Your problem is that you are using a text content handler to capture the content. If you want to keep the XML tags, you need to use a content handler that preserves them!

(The fact that your content handler is named textHandler is a hint that the example you took it from was for plain text!)



As shown in the Apache Tika examples for text and XHTML/XML extraction, your code should be:

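import java.io.File;
import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.xml.XMLParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

InputStream input = TikaInputStream.get(new File("D:/SolrTestFileSystem/Test_Files/test.xml"));
// ToXMLContentHandler serializes the SAX events as XML, so the tags are kept
ContentHandler handler = new ToXMLContentHandler();

Metadata metadata = new Metadata();
XMLParser parser = new XMLParser();
ParseContext context = new ParseContext();
parser.parse(input, handler, metadata, context);

input.close();
System.out.println("content: " + handler.toString());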
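
To see why the choice of handler decides whether the tags survive, here is a minimal sketch that uses only the JDK's built-in SAX parser. It is an illustration of the general mechanism, not Tika's own implementation, and the class names (SaxHandlerDemo, TextOnlyHandler, TagKeepingHandler) are hypothetical. The parser reports the same events in both cases; one handler only records character data, the other also writes the element tags back out.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxHandlerDemo {

    /** Records only character data, the way a plain-text handler behaves. */
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Additionally writes start and end tags (attributes and escaping omitted for brevity). */
    static class TagKeepingHandler extends TextOnlyHandler {
        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }
    }

    /** Parses the given XML string and returns whatever the handler collected. */
    static String parseWith(TextOnlyHandler handler, String xml) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<xml><tag1>text</tag1><tag2>text</tag2></xml>";
        System.out.println("text only: " + parseWith(new TextOnlyHandler(), xml));
        System.out.println("with tags: " + parseWith(new TagKeepingHandler(), xml));
    }
}
```

With this input, the first handler collects just the two text nodes, while the second one reproduces the element structure. Swapping WriteOutContentHandler for ToXMLContentHandler in the Tika code is the same kind of change.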

      
