Horrible performance parsing an XHTML file with a DOCTYPE as an XML document

When I parse this XHTML file as XML, it takes about 2 minutes, even though the file is very simple. I found that if I remove the DOCTYPE declaration, it parses almost instantly. What makes this file take so long to parse?

Java example

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );

      

XHTML example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ex="http://www.example.com/schema/v1_0_0">
    <head><title>Test</title></head>
    <body>
        <h1>Test</h1>
        <p>Hello, World!</p>
        <p><ex:test>Text</ex:test></p>
    </body>
</html>

      

Thanks

Edit: solution

Based on the answers explaining why this happens in the first place, I fixed the issue with the following basic steps:

  • Downloaded the DTD-related files to the src/main/resources folder
  • Created a custom EntityResolver to read these files from the classpath
  • Told my DocumentBuilder to use my new EntityResolver

I used this SO answer as a guide: How do I validate XML using Java?

New EntityResolver

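import java.io.IOException;
import java.io.InputStream;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class LocalXhtmlDtdEntityResolver implements EntityResolver {

    /**
     * Serves DTD and entity files from the classpath, matching on the last
     * path segment of the system ID (for example xhtml1-transitional.dtd).
     */
    @Override
    public InputSource resolveEntity( String publicId, String systemId )
            throws SAXException, IOException {
        String fileName = systemId.substring( systemId.lastIndexOf( "/" ) + 1 );
        InputStream stream = getClass().getClassLoader().getResourceAsStream( fileName );
        if ( stream == null ) {
            // No local copy: fail fast rather than hand the parser an empty source
            throw new IOException( "No local copy of " + systemId + " on the classpath" );
        }
        InputSource source = new InputSource( stream );
        source.setSystemId( systemId ); // keeps error messages pointing at the original ID
        return source;
    }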

}

      

How to use the new EntityResolver:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
bob.setEntityResolver( new LocalXhtmlDtdEntityResolver() );
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );

      



2 answers


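Java downloads the specified DTD and the files it includes whenever it parses your XHTML file. Validation is off by default, but the parser still reads the external DTD because it declares the entities and attribute defaults the document may use. Using Charles proxy, I recorded the following requests, which downloaded the indicated amounts: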
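The same requests can also be observed from inside the JVM, without a proxy. As a rough sketch (the class and method names here are hypothetical, not from the question), an EntityResolver can record each system ID the parser asks for and answer with an empty entity, so nothing is fetched over the network. Against the real DTD the parser would go on to request the entity files that the DTD includes; here the list stops at the DTD itself because the stand-in DTD is empty.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;

// Hypothetical helper: lists the external entities a parse would try to load.
public class DtdRequestLogger {

    /** Parses the XML text offline and returns every system ID the parser requested. */
    public static List<String> requestedSystemIds( String xml ) throws Exception {
        List<String> requested = new ArrayList<>();
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware( true );
        DocumentBuilder builder = dbf.newDocumentBuilder();
        builder.setEntityResolver( ( publicId, systemId ) -> {
            requested.add( systemId );
            // Answer with an empty entity so the parser never opens a connection
            return new InputSource( new StringReader( "" ) );
        } );
        builder.parse( new InputSource( new StringReader( xml ) ) );
        return requested;
    }

    public static void main( String[] args ) throws Exception {
        String xml =
              "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>Test</title></head><body><p>Hello, World!</p></body></html>";
        System.out.println( requestedSystemIds( xml ) );
    }
}
```

Because the empty entity declares nothing, this only works for documents that do not use named entities such as &nbsp;. It is meant for diagnosis, not as a replacement for a local copy of the DTD.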





In fact, you're lucky to get any documents at all. The W3C is deliberately slow to respond to requests for these documents because it cannot handle the volume of requests. You need to provide the parser with a local copy.

The usual way to do this in the Java world is by using the Apache/OASIS catalog resolver.
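As a minimal sketch of the catalog approach, the example below uses the JDK's own javax.xml.catalog API (Java 9 and later) rather than the Apache resolver library named above. Both read OASIS XML Catalog files. The class name, the method name, and the catalog file location are assumptions made for illustration.

```java
import java.nio.file.Path;

import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper: builds a DOM parser that resolves DTDs through an OASIS catalog.
public class CatalogParserFactory {

    /**
     * The catalog file is assumed to map the XHTML public or system IDs to
     * local files, for example:
     *
     *   <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
     *     <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
     *             uri="xhtml1-transitional.dtd"/>
     *   </catalog>
     *
     * Relative uri values are resolved against the catalog file's location.
     */
    public static DocumentBuilder newBuilder( Path catalogFile ) throws Exception {
        // With default features, an ID that has no catalog entry raises an
        // exception instead of falling back to the network.
        CatalogResolver resolver = CatalogManager.catalogResolver(
                CatalogFeatures.defaults(), catalogFile.toUri() );

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware( true );
        DocumentBuilder builder = dbf.newDocumentBuilder();
        builder.setEntityResolver( resolver ); // CatalogResolver is an EntityResolver
        return builder;
    }
}
```

The catalog needs an entry for every file the DTD pulls in, including its entity files, since the default strict resolution rejects anything it cannot map.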



The latest version of Saxon has built-in knowledge of these commonly used DTDs and entity files, and if you allow Saxon to supply your XML parser, it will automatically be configured to use local copies. You can probably take advantage of this even if you're not using XSLT or XQuery to process the data: just create a Saxon Configuration object and call its getSourceParser() method to get an XMLReader.

(This might be a good time to get away from the DOM. Of all the many options for handling XML in Java, the DOM is probably the worst. If you must use a low-level navigation API, pick a decent one like JDOM or XOM.)







