How to preserve case in jsoup parsing?

I am using jsoup to parse some HTML content. After parsing, it converts camel-case attribute names to lowercase: for example, <svg viewBox='XXXX'> becomes <svg viewbox='XXXX'>.

Can someone please tell me how I can preserve the original case when parsing HTML content with jsoup 1.8.1?

2 answers


I just released jsoup 1.10.1, which adds support for preserving the case of tags and/or attributes. You control it with ParseSettings. By default, the HTML parser still normalizes tag and attribute names to lowercase, and the XML parser preserves their case. You can specify these settings when creating a parser.

To use the XML parser (which preserves case by default):

Document doc = Jsoup.parse(xml, baseUrl, Parser.xmlParser());

      



To use the HTML parser and configure it to preserve case:

Parser parser = Parser.htmlParser();
parser.settings(new ParseSettings(true, true)); // preserve tag case, preserve attribute case
Document doc = parser.parseInput(html, baseUrl);
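For anyone who wants to try this end to end, here is a minimal sketch of the HTML parser variant as a complete class. It assumes jsoup 1.10.1 or later is on the classpath; the class name PreserveCaseDemo, the helper parsePreservingCase, and the sample SVG markup are made up for illustration and are not part of jsoup.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

// Hypothetical demo class; assumes jsoup 1.10.1+ is available.
public class PreserveCaseDemo {

    // Parses HTML with the HTML parser, keeping tag and attribute case as written.
    static Document parsePreservingCase(String html, String baseUri) {
        Parser parser = Parser.htmlParser();
        // ParseSettings(preserveTagCase, preserveAttributeCase)
        parser.settings(new ParseSettings(true, true));
        return parser.parseInput(html, baseUri);
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        String html = "<svg viewBox=\"0 0 10 10\"></svg>";
        Document doc = parsePreservingCase(html, "");

        // Serialize the parsed element to inspect the attribute name's case.
        Element svg = doc.select("svg").first();
        System.out.println(svg.outerHtml());
    }
}
```

If the settings took effect, the serialized element should still contain viewBox with a capital B.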

      


Preserving the case of attribute names while parsing a document is quite difficult. The line responsible for converting all attribute names to lowercase is TokeniserState.java, line 649, as of jsoup 1.8.2, and there is no hook for inserting custom code.

The most you can do is download the sources, change that line, and build your own copy of the library.



You should also consider whether leaving attribute names in their original case would cause strange behavior elsewhere. It might cause problems with getElementsByAttribute, or with other functions that depend on lowercase names.
