Jsoup parsing and nested tags
I am learning Jsoup and have this HTML:
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
</p>
[...]
I am using Jsoup.parse() and document.select("p") to get "Content", and that works well. But...
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
<p style="..."></p>
<p style="..."></p>
</p>
[...]
In this case, I can see that Jsoup.parse() converts this code to:
[...]
<p style="..."> <!-- div 1 -->
Content
</p>
<p style="..."> <!-- div 2 -->
Content
</p>
<p style="..."> <!-- div 3 -->
Content
</p>
<p style="..."> <!-- div 4 -->
</p>
<p style="..."> <!-- div 5 -->
</p>
[...]
How can I preserve the nesting of the paragraphs with Jsoup (div 4 and 5 inside div 3)?
Added example:
HTML file:
<html>
<head>
<title>Title</title>
</head>
<body>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
</p>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
</p>
<p style="margin-left:2em">
<span class="one">Text</span>
<span class="two"><span class="nest">Text</span></span>
<span class="three"></span>
<p style="margin-left:2em"></p>
<p style="margin-left:2em"></p>
</p>
</body>
</html>
Java code:
// URL_with_HTML is the address that serves the HTML file above
Document doc = Jsoup.connect(URL_with_HTML).get();
System.out.println(doc.outerHtml());
Output:
<html>
<head>
<title>Title</title>
</head>
<body>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
<p style="margin-left:2em"></p>
<p style="margin-left:2em"></p>
<p></p>
</body>
</html>
Is this correct? I am using Jsoup 1.6.1. I expected Jsoup to return the nested paragraphs rather than the output above.
Nested paragraphs do not exist in HTML. The previous paragraph is closed automatically because Jsoup implements the WHATWG HTML5 specification:
- An open p element is automatically closed by the start tag of any of the following: address, article, aside, blockquote, div, dl, fieldset, footer, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, section, table, or ul. Therefore <p><div></div> becomes <p></p><div></div>.
- An end tag named p (i.e. </p>) that does not have a matching start tag is a parse error and is treated as <p></p>. Therefore <span></span></p> becomes <span></span><p></p>.
So jsoup is correct and your HTML is not valid.
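The two rules above can be checked directly. The following is a minimal sketch, assuming jsoup is on the classpath; the class and method names are made up for illustration. It parses a condensed version of the third paragraph from the question and counts the resulting p elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParagraphAutoClose {
    // Number of <p> elements in the tree built by jsoup's HTML parser.
    static int countParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    // True if any <p> ended up as a descendant of another <p>.
    static boolean hasNestedParagraph(String html) {
        Document doc = Jsoup.parse(html);
        return !doc.select("p p").isEmpty();
    }

    public static void main(String[] args) {
        // Condensed form of the third paragraph in the question.
        String html = "<p>Content<p></p><p></p></p>";
        System.out.println(countParagraphs(html));
        System.out.println(hasNestedParagraph(html));
        // Show the flattened markup that the parser produced.
        System.out.println(Jsoup.parse(html).body().html());
    }
}
```

For this input the count should be 4 (the three opened paragraphs plus the empty one created by the stray end tag) and no paragraph is nested in another, which matches the output shown in the question.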
Note that your HTML is invalid because you have too many </p> end tags, not because the paragraphs are "nested". Nesting cannot happen because paragraphs close automatically. The trailing </p> is then a stray end tag, because the "corresponding" <p> was already closed automatically.
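If the goal is to keep the structure exactly as written rather than as HTML5 defines it, one possible workaround is jsoup's XML parser, which does not apply the HTML auto-closing rules. This is only a sketch under assumptions: it needs a jsoup version that provides Parser.xmlParser() (newer than the 1.6.1 mentioned in the question), the class and method names are hypothetical, and the markup has to be well-formed enough to be read as XML.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class KeepNestedParagraphs {
    // Parse with the XML parser so that no HTML auto-closing rules are applied.
    static Document parseAsWritten(String markup) {
        return Jsoup.parse(markup, "", Parser.xmlParser());
    }

    // Number of <p> elements that are direct children of another <p>.
    static int countDirectlyNested(String markup) {
        return parseAsWritten(markup).select("p > p").size();
    }

    public static void main(String[] args) {
        String markup = "<p>Content<p></p><p></p></p>";
        System.out.println(countDirectlyNested(markup));
        // Print the tree to confirm the inner paragraphs stayed inside the outer one.
        System.out.println(parseAsWritten(markup).outerHtml());
    }
}
```

The trade-off is that the result no longer reflects how a browser would interpret the page, and unclosed tags such as a bare <br> may nest differently than expected.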
Hi, I understand the original question, but I think this is Jsoup's fault (not yours). Here is a simple example:
<html>
<head></head>
<body>
<p></p>
<p>
<div></div>
</p>
</body>
</html>
But Jsoup parses it as:
<html>
<head></head>
<body>
<p></p>
<p></p>
<div></div>
<p></p>
</body>
</html>
If you can, please report this bug so the author can fix it :-)
PS: Does Stack Overflow not allow even the word "hello"?