Jsoup parsing and nested tags

I am learning Jsoup and have this HTML:

 [...]
 <p style="..."> <!-- div 1 -->
   Content
 </p>
 <p style="..."> <!-- div 2 -->
   Content
 </p>
 <p style="..."> <!-- div 3 -->
   Content
 </p>
 [...]

I am using Jsoup.parse() and document.select("p") to grab "Content", and it works well. But with this HTML:

 [...]
 <p style="..."> <!-- div 1 -->
   Content
 </p>
 <p style="..."> <!-- div 2 -->
   Content
 </p>
 <p style="..."> <!-- div 3 -->
   Content
   <p style="..."></p>
   <p style="..."></p>
 </p>
 [...]

In this case, I can see that Jsoup.parse() converts the code to:

 [...]
 <p style="..."> <!-- div 1 -->
   Content
 </p>
 <p style="..."> <!-- div 2 -->
   Content
 </p>
 <p style="..."> <!-- div 3 -->
   Content
 </p>
 <p style="..."> <!-- div 4 -->
 </p>
 <p style="..."> <!-- div 5 -->
 </p>
 [...]

How can I keep the nesting of the paragraphs using Jsoup (div 4 and 5 inside div 3)?


Added example:

HTML file:

 <html>
 <head>
    <title>Title</title>
 </head>
 <body>
    <p style="margin-left:2em">
            <span class="one">Text</span>
            <span class="two"><span class="nest">Text</span></span>
            <span class="three"></span>
    </p>
    <p style="margin-left:2em">
            <span class="one">Text</span>
            <span class="two"><span class="nest">Text</span></span>
            <span class="three"></span>
    </p>
    <p style="margin-left:2em">
            <span class="one">Text</span>
            <span class="two"><span class="nest">Text</span></span>
            <span class="three"></span>
            <p style="margin-left:2em"></p>
            <p style="margin-left:2em"></p>
    </p>

 </body>
 </html>

Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect(URL_with_HTML).get();
System.out.println(doc.outerHtml());

Output:

<html>
<head> 
 <title>Title</title> 
</head> 
<body> 
 <p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p> 
 <p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p> 
 <p style="margin-left:2em"> <span class="one">Text</span> <span class="two"><span class="nest">Text</span></span> <span class="three"></span> </p>
 <p style="margin-left:2em"></p> 
 <p style="margin-left:2em"></p> 
 <p></p>   
</body>
</html>

Is this correct? I am using Jsoup 1.6.1. I expected Jsoup to return nested paragraphs rather than the output above.

2 answers


Nested paragraphs do not exist in HTML. The previous paragraph is closed automatically because Jsoup implements the WHATWG HTML5 specification:

  • An open p element is closed automatically by the start tag of any of the following: address, article, aside, blockquote, div, dl, fieldset, footer, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, section, table or ul. Therefore <p><div></div> becomes <p></p><div></div>.
  • An end tag named p (i.e. </p>) that has no matching start tag is a parse error and is handled as if a <p> start tag came right before it, which produces an empty paragraph. Therefore <span></span></p> becomes <span></span><p></p>.


So jsoup is correct and your HTML is not valid.

Note that your HTML is invalid because it has too many </p> end tags, not because the paragraphs are "nested". Nesting cannot happen because paragraphs close automatically. The later </p> is left unmatched because its "corresponding" <p> was already closed automatically earlier.
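
As a rough illustration of those two rules, here is a toy sketch in plain Java. It is not Jsoup's implementation and models only a tiny part of the HTML5 tree builder (no comments, no scoping rules, no special handling of formatting elements). The class and method names (ParagraphAutoClose, normalize, closeThrough) are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a simplified model of how <p> is auto-closed.
final class ParagraphAutoClose {
    // Start tags that implicitly close an open <p> (list from the answer above).
    private static final Set<String> CLOSES_P = Set.of(
        "address", "article", "aside", "blockquote", "div", "dl", "fieldset",
        "footer", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr",
        "main", "menu", "nav", "ol", "p", "pre", "section", "table", "ul");

    // Elements that never have an end tag, so they are not kept open.
    private static final Set<String> VOID = Set.of("hr", "br", "img", "input");

    // A start or end tag (group 1 = "/" for end tags, group 2 = name), or a text run.
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+");

    static String normalize(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // stack of open element names
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String name = m.group(2);
            if (name == null) {                    // text run: copy unchanged
                out.append(m.group());
                continue;
            }
            name = name.toLowerCase();
            boolean isEnd = !m.group(1).isEmpty();
            if (!isEnd) {
                // Rule 1: these start tags close an open <p> first.
                if (CLOSES_P.contains(name) && open.contains("p")) {
                    closeThrough("p", open, out);
                }
                out.append(m.group());
                if (!VOID.contains(name)) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                closeThrough(name, open, out);     // ordinary matching end tag
            } else if (name.equals("p")) {
                out.append("<p></p>");             // Rule 2: stray </p> yields an empty paragraph
            }
            // Any other unmatched end tag is simply dropped in this sketch.
        }
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    // Pops and closes elements until the named one has been closed.
    private static void closeThrough(String name, Deque<String> open, StringBuilder out) {
        String popped;
        do {
            popped = open.pop();
            out.append("</").append(popped).append('>');
        } while (!popped.equals(name));
    }

    public static void main(String[] args) {
        // prints <p>Content</p><p></p><p></p><p></p>
        System.out.println(normalize("<p>Content<p></p><p></p></p>"));
    }
}
```

Feeding it the shape of the markup from the question gives three sibling paragraphs plus one extra empty paragraph for the stray </p>, which is the same structure Jsoup reported.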


Hi, I understand the original question, but I think this is Jsoup's fault (not yours). Take this simple example:

<html>
    <head></head>
    <body>
        <p></p>
        <p>
            <div></div>
        </p>
    </body>
</html>

But Jsoup parses it as:



<html>
    <head></head>
    <body>
        <p></p>
        <p></p>
        <div></div>
        <p></p>
    </body>
</html>

If you can, please report this as a bug so the author can fix it :-)

PS: Does Stack Overflow not allow a post to start with just the word "Hello"?
