BeautifulSoup (bs4) parsing wrong

Question


I am parsing this sample document with bs4, from Python 2.7.6:

<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>

Using:

from bs4 import BeautifulSoup as BS
...
tree = BS(fh)

HTML has long allowed end-tags to be omitted for various element types, including P (check the schema or a validating parser). However, calling bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it sees the </body>:

<html>
 <body>
  <p>
   HTML allows omitting P end-tags.
   <p>
    Like that and this.
    <p>
     And this, too.
     <p>
      What happened?
     </p>
     <p>
      And can we
      <p>
       nest a paragraph, too?
      </p>
     </p>
    </p>
   </p>
  </p>
 </body>

This is not prettify()'s fault, because when I traverse the tree manually I get the same structure:

<[document]>
    <html>
        ␊
        <body>
            ␊
            <p>
                HTML allows omitting P end-tags.␊␊
                <p>
                    Like that and this.␊␊
                    <p>
                        And this, too.␊␊
                        <p>
                            What happened?
                        </p>
                        ␊
                        <p>
                            And can we 
                            <p>
                                nest a paragraph, too?
                            </p>
                        </p>
                        ␊
                    </p>
                </p>
            </p>
        </body>
        ␊
    </html>
    ␊
</[document]>

Now, this would be the right result for XML (at least up to </body>, at which point it should report a well-formedness (WF) error). But this is not XML. What gives?


python html python-2.7 bs4

TextGeek, Apr 29 '15 at 20:45


1 answer

TextGeek · Accepted Answer · 2015-05-06T17:28:46+0000

The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser explains how to get BS4 to use different parsers. Apparently the default is html.parser, which the BS4 doc says is broken before Python 2.7.3, but which apparently still has the problem described above in 2.7.6.
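To see where the nesting may come from, it can help to look at the raw events the standard-library parser reports. The following is only a sketch, written for Python 3 (where the module is html.parser; on Python 2.7 it is HTMLParser), and the names TagLogger and tag_events are made up for illustration. It shows that the tokenizer reports only the tags literally present in the input and never synthesizes an end-tag for an open P:

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: record tag events exactly as they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to the stdlib parser and return the list of tag events."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# Two <p> start-tags but only one </p> in the input.
for event in tag_events("<p>one<p>two</p>"):
    print(event)
```

Since no implied end-tag event ever arrives, a tree builder sitting on top of these events has to decide for itself when a paragraph ends; if it does not, each new P simply becomes a child of the one still open.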

Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces the correct result:

tree = BS(htmSource, "html5lib")
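A quick way to check which parsers show the problem on a given installation is to count the P elements that end up inside another P. This is a rough sketch rather than a definitive test: SAMPLE and nested_p_count are names invented here, the code assumes Python 3 with bs4 installed, and any parser that is not installed is reported as None instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened stand-in for the sample document: end-tags omitted on two <p>s.
SAMPLE = "<html><body><p>one\n<p>two\n<p>three</p></body></html>"


def nested_p_count(markup, parser):
    """Count <p> elements that have a <p> ancestor.

    Returns None when the requested parser is not installed.
    """
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    return sum(1 for p in soup.find_all("p") if p.find_parent("p") is not None)


for name in ("html.parser", "html5lib", "lxml"):
    print(name, nested_p_count(SAMPLE, name))
```

A count of 0 means every paragraph was closed before the next one started, which is what the HTML rules call for. Naming the parser explicitly, as in the call above, also avoids depending on whichever parser happens to be installed.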
