encode_contents() vs encode("utf-8") in Python BeautifulSoup

Question

OK, as a beginner web scraper, I feel like I've seen both used seemingly interchangeably when converting the default Unicode to HTML. I know .contents is a list object, but other than that, what's the difference?

I've noticed that .encode("utf-8") works more universally.

Thanks,

-confusion soup.

+3

python beautifulsoup encode

SpicyClubSauce May 21 '15 at 5:37 am

2 answers

The method signature for encode_contents() shows that, in addition to encoding the content, it can also format the output:

encode_contents(self, indent_level=None, encoding='utf-8', formatter='minimal') method of bs4.BeautifulSoup instance
    Renders the contents of this tag as a bytestring.

For example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><p>Caf\xe9</p></body></html>', 'html.parser')
>>> soup.encode_contents()
'<html><body><p>Caf\xc3\xa9</p></body></html>'
>>> soup.encode_contents(indent_level=1)
'<html>\n <body>\n  <p>\n   Caf\xc3\xa9\n  </p>\n </body>\n</html>'
>>> soup.encode_contents(indent_level=1, encoding='iso-8859-1')
'<html>\n <body>\n  <p>\n   Caf\xe9\n  </p>\n </body>\n</html>'

str.encode('utf-8'), by contrast, only performs the encoding step; it does no formatting.
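To make the contrast concrete, here is a minimal sketch, assuming Python 3 and a reasonably recent bs4 with the built-in html.parser. Under Python 3 the results are bytes objects, so they print with a b prefix, unlike the Python 2 session above. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup("<p>Caf\u00e9</p>", "html.parser")

# encode_contents() renders the markup and encodes it in one call.
utf8_bytes = soup.encode_contents()                          # UTF-8 by default
latin1_bytes = soup.encode_contents(encoding="iso-8859-1")   # another encoding
pretty_bytes = soup.encode_contents(indent_level=1)          # pretty-printed

# str.encode() only encodes; str(soup) has to render the markup first.
plain_bytes = str(soup).encode("utf-8")

print(utf8_bytes)    # b'<p>Caf\xc3\xa9</p>'
print(latin1_bytes)  # b'<p>Caf\xe9</p>'
print(pretty_bytes)
print(plain_bytes == utf8_bytes)
```

With the defaults the two approaches should produce the same bytes; encode_contents() only becomes necessary once indentation or a custom formatter is wanted.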

+1

mhawke May 21 '15 at 6:07

salman wahed · Accepted Answer · 2015-05-21T06:22:54+0000

Documentation for encode_contents:

encode_contents(self, indent_level=None, encoding='utf-8', formatter='minimal') method of bs4.BeautifulSoup instance
    Renders the contents of this tag as a bytestring.

Documentation for the encode method:

encode(self, encoding='utf-8', indent_level=None, formatter='minimal', errors='xmlcharrefreplace')

The encode method works on the bs4.BeautifulSoup instance (or tag) itself, so the output includes the tag's own opening and closing markup. encode_contents works only on the contents of that instance, so the output is whatever sits inside the tag.

>>> from bs4 import BeautifulSoup
>>> html = "<div>div content <p> a paragraph </p></div>"
>>> soup = BeautifulSoup(html, "html.parser")
>>> soup.div.encode()
'<div>div content <p> a paragraph </p></div>'
>>> soup.div.contents
[u'div content ', <p> a paragraph </p>]
>>> soup.div.encode_contents()
'div content <p> a paragraph </p>'
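The same comparison can be written as a small script. This is only a sketch, assuming Python 3 and bs4 with html.parser, where both methods return bytes. The second snippet (soup2 and the Café markup are made up for illustration) shows the errors='xmlcharrefreplace' default from the encode signature above.

```python
from bs4 import BeautifulSoup

html = "<div>div content <p> a paragraph </p></div>"
soup = BeautifulSoup(html, "html.parser")

whole_tag = soup.div.encode()            # the <div> element itself, as bytes
inner_only = soup.div.encode_contents()  # only what is inside the <div>

print(whole_tag)   # b'<div>div content <p> a paragraph </p></div>'
print(inner_only)  # b'div content <p> a paragraph </p>'

# encode() wraps the same inner markup in the tag's own open/close tags.
print(whole_tag == b"<div>" + inner_only + b"</div>")

# encode() defaults to errors='xmlcharrefreplace', so a character the target
# encoding cannot represent becomes a numeric character reference.
soup2 = BeautifulSoup("<p>Caf\u00e9</p>", "html.parser")
ascii_bytes = soup2.p.encode("ascii")
print(ascii_bytes)  # b'<p>Caf&#233;</p>'
```

If a str rather than bytes is wanted, decode() and decode_contents() are the text counterparts of these two methods.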
