Urlopen, BeautifulSoup and UTF-8
I'm just trying to fetch a webpage, but a foreign character is somehow embedded in the HTML. The character is not displayed when I use View Source.
import urllib2
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # This line causes the error.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)
I also tried this:
html = BeautifulSoup(page.encode('utf-8'))
How can I read this webpage in BeautifulSoup without getting this error?
2 answers
This error probably happens when you try to print the representation of a BeautifulSoup object, which happens automatically if, as I suspect, you are running in an interactive console.
# This works fine. Note that we assign the BeautifulSoup object to a name,
# which stops the interactive console from printing it immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')
# This will probably show the error you saw
print soup
# And this would probably be fine
print soup.encode('utf-8')
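
To see the underlying problem without BeautifulSoup involved, here is a minimal sketch. It assumes your console is falling back to the 'ascii' codec, as the traceback suggests; the helper name can_encode is made up for illustration. The character u'\xa0' is a non-breaking space, which has no ASCII equivalent, so implicit encoding fails while an explicit UTF-8 encode succeeds.

```python
# U+00A0 (non-breaking space) is the character named in the traceback.
text = u'\xa0'


def can_encode(s, encoding):
    """Return True if the unicode string s can be encoded with the given codec.

    Hypothetical helper for illustration only.
    """
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# 'ascii' is the codec the traceback complains about.
ascii_ok = can_encode(text, 'ascii')
utf8_ok = can_encode(text, 'utf-8')

# Encoding explicitly yields a byte string that can be printed or written out.
utf8_bytes = text.encode('utf-8')
```

Here ascii_ok comes out False and utf8_ok comes out True, which is why encoding explicitly before printing should avoid the error.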