Indian Language Handling in BeautifulSoup

Question


I am trying to scrape the NDTV site for news titles. This is the page I'm using as my HTML source. I am using BeautifulSoup (bs4) to parse the HTML, and everything works except when my code comes across Hindi titles on the page I linked to.

My code so far:

import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll( 'li')
for link_tag in li:
   hypref = link_tag.find('a').contents[0]
   strhyp = str(hypref)
   fptr.write(strhyp)
   fptr.write("\n")

The error I am getting:

Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 30, in <module>
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

I get the same error even if I leave out the from_encoding option. I originally wrote it as fromEncoding, but Python warned me that this usage is deprecated.

How do I fix this? From what I've read, I need to either skip the Hindi titles or explicitly encode them as non-ASCII text, but I don't know how. Any help would be greatly appreciated!

+3

python web-scraping beautifulsoup

Kitchi 19 jan. '13 at 9:24


2 answers

strhyp = hypref.encode('utf-8')

http://joelonsoftware.com/articles/Unicode.html
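The one-liner above can be tried in isolation. The following is only a minimal sketch: a plain text string stands in for the NavigableString from the question, and the sample title is a made-up value, not taken from the NDTV page. It shows that encoding to ASCII (which is what str() attempts implicitly on Python 2) fails, while encoding to UTF-8 succeeds.

```python
# -*- coding: utf-8 -*-
# Assumption: a plain text string stands in for the NavigableString
# that link_tag.find('a').contents[0] returns in the question.
title = u"\u0939\u093f\u0928\u094d\u0926\u0940"  # hypothetical six-character Hindi title

# Encoding to ASCII fails for Devanagari characters. On Python 2,
# str(title) attempts this same conversion behind the scenes.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 works and yields a byte string that can be
# written to a file opened in binary mode.
encoded = title.encode("utf-8")
```

Each Devanagari character takes three bytes in UTF-8, so the encoded value is longer than the original text string.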

+1

Pavel Anossov 19 jan. '13 at 9:28


Andreas Jung · Accepted Answer · 2013-01-19T09:32:50+0000

What you see is a NavigableString instance (which derives from Python's unicode type):

(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)

You need to encode it to UTF-8 with:

hypref.encode('utf-8')
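As an alternative to calling encode on every string, the output file can be opened with an explicit encoding so that text is encoded on write. This is a sketch under assumptions rather than part of the answer: write_titles, sample_titles, and sample_path are hypothetical names, and plain text strings stand in for the NavigableString values. It relies on io.open from the standard library, which accepts an encoding argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def write_titles(titles, path):
    """Write one title per line to path, encoded as UTF-8 (hypothetical helper)."""
    # io.open encodes text as it is written, so no explicit .encode() is needed.
    with io.open(path, "w", encoding="utf-8") as fptr:
        for title in titles:
            fptr.write(title)
            fptr.write(u"\n")


# Hypothetical sample data standing in for the scraped link texts.
sample_titles = [u"NDTV", u"\u0939\u093f\u0928\u094d\u0926\u0940"]
sample_path = os.path.join(tempfile.mkdtemp(), "NDTV_sample.txt")
write_titles(sample_titles, sample_path)
```

With this approach the loop in the question would pass hypref to fptr.write directly instead of wrapping it in str().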
