Python: searching for Unicode string in HTML with index / find returns wrong position

I am trying to parse the number of results from the HTML returned by a search query, but when I use find()/index() it returns the wrong position. The string I'm looking for contains an accented character, so I search for it as a Unicode literal.

The relevant HTML snippet:

<div id="WPaging_total">
  Aproximádamente 37 resultados.
</div>


and I'm looking for it like this:

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)  # len('Aproxim\xe1damente ') == 16
print html[str_start+16:str_end]  # works by changing 16 to 24


The print statement outputs:

damente 37


When the expected result is:

37


It seems str_start does not point at the beginning of the string I am looking for, but 8 positions before it:

print html[str_start:str_start+5]


Outputs:

l">


The problem is difficult to reproduce: it doesn't happen when searching a small test string, only when searching the entire HTML page. I could just change str_start + 16 to str_start + 24 to make it work as intended, but that doesn't help me understand the problem. Is this a Unicode issue? Hopefully someone can shed some light on it.

Thanks.

LINK: http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1

SAMPLE CODE:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read()

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end]




2 answers


Your problem ultimately boils down to the fact that in Python 2.x the str type represents a sequence of bytes, while the unicode type represents a sequence of characters. Since a single character can be encoded as multiple bytes, the length of the unicode representation of a string can differ from the length of the str representation of the same string, and, in the same way, an index into the unicode representation can point to a different piece of text than the same index into the str representation.

What happens when you do str_start = html.index(u'Aproxim\xe1damente ') is that Python automatically decodes the variable html, assuming it is utf-8 encoded. (Well, actually, on my PC I just get a UnicodeDecodeError when I try to execute that line; some system setting related to the default text encoding must differ between us.) Hence, if str_start is n, it means that u'Aproxim\xe1damente ' starts at the n-th character of the HTML. However, when you later use it as a slice index to grab the content after the (n + 16)-th character, what you actually get is the content after the (n + 16)-th byte, which in this case is not the same thing, because the earlier content of the page contains accented characters that each take 2 bytes when encoded in utf-8.



The best solution would be to just convert the html to unicode when you get it. This small modification to your sample code will do what you want, without errors or strange behavior:

from urllib2 import Request, urlopen

url = 'http://guiasamarillas.com.mx/buscador/?actividad=Chedraui&localidad=&id_page=1'
post = None
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)'}          
req = Request(url, post, headers)
conn = urlopen(req)

html = conn.read().decode('utf-8')

str_start = html.index(u'Aproxim\xe1damente ')
str_end = html.find(' resultados', str_start + 16)
print html[str_start+16:str_end] 




It's not entirely clear what you are trying to do, but if I'm right in guessing that you want to extract the approximate number of results from the HTML, you are probably better off using the re regular expression module.

import re

# html must already be decoded to unicode, as in the answer above
re.search(ur'(?<=Aproxim\xe1damente )\d+', html).group(0)

# returns:
#   u'37'




Ultimately, though, your best bet for parsing HTML is a package like lxml or BeautifulSoup, but without additional context I can't give you more specific help with those.
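As a rough sketch of that approach: lxml.etree mirrors the standard library's ElementTree API, so the idea can be shown with stdlib-only code against the fragment from the question (a real page would need a forgiving parser such as lxml.html or BeautifulSoup):

```python
import re
import xml.etree.ElementTree as ET

# The well-formed fragment from the question's HTML.
snippet = u'<div id="WPaging_total">\n  Aproxim\xe1damente 37 resultados.\n</div>'

# Parse it and pull the first run of digits out of the div's text.
div = ET.fromstring(snippet)
count = int(re.search(r'\d+', div.text).group(0))  # 37
```

Working on parsed text rather than raw bytes sidesteps the byte/character offset problem entirely.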
