Python URL decode %E3

I am getting Wikipedia URLs from a Freebase dump:

url 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa

url 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa

Both link to the same Wikipedia page:

url 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa

Calling urllib.unquote twice works on url 1:

import urllib

url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)  # first pass: %25C3%25A3 -> %C3%A3
url = urllib.unquote(url)  # second pass: %C3%A3 -> UTF-8 bytes for "ã"
print url

      

result

Pedro_Miguel_de_Castro_Brandão_Costa

      

but it doesn't work on url 2:

url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)
print url

      

result

Pedro_Miguel_de_Castro_Brand o_Costa    

      

Is something wrong?



1 answer


The first is UTF-8 that has been percent-quoted twice; once unquoted, it prints fine because your terminal uses UTF-8. The second is percent-quoted Latin-1, so after unquoting you still have to decode the bytes as Latin-1 before printing.



>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa
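
If the dump mixes both kinds of URL, one option is to unquote to raw bytes, try UTF-8 first, and fall back to Latin-1 only when that fails. The following is a rough sketch rather than a definitive fix. It is written for Python 3, where unquoting lives in urllib.parse, and the helper name decode_title and its max_rounds parameter are made up for illustration. It also assumes the titles contain no literal percent signs, since those would be unquoted one round too many.

```python
from urllib.parse import unquote_to_bytes


def decode_title(quoted, max_rounds=2):
    """Percent-decode a title, then decode the bytes as UTF-8 or Latin-1.

    Hypothetical helper: the name and max_rounds are illustrative only.
    """
    raw = quoted.encode('utf-8')
    # Undo up to max_rounds layers of percent-quoting (url 1 has two layers).
    for _ in range(max_rounds):
        if b'%' not in raw:
            break
        raw = unquote_to_bytes(raw)
    try:
        # Bytes such as \xc3\xa3 are valid UTF-8 and decode directly.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # A lone \xe3 is not valid UTF-8, so assume Latin-1 instead.
        return raw.decode('latin-1')


print(decode_title('Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'))
print(decode_title('Pedro_Miguel_de_Castro_Brand%E3o_Costa'))
```

Trying UTF-8 first is a heuristic: it works here because a Latin-1 byte like \xe3 followed by a plain ASCII letter is not a valid UTF-8 sequence, but some Latin-1 byte pairs can happen to form valid UTF-8 and would be misread.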

      
