Python url decode %E3

Question

I am getting Wikipedia URLs from a Freebase dump:

url 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa

url 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa

They both link to the same Wikipedia page:

url 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa

urllib.unquote works on url 1:

import urllib

url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
# The name is percent-encoded twice, so it has to be unquoted twice.
url = urllib.unquote(url)
url = urllib.unquote(url)
print url

result

Pedro_Miguel_de_Castro_Brandão_Costa

but doesn't work with url 2.

url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)
print url

result

Pedro_Miguel_de_Castro_Brand o_Costa

Is something wrong?

python encoding character-encoding urllib urldecode

icycandy Dec 19 '14 at 6:59

1 answer

Ignacio Vazquez-Abrams · Accepted Answer · 2014-12-19T07:07:28+0000

The first is doubly percent-encoded UTF-8, which prints fine after unquoting twice because your terminal uses UTF-8. The second is percent-encoded Latin-1, so the unquoted byte string must be decoded as Latin-1 before printing.

>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa
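
If the dump mixes both encodings, one way to handle them together is to unquote to raw bytes, try UTF-8 first, and fall back to Latin-1 only when the bytes are not valid UTF-8. The following is a minimal sketch under some assumptions: it is written for Python 3 (urllib.parse.unquote_to_bytes rather than the Python 2 urllib.unquote used above), the helper name decode_title and the max_rounds parameter are made up for illustration, and the Latin-1 fallback is a heuristic, not something the dump guarantees.

```python
from urllib.parse import unquote_to_bytes


def decode_title(quoted, max_rounds=3):
    """Hypothetical helper: percent-decode a title, then guess its charset."""
    raw = quoted.encode('utf-8')
    # Unquote repeatedly (bounded) to undo double quoting such as %25C3%25A3.
    for _ in range(max_rounds):
        unquoted = unquote_to_bytes(raw)
        if unquoted == raw:
            break  # nothing left to unquote
        raw = unquoted
    try:
        # Strict UTF-8 decoding rejects a lone Latin-1 byte such as \xe3.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin-1.
        return raw.decode('latin-1')


print(decode_title('Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'))
print(decode_title('Pedro_Miguel_de_Castro_Brand%E3o_Costa'))
```

Both calls should print the same name, with ã in place of the escape. Note that repeated unquoting would over-decode a title that legitimately contains a percent sign followed by two hex digits, so set max_rounds=1 if the input is known to be quoted only once.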
