Python URL decode %E3
I am getting Wikipedia URLs from a Freebase dump:
url 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa
url 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa
They both link to the same Wikipedia page:
url 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa
urllib.unquote works on url 1:
import urllib
url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)  # first pass: %25C3%25A3 -> %C3%A3
url = urllib.unquote(url)  # second pass: %C3%A3 -> UTF-8 bytes for "ã"
print url
Result:
Pedro_Miguel_de_Castro_Brandão_Costa
but it doesn't work on url 2:
url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)
print url
Result:
Pedro_Miguel_de_Castro_Brand o_Costa
Is something wrong?
1 answer
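The first is UTF-8 that has been percent-quoted twice; once it is unquoted twice it prints fine, since your terminal is using UTF-8. The second is percent-quoted Latin-1, so after unquoting you get the single byte \xe3, which has to be decoded as Latin-1 before printing.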
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa