DjangoUnicodeDecodeError: [Bad Unicode data]
Model:
class ItemType(models.Model):
    name = models.CharField(max_length=100)

    def __unicode__(self):
        logger.debug("1. Item Type %s created" % self.name)
        return self.name
Code:
(...)
name = re.search(r"Type:(.*?)", text)
itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})
logger.debug("2. Item Type %s created" % name.group(1))
logger.debug("4. Item Type %s created" % itemtype.name)
logger.debug("3. Item Type %s created" % itemtype)
The result is unexpected (to me, at least). The first logger.debug prints
Item Type ąęńłśóć created
as expected, but the second throws an error:
DjangoUnicodeDecodeError: 'ascii' codec can't decode byte in position :
ordinal not in range(128).
You passed in <ItemType: [Bad Unicode data]> (<class 'aaa.models.ItemType'>)
Why is the error occurring and how can I fix it?
(text is an HTML response using UTF-8 encoding)
Update
I added a debug call to the model; this is the debug output:
2014-10-06 09:38:53,342 DEBUG views 2. Item Type ąęćńółśż created
2014-10-06 09:38:53,342 DEBUG views 4. Item Type ąęćńółśż created
2014-10-06 09:38:53,344 DEBUG models 1. Item Type ąęćńółśż created
2014-10-06 09:38:53,358 DEBUG models 1. Item Type ąęćńółśż created
So why can't debug 3 print it?
Update 2: the problem is here:
itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})
If I change it to
itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':u'ĄĆĘŃŁÓŚ'})
everything works.
So how do I convert it to unicode? unicode(name.group(1)) doesn't work.
After two days of fighting with my own shadow, I found a solution. It is not a workaround for this one case but a change of mindset, and it means I have to refactor all the code.
- Assume that EVERY STRING is UNICODE. If one is not, fix it.
- Do not use "%s" or "something"; ALWAYS use u"%s" and u"cośtam".
- In every model that has a models.CharField() or another text-oriented field, I override the save() method.
For example:
class ItemType(models.Model):
    name = models.CharField(max_length=100)

    def save(self, *args, **kwargs):
        if isinstance(self.name, str):
            self.name = self.name.decode("utf-8")
        super(ItemType, self).save(*args, **kwargs)
Explanation: if the name somehow ends up as a str rather than unicode, convert it to unicode.
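The same normalization can be pulled out into a small helper so it does not have to be repeated in every save(). This is only a minimal sketch, not part of my original code: the name ensure_text is hypothetical, and it assumes that incoming byte strings are UTF-8 encoded, as the HTML response here is. It is written so that it behaves the same on Python 2.7 (where bytes is an alias of str) and on Python 3.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string.

    Byte strings are decoded with the given encoding (assumed UTF-8);
    anything else is returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# UTF-8 bytes of a word with Polish diacritics.
raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"
text = ensure_text(raw)

# Text that is already unicode passes through untouched.
same = ensure_text(text)
print(text == same)
```

Inside save() the check then shrinks to one line, self.name = ensure_text(self.name).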
How I found this:
I wondered what type the text in a models.CharField has, and found that if you fill it with unicode it is unicode, and if you fill it with str it is str. So if you filled it "by hand" with unicode in one place, and a regular expression filled it with str in another, the result is unexpected.
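The regular expression part can be checked in isolation: re.search returns groups of the same type as the string it searched. The sketch below uses made-up sample data (raw_html is hypothetical, standing in for the UTF-8 response body) and shows that decoding the response once, before matching, gives text from there on.

```python
import re

# Hypothetical UTF-8 encoded response body, as it arrives from the network.
raw_html = b"Type: \xc5\xbc\xc3\xb3\xc5\x82\xc4\x87\n"

# Matching against a byte string yields a byte-string group.
match_bytes = re.search(br"Type: (.*)", raw_html)
group_bytes = match_bytes.group(1)

# Decoding first means every group taken from the match is already text.
html = raw_html.decode("utf-8")
match_text = re.search(u"Type: (.*)", html)
group_text = match_text.group(1)

print("%s / %s" % (type(group_bytes).__name__, type(group_text).__name__))
```

Decoding at the boundary, right where the response is read, is what keeps str values from reaching the model in the first place.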
The biggest problem with unicode and str is that print displays both without any complaint:
>>> text_str = "żółć"
>>> text_uni = u"żółć"
>>> print text_str
żółć
>>> print text_uni
żółć
so you can't see the difference.
But if you inspect the values without print:
>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_uni
u'\u017c\xf3\u0142\u0107'
A glaring difference.
If there were some tweak to change the behavior of print (and similar functions) to this:
>>> print text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> print text_uni
żółć
it would be much easier to debug: if you can see the diacritics, that's OK; if not, that's bad.
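Something close to that is available without any tweak: format the value with %r instead of %s when logging. This is only a debugging aid sketched here, not something from my original code, and the helper name describe is hypothetical. repr shows the escaped form, so a byte string and a text string no longer look the same in the log.

```python
def describe(value):
    """Return the type name and repr of value, for use in debug logs."""
    return "%s %r" % (type(value).__name__, value)


raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"
text = raw.decode("utf-8")

raw_line = describe(raw)
text_line = describe(text)

print(raw_line)               # escaped bytes: this value is not text yet
print(raw_line != text_line)  # the two values no longer look alike
```

With the logging module the same effect comes from logger.debug("Item Type %r created", name.group(1)).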
Using decode('utf-8') led me to the solution:
>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_str.decode('utf-8')
u'\u017c\xf3\u0142\u0107'
>>> text_uni
u'\u017c\xf3\u0142\u0107'
VOILA!
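As far as I can tell, this also explains the original traceback: when a str meets a unicode string, Python 2 decodes the str implicitly with the ascii codec, which cannot handle bytes above 127. The sketch below makes that implicit step explicit so it runs the same way on Python 2.7 and 3. It is an illustration of the mechanism, not the code Django actually executes.

```python
raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"

try:
    raw.decode("ascii")      # what the implicit conversion attempts
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False         # bytes above 127 are not valid ASCII

text = raw.decode("utf-8")   # the explicit conversion with the right codec

print(ascii_ok)
print("%d bytes, %d characters" % (len(raw), len(text)))
```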