DjangoUnicodeDecodeError: [Bad Unicode data]

Model:

class ItemType(models.Model):
  name = models.CharField(max_length=100)
  def __unicode__(self):
    logger.debug("1. Item Type %s created" % self.name)
    return self.name 

      

Code:

  (...)
    name = re.search(r"Type:(.*?)", text)
    itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})
    logger.debug("2. Item Type %s created" % name.group(1))
    logger.debug("4. Item Type %s created" % itemtype.name)
    logger.debug("3. Item Type %s created" % itemtype)

      

And the result is unexpected (for me, of course):

The first logger.debug prints "Item Type ąęńłśóć created" as expected, but the second throws an error:

DjangoUnicodeDecodeError: 'ascii' codec can't decode byte  in position : 
ordinal not in range(128). 
You passed in <ItemType: [Bad Unicode data]> (<class 'aaa.models.ItemType'>)

      

Why is the error occurring and how can I fix it?

(text is html response using utf-8 encoding)

Update:

I added debug logging to the model, and this is the debug output:

2014-10-06 09:38:53,342 DEBUG views 2. Item Type ąęćńółśż created
2014-10-06 09:38:53,342 DEBUG views 4. Item Type ąęćńółśż created
2014-10-06 09:38:53,344 DEBUG models 1. Item Type ąęćńółśż created
2014-10-06 09:38:53,358 DEBUG models 1. Item Type ąęćńółśż created

      

So why can't debug statement 3 print it?

UPDATE 2 The problem is here:

  itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})

      

If I change it to

  itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':u'ĄĆĘŃŁÓŚ'})

      

everything is good.

So how do I convert it to unicode? unicode(name.group(1)) doesn't work.

1 answer


After two days of fighting with my own shadow, I found a solution. It is not a workaround for this one case but a complete change of mindset, and I have to refactor all the code.

  • Assume that EVERY STRING is UNICODE. If one is not, fix it.

  • Do not use "%s" or "something"; ALWAYS use u"%s" and u"something".

  • In every model that has a models.CharField() or other text-oriented fields, I override the save() method.

For example:

class ItemType(models.Model):
  name = models.CharField(max_length=100)

  def save(self, *args, **kwargs):
    if isinstance(self.name, str):
      self.name=self.name.decode("utf-8")
    super(ItemType, self).save(*args, **kwargs)

      

Explanation: if name somehow ends up holding a str instead of unicode, CONVERT it to unicode.
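The same check can be factored into a helper so that each save() does not repeat it. This is only a sketch: to_text is a hypothetical name (not part of Django), and it assumes that any incoming byte string is UTF-8. It relies on bytes being an alias for str in Python 2, so the isinstance test reads the same on Python 2 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper; assumes byte strings are encoded as `encoding`.
    """
    if isinstance(value, bytes):  # on Python 2, bytes is the same type as str
        return value.decode(encoding)
    return value


raw_name = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"  # UTF-8 bytes of a word with diacritics
clean_name = to_text(raw_name)                   # decoded to text
already_text = to_text(clean_name)               # text passes through unchanged
```

Inside save(), the body would then reduce to self.name = to_text(self.name).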

How I found this:

I wondered what type the text in a models.CharField has, and found that if you assign unicode it stays unicode, and if you assign str it stays str. So if you filled it "by hand" with unicode in one place, and a regular expression filled it with str somewhere else, the result is unexpected.
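Why the mix fails: when Python 2 has to turn a str into unicode without being told the encoding (which is also what unicode(name.group(1)) does), it decodes with the default codec, which is ASCII unless changed. The sketch below performs that decode explicitly, so it behaves the same on Python 3; the sample bytes are made up for illustration.

```python
raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"  # UTF-8 encoded bytes, all above 127

try:
    raw.decode("ascii")  # the step Python 2 performs implicitly
    error = None
except UnicodeDecodeError as exc:
    error = exc          # the same family of error as in the question

# Naming the real encoding succeeds.
text = raw.decode("utf-8")
```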

The biggest problem with unicode and str is that printing diacritics works without any problem for both:

>>> text_str = "żółć"
>>> text_uni = u"żółć"
>>> print text_str
żółć
>>> print text_uni
żółć

      

so you can't see the difference.



But if you evaluate the variables directly in the interpreter, which shows their repr:

>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_uni
u'\u017c\xf3\u0142\u0107'

      

A glaring difference.

If there were some tweak to change the behavior of print (and similar functions) to this:

>>> print text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> print text_uni
żółć

      

it would be much easier to debug: if you can see the diacritics, that's OK; if not, that's bad.
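print itself stays as it is, but something close to this can be approximated with a small helper applied to values before they are printed or logged. A minimal sketch, where debug_text is a hypothetical name: byte strings come back as their escaped repr, and text comes back unchanged.

```python
def debug_text(value):
    """Show byte strings as escaped repr; return text unchanged (hypothetical helper)."""
    if isinstance(value, bytes):  # Python 2 str / Python 3 bytes
        return repr(value)
    return value


text_str = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"  # undecoded UTF-8 bytes
text_uni = u"\u017c\xf3\u0142\u0107"            # the same word as text

shown_bytes = debug_text(text_str)  # escape sequences stay visible
shown_text = debug_text(text_uni)   # diacritics print normally
```

A debug call could then wrap its argument, for example logger.debug(u"Item Type %s created" % debug_text(itemtype.name)), so an undecoded name shows up as escape sequences instead of looking correct.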

Using decode('utf-8') led me to the solution:

>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_str.decode('utf-8')
u'\u017c\xf3\u0142\u0107'
>>> text_uni
u'\u017c\xf3\u0142\u0107'

      

VOILA!
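Applied to the question, the decode can happen once, right where the response is read, so the regular expression already yields text. This is a sketch under assumptions: the sample response bytes and the exact pattern are invented for illustration and are not the code from the question.

```python
import re

# Made-up stand-in for the UTF-8 encoded HTML response.
raw_response = b"Type: \xc5\xbc\xc3\xb3\xc5\x82\xc4\x87\n"

text = raw_response.decode("utf-8")       # decode once, at the boundary
match = re.search(r"Type:\s*(.*)", text)  # illustrative pattern
type_name = match.group(1)                # already text, no later decode needed
```

With the input decoded up front, every value taken from the match is text, and the model never receives a byte string in the first place.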
