Unicode (Cyrillic), character indexing, replacing in Python

I work with Russian words written in Cyrillic. Everything works fine, except that some (but not all) Cyrillic characters are encoded as two characters when stored in a str. For example:

>>> print ["ё"]
['\xd1\x91']

This wouldn't be a problem if I didn't need to index string positions, or find where such a character is and replace it with another (say "е", without the diaeresis). Obviously, the 2 "characters" are treated as a single whole when the literal has the u prefix, as in u"ё":

>>> print [u"ё"]
[u'\u0451']

But the strs are passed in as variables, so they cannot be prefixed with u, and calling unicode() on them gives a UnicodeDecodeError ("ascii codec can't decode ...").

So ... how do I get around this? If it helps, I am using Python 2.7.
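For concreteness, a minimal session (Python 2.7) reproducing the issue: indexing the encoded str yields individual bytes, neither of which is a usable character on its own.

>>> s = '\xd1\x91'          # the same bytes as above, i.e. "ё" encoded as UTF-8
>>> len(s)
2
>>> [s[0], s[1]]
['\xd1', '\x91']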


3 answers


There are two possible situations here.

Either your str contains valid UTF-8 encoded data, or it doesn't.

If it is valid UTF-8, you can convert it to a unicode object with mystring.decode('utf-8'). Once it is a unicode instance, it is indexed by character rather than by byte, as you have noticed.
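A minimal sketch of that path, assuming the bytes really are UTF-8; the escaped literal below is just the UTF-8 encoding of the word "ёлка", chosen for illustration:

>>> raw = '\xd1\x91\xd0\xbb\xd0\xba\xd0\xb0'   # UTF-8 bytes of u'ёлка'
>>> text = raw.decode('utf-8')
>>> len(raw), len(text)
(8, 4)
>>> text[0]
u'\u0451'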



If it contains invalid byte sequences ... you are in trouble, because the question "which character does this byte represent?" no longer has a clear answer. You will have to decide what exactly you mean by "the third character" in the presence of byte sequences that don't represent any Unicode character in UTF-8 at all ...

Perhaps the easiest way to work around the problem is to pass errors='ignore' to decode(). This simply drops the invalid byte sequences and gives you only the "correct" parts of the string.
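A sketch with a deliberately invalid byte in the middle; the 'replace' handler is an alternative worth knowing, since it substitutes U+FFFD instead of silently dropping data:

>>> bad = '\xd1\x91\xff\xd0\xb5'        # "ё", a stray 0xFF byte, then "е"
>>> bad.decode('utf-8', 'ignore')
u'\u0451\u0435'
>>> bad.decode('utf-8', 'replace')
u'\u0451\ufffd\u0435'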



In fact, these are two different representations of the same character:

>>> print ["ё"]
['\xd1\x91']
>>> print [u"ё"]
[u'\u0451']

What you are seeing is the __repr__ of the items in the lists, not the __str__ version of the unicode objects.

"But strs are passed as variables and therefore cannot be prefixed with u"

You mean that the data comes in as str objects and needs to be converted to the unicode type:



>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'

You need to convert those two-byte sequences into unicode characters:

>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'

And you will see that the conversion handles them correctly.
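To tie this back to the question: once the data is unicode, indexing and the ё → е replacement work per character. A small illustration, using u'её' as a made-up sample word:

>>> word = '\xd0\xb5\xd1\x91'.decode('utf-8')   # UTF-8 bytes of u'её'
>>> word.index(u'\u0451')
1
>>> word.replace(u'\u0451', u'\u0435')
u'\u0435\u0435'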



To convert bytes to Unicode, you need to know the appropriate character encoding and call bytes.decode:

>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'

The encoding depends on the data source; it could be anything. If, for example, the data comes from a web page, see A good way to get the charset/encoding of an HTTP response in Python.
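For instance, if the page were fetched with urllib2 (just one possible source, assumed here for illustration; the URL is hypothetical), the declared charset could be read from the response headers before decoding:

>>> import urllib2
>>> response = urllib2.urlopen('http://example.com/')          # hypothetical URL
>>> charset = response.headers.getparam('charset') or 'utf-8'  # fall back to UTF-8 if none declared
>>> text = response.read().decode(charset, 'replace')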

Don't use non-ASCII characters in byte literals (this is explicitly forbidden in Python 3). Add from __future__ import unicode_literals so that all "abc" literals are treated as Unicode literals.
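A short sketch of the effect at the interactive prompt (Python 2.7):

>>> from __future__ import unicode_literals
>>> s = '\u0451'          # a plain literal, but the \u escape works because it is now a unicode literal
>>> type(s)
<type 'unicode'>
>>> s
u'\u0451'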

Note: a single user-perceived character can span multiple Unicode code points, for example:

>>> print(u'\u0435\u0308')
ё
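If such decomposed sequences need to be collapsed into single code points where possible, one option (a suggestion beyond what the answer states) is canonical composition with the standard unicodedata module:

>>> import unicodedata
>>> decomposed = u'\u0435\u0308'                  # е followed by a combining diaeresis
>>> composed = unicodedata.normalize('NFC', decomposed)
>>> len(decomposed), len(composed)
(2, 1)
>>> composed
u'\u0451'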
