How are strings stored in Python on computers?

I believe most of you who are familiar with Python have read "Dive Into Python 3". Chapter 4.3 says this:

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8 or a Python string encoded as CP-1252. "Is this a UTF-8 string?" is an invalid question.

I more or less understand what this means: strings are characters from the Unicode set, and Python can encode those characters using different encodings. However, aren't characters in Python stored as bytes on the computer anyway? For example, with s = 'strings', s is certainly stored on my computer as a byte stream '0100100101 ...' or whatever. So what exactly is the encoding used here? Is there a default one for Python?




1 answer

Python 3 distinguishes between text and binary data. Text is guaranteed to be Unicode, although as far as I can see no specific internal encoding is specified. So it could be UTF-8, UTF-16 or UTF-32¹, but you wouldn't even notice.
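A minimal sketch of what "you wouldn't even notice" means in practice: len(), indexing and ord() all work on code points, whatever the interpreter stores internally. The variable names here are only illustrative, and the string is written with escapes so the example does not depend on the source file's encoding.

```python
# "naïve €" written with escapes; 7 code points in total.
s = "na\u00efve \u20ac"

length = len(s)                      # counts code points, not bytes
code_points = [ord(ch) for ch in s]  # integer code point of each character
last = s[-1]                         # indexing returns a one-character str

print(length)
print([hex(cp) for cp in code_points])
print(last == "\u20ac")
```

Nothing in this code exposes how many bytes each character occupies in memory; that is an implementation detail of the interpreter.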

The main point here: you shouldn't care. If you want to deal with text, use text strings and access them by code point (which is the number of a single Unicode character and is independent of the internal UTF, which may organize code points into several smaller code units). If you want to work with bytes, use b"" and access them by byte. And if you want a string as a byte sequence in a specific encoding, use .encode().
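As a rough illustration of the text/bytes split, the sketch below encodes one character in several encodings. The same code point produces a different byte sequence each time, which is why an encoding only becomes a meaningful question once you call .encode(). The choice of the euro sign and the variable names are just for the example.

```python
text = "\u20ac"  # the euro sign, a single code point (U+20AC)

# Same text, four different byte sequences.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # "-le" fixes the byte order and omits the BOM
utf32 = text.encode("utf-32-le")
cp1252 = text.encode("cp1252")

for name, data in [("utf-8", utf8), ("utf-16-le", utf16),
                   ("utf-32-le", utf32), ("cp1252", cp1252)]:
    print(name, len(data), data.hex())

first_byte = utf8[0]                # indexing bytes yields an int (0-255)
round_trip = utf8.decode("utf-8")   # decoding turns bytes back into text
```

Note that len(text) is 1 throughout, while the encoded forms are 3, 2, 4 and 1 bytes long respectively.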


¹ Or even UTF-9 if anyone is crazy enough to implement Python on a PDP-10.


