Chinese characters become gibberish when using the pandas read_stata() function
You need to specify the codec to use; by default, the text is decoded as ISO-8859-1 (Latin-1):
pandas.read_stata(filename, encoding=codec_to_use)
See the pandas.read_stata() documentation:
encoding : string, None or encoding
Encoding used to parse the files. Note that Stata doesn't support unicode. None defaults to iso-8859-1.
For Chinese, I assume the codec used is one of the gb* codecs (gb18030, gbk, gb2312) or one of the UTF codecs (UTF-8, UTF-16 or UTF-32). Despite the comment in the pandas docs above, I see that Stata 14 now supports Unicode and uses UTF-8 to do so.
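If you are not sure which codec the file uses, a quick way to check is to try the candidates on some raw bytes. The following is only a minimal sketch using the standard library, not part of pandas: the helper name guess_decode and the candidate order are my own assumptions, and the sample bytes stand in for what a GBK-encoded file would contain. It also shows why a Latin-1 default yields gibberish, and how text that was already mis-decoded as Latin-1 can be recovered.

```python
# Candidate codecs, strictest first (an assumed order): UTF-8 rejects most
# non-UTF-8 input, while gb18030 is a superset of gbk and gb2312.
CANDIDATES = ("utf-8", "gb18030")


def guess_decode(raw, candidates=CANDIDATES):
    """Hypothetical helper: return (codec, text) for the first codec that decodes raw."""
    for codec in candidates:
        try:
            return codec, raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so it never fails,
    # but the result may be gibberish.
    return "latin-1", raw.decode("latin-1")


raw = "中文".encode("gbk")        # bytes as a GBK-encoded file would store them
garbled = raw.decode("latin-1")   # what decoding with the Latin-1 default gives
print(repr(garbled))              # not the original Chinese text

codec, text = guess_decode(raw)
print(codec, text)

# Text already mis-decoded as Latin-1 can be turned back into bytes
# and decoded again with the right codec.
repaired = garbled.encode("latin-1").decode("gbk")
print(repaired)
```

Once you know which codec works, that is the name to pass as the encoding argument. Note that a clean decode does not prove the guess is right, so check the result by eye.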
Also see the Standard Encodings page for an overview of the supported codecs.