Chinese characters become gibberish when using the pandas read_stata() function

Question

I am trying to read a Stata .dta file with the Python pandas package using the read_stata() function. The file contains many Chinese characters, but after reading they all come out as gibberish. Any suggestions?

python pandas stata

Olivier ma 10 Aug 15 at 7:21

1 answer

Martijn pieters · Accepted Answer · 2015-08-10T07:28:12+0000

You need to specify the codec to use; by default, the text is decoded as ISO-8859-1 (Latin-1):

pandas.read_stata(filename, encoding=codec_to_use)
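To see why the default produces gibberish, here is a minimal sketch using only the standard library. It assumes, purely for illustration, that the strings were stored as GBK; the sample text and variable names are made up and do not come from the asker's file.

```python
# Hypothetical sample: Chinese text as it might be stored in an older .dta file.
# GBK is an assumption here; the real file may use a different codec.
original = "中文字符"
raw_bytes = original.encode("gbk")

# ISO-8859-1 maps every byte to exactly one character, so decoding never
# fails, but multi-byte Chinese characters come out as unrelated symbols.
garbled = raw_bytes.decode("iso-8859-1")

# For the same reason the damage is reversible: encoding the gibberish back
# to ISO-8859-1 recovers the original bytes, which can then be decoded with
# the codec that was actually used.
repaired = garbled.encode("iso-8859-1").decode("gbk")
```

The same round trip can be applied to strings that have already been read with the wrong codec, as long as the original codec is known.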

See the pandas.read_stata() documentation:

encoding : string, None or encoding
Encoding used to parse the files. Note that Stata doesn't support unicode. None defaults to iso-8859-1.

For Chinese, I assume the codec used is one of the GB* codecs (gb18030, gbk, gb2312) or one of the UTF codecs (UTF-8, UTF-16 or UTF-32). Despite the comment in the pandas documentation above, I see that Stata 14 now supports Unicode and uses UTF-8 to do so.

Also see the Standard Encodings page for an overview of the supported codecs.
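If you do not know which codec was used, one rough approach is to try the candidates in order and keep the first one that decodes without error. The helper below is a hypothetical sketch, not part of pandas. It deliberately tries only UTF-8 and gb18030: UTF-8 is strict enough to reject most GB-encoded bytes, while UTF-16 and UTF-32 accept almost any input and would give false positives. A clean decode is not proof that the codec is right, so check the result by eye.

```python
def first_decodable(raw, candidates=("utf-8", "gb18030")):
    """Return (codec, text) for the first candidate that decodes raw cleanly.

    Hypothetical helper for illustration; returns (None, None) if no
    candidate works.
    """
    for codec in candidates:
        try:
            return codec, raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return None, None


# Made-up sample bytes standing in for string data taken from a file.
sample_gbk = "中文".encode("gbk")
sample_utf8 = "中文".encode("utf-8")

guess_gbk = first_decodable(sample_gbk)
guess_utf8 = first_decodable(sample_utf8)
```

gb18030 is listed rather than gbk or gb2312 because it is a superset of both, so it also decodes text written with either of the older codecs.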
