Chinese characters become gibberish when using the pandas read_stata() function

I am trying to read a Stata .dta file with the Python pandas package using the read_stata() function, and the file contains many Chinese characters. After reading, the text comes out badly decoded and the Chinese characters are just gibberish. Any suggestions?



1 answer


You need to specify the codec to use; by default the text is decoded as ISO-8859-1 (Latin-1):

pandas.read_stata(filename, encoding=codec_to_use)
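To see why the wrong default produces gibberish, here is a minimal sketch that uses only the standard library, so it does not depend on any particular pandas version. The sample text and the choice of GBK as the file's real encoding are assumptions made for illustration.

```python
# Hypothetical sample text; the file's real encoding is assumed to be GBK.
original = "中文数据"
raw = original.encode("gbk")      # bytes as they might be stored in the .dta file

garbled = raw.decode("latin-1")   # what decoding with the Latin-1 default yields
correct = raw.decode("gbk")       # what decoding with the matching codec yields

# Latin-1 maps every byte to exactly one character, so the damage is
# reversible: re-encode as Latin-1 to recover the bytes, then decode properly.
repaired = garbled.encode("latin-1").decode("gbk")

print(repr(garbled))
print(correct, repaired)
```

The same re-encode and decode round trip can, under the same assumption, be applied to string columns that were already read with the wrong codec.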

      

See the pandas.read_stata() documentation:



encoding : string, None or encoding
Encoding used to parse files. Please note that Stata does not support unicode. None defaults to iso-8859-1.

For Chinese, I assume the codec used is one of the gb* codecs (gb18030, gbk, gb2312) or one of the UTF codecs (UTF-8, UTF-16 or UTF-32). Despite the comment in the pandas documentation above, I see that Stata 14 now supports Unicode and uses UTF-8 to do so.
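If you do not know which codec the file was written with, one rough way to narrow it down is to try each candidate on a sample of the raw bytes. The helper below is a hypothetical sketch (the name codecs_that_decode and the candidate list are my own), and a codec that decodes without error is only a candidate, not proof that it is the right one.

```python
def codecs_that_decode(raw, candidates=("gb18030", "gbk", "gb2312", "utf-8")):
    """Return the candidate codecs that decode the raw bytes without error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)          # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Hypothetical sample: Chinese text stored as GB18030 bytes.
sample = "中文数据".encode("gb18030")
print(codecs_that_decode(sample))
```

Several gb* codecs may accept the same bytes because they overlap heavily, so it is worth looking at the decoded text before settling on one.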

Also see the Standard Encodings page for an overview of the supported codecs.
