Pandas 0.14.1 StataReader - reading .dta files

Question

I am trying to import a large dataset from Stata 13 into pandas using StataReader. This worked fine with pandas 0.13.1, but after I updated to 0.14.1, reading .dta files became dramatically slower. Does anyone know what changed (I couldn't find any StataReader changes in the What's New section of the pandas website) and/or how to work around it?

Steps to reproduce my problem:

Create a large dataset in Stata 13:

clear

set obs 11500
forvalues i = 1/8000{
gen var`i' = 1
}

saveold bigdataset, replace

Try to read it in pandas using StataReader:

from pandas.io.stata import StataReader

reader = StataReader('bigdataset.dta')
data = reader.data()

With pandas 0.13.1 the read takes about 220 seconds, which is acceptable. With pandas 0.14.1 it had still not finished after about 20 minutes.

When I test this problem with a smaller dataset:

Create a smaller dataset in Stata 13:

clear

set obs 11500
forvalues i = 1/1000{
gen var`i' = 1
}

saveold smalldataset, replace

Try to read it in pandas using StataReader:

from pandas.io.stata import StataReader

reader = StataReader('smalldataset.dta')
data = reader.data()

Using pandas 0.13.1 it takes about 20 seconds, but using pandas 0.14.1 it takes about 300 seconds.
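For anyone who wants to compare versions on their own machine, the timings above can be measured with a small helper along these lines. This is only a sketch: `time_call` and `read_dta` are hypothetical names introduced here, not part of pandas, and `read_dta` simply wraps the same `StataReader(...).data()` call used in this question (the API as it existed in pandas 0.13/0.14).

```python
from timeit import default_timer


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = default_timer()
    result = func(*args, **kwargs)
    elapsed = default_timer() - start
    return result, elapsed


def read_dta(path):
    # Hypothetical wrapper around the call from the question.
    # pandas is imported lazily so time_call works without it installed.
    from pandas.io.stata import StataReader
    reader = StataReader(path)
    # data() is the method used in this question (pandas 0.13/0.14).
    return reader.data()


# Example (needs pandas and a file created by the Stata steps above):
# data, seconds = time_call(read_dta, 'smalldataset.dta')
# print('read took %.1f s' % seconds)
```

Running the commented example once under each pandas version gives directly comparable numbers for the same file.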

I would really like to upgrade to a newer version of pandas and work with my data, which is the size of bigdataset.dta. Does anyone know how I can import my data efficiently?

python python-2.7 pandas stata

David Aug 14 14 at 22:14

1 answer

David · Answer 1 · 2014-08-19T20:55:25+0000

For anyone who stumbles upon this and is interested in an answer: I posted this issue on the pandas GitHub page, as suggested by Roberto, and the developers found and fixed a performance issue. It works great on the master branch right now!
