Pandas 0.14.1 StataReader - reading .dta files

I am trying to import a large dataset from Stata 13 into pandas using StataReader. This worked fine with pandas 0.13.1, but after I updated to 0.14.1, reading .dta files became dramatically slower. Does anyone know what changed (I couldn't find any StataReader changes in the What's New section of the pandas website) and/or how to work around it?

Steps to reproduce my problem:

  • Create a large dataset in Stata 13:

    clear
    
    set obs 11500
    forvalues i = 1/8000{
    gen var`i' = 1
    }
    
    saveold bigdataset, replace
    
          

  • Try to read it in pandas using StataReader:

    from pandas.io.stata import StataReader
    
    reader = StataReader('bigdataset.dta')
    data = reader.data()
    
          

Using pandas 0.13.1 this takes about 220 seconds, which is acceptable, but using pandas 0.14.1 the read had still not finished after about 20 minutes.

When I test this problem with a smaller dataset:

  • Create a smaller dataset in Stata 13:

    clear
    
    set obs 11500
    forvalues i = 1/1000{
    gen var`i' = 1
    }
    
    saveold smalldataset, replace
    
          

  • Try to read it in pandas using StataReader:

    from pandas.io.stata import StataReader
    
    reader = StataReader('smalldataset.dta')
    data = reader.data()
    
          

Using pandas 0.13.1 it takes about 20 seconds, but using pandas 0.14.1 it takes about 300 seconds.
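To make timings like the ones above easier to repeat across pandas versions, a small harness along these lines might help. This is only a sketch: `time_read` is a hypothetical helper (not part of pandas) that times any callable loader, so the same function can wrap the `StataReader` call from the steps above under each installed version.

```python
import time


def time_read(loader, path, repeats=1):
    """Call loader(path) `repeats` times; return (best_seconds, last_result).

    `loader` is any callable that takes a file path, so the pandas-specific
    call stays outside this helper (an assumption made for illustration).
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = loader(path)
        best = min(best, time.perf_counter() - start)
    return best, result
```

With the reader from the question, this could be called as `time_read(lambda p: StataReader(p).data(), 'smalldataset.dta')` and the reported seconds compared between 0.13.1 and 0.14.1.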

I would really like to upgrade to a newer version of pandas and work with my data, which is the size of bigdataset.dta. Does anyone know how I can import my data efficiently?

1 answer


For anyone who stumbles upon this and is interested in an answer: I posted this issue on the pandas GitHub page, as suggested by Roberto, and the developers found and fixed a performance issue. It now works great on the master branch!
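For readers on a later pandas release, one further way to cut load time on very wide files may be to read only the columns that are needed. The sketch below assumes a pandas version in which the top-level `pd.read_stata` accepts a `columns` argument. `read_stata_subset` is a hypothetical wrapper, and the small demo file merely stands in for bigdataset.dta.

```python
import os
import tempfile

import pandas as pd


def read_stata_subset(path, columns=None):
    """Load a .dta file, optionally keeping only the named columns."""
    return pd.read_stata(path, columns=columns)


# Build a small stand-in for bigdataset.dta (20 columns of ones, 100 rows).
demo = pd.DataFrame({"var%d" % i: [1] * 100 for i in range(1, 21)})
demo_path = os.path.join(tempfile.mkdtemp(), "demo.dta")
demo.to_stata(demo_path, write_index=False)

# Read back just two of the columns instead of the whole file.
subset = read_stata_subset(demo_path, columns=["var1", "var2"])
print(subset.shape)
```

Whether this helps in practice depends on how many of the 8000 variables are actually needed for the analysis.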


