Performance of reading a large SPSS file into a pandas DataFrame on Windows 7 (x64)

I have a large SPSS file (just over 1 million records and just under 150 columns) that I want to convert to a pandas DataFrame.

Converting the file to a list takes a few minutes, converting the list to a DataFrame takes a few more, and setting the column headers takes a few more still.

Are there any possible optimizations that I'm missing?

import pandas as pd
import numpy as np
import savReaderWriter as spss

raw_data = spss.SavReader('largefile.sav', returnHeader=True)  # this is fast
raw_data_list = list(raw_data)                                 # this is slow
data = pd.DataFrame(raw_data_list)                             # this is slow
data = data.rename(columns=data.loc[0]).iloc[1:]               # setting column headers, this is slow too
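One small change that may help, independent of the reader itself: since returnHeader=True makes the first row of the result the header, you can pass it straight to the DataFrame constructor and skip the rename/iloc pass over the full frame. A minimal sketch, assuming the same file and that materializing the reader with list() behaves as above:

import pandas as pd
import savReaderWriter as spss

raw_data = spss.SavReader('largefile.sav', returnHeader=True)
raw_data_list = list(raw_data)                # still the expensive step
header, rows = raw_data_list[0], raw_data_list[1:]
data = pd.DataFrame(rows, columns=header)     # headers set at construction time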



1 answer


You can use rawMode=True to speed things up, for example:

raw_data = spss.SavReader('largefile.sav', returnHeader=True, rawMode=True)



This way, datetime variables (if any) won't be converted to ISO strings, SPSS $sysmis values won't be converted to None, and a few other conversions are skipped as well.
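As a follow-up, a minimal sketch putting the two pieces together, with the caveat that the $sysmis handling is an assumption: with rawMode=True the skipped conversions mean system-missing numerics keep SPSS's internal marker (conventionally the most negative IEEE double), so you may need to map it to NaN yourself:

import numpy as np
import pandas as pd
import savReaderWriter as spss

raw_data = spss.SavReader('largefile.sav', returnHeader=True, rawMode=True)
raw_data_list = list(raw_data)
data = pd.DataFrame(raw_data_list[1:], columns=raw_data_list[0])

# Assumption: $sysmis is stored as the most negative double; map it to NaN.
sysmis = -np.finfo(np.float64).max
data = data.replace(sysmis, np.nan)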
