Fastest way to load huge .dat into an array

I have searched Stack Exchange thoroughly for a neat way to load a huge (~2 GB) .dat file into a numpy array, but couldn't find a proper solution. So far I have managed to load it as a list reasonably fast (< 1 minute):

lines = []  # read the file line by line into a Python list
f = open('myhugefile0')
for line in f:
    lines.append(line)
f.close()

      

Using np.loadtxt freezes my computer and takes several minutes to load (~10 minutes). How can I load the file into an array without the overhead that seems to be the bottleneck of np.loadtxt?
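For reference, the call is essentially just np.loadtxt with default options, something like:

import numpy as np

data = np.loadtxt('myhugefile0')  # this is the slow part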

EDIT:

The input data is an array of floats with dimensions (200000, 5181). One example line:

2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15 1.92038e-15 1.54218e-15 1.30739e-15 1.09205e-15 8.53416e-16 7.71566e-16 7.58353e-16 7.58362e-16 8.81664e-16 1.09204e-15 1.27305e-15 1.58008e-15

etc.

Thanks

+3




1 answer


Looking at the source of numpy.loadtxt, it seems that it contains a lot of code to handle many different possible formats. If you have a well-defined input file, it is not difficult to write your own function optimized for your particular file format. Something like this (untested):

import numpy as np

def load_big_file(fname):
    '''Only works for well-formed text files of space-separated doubles.'''
    rows = []  # unknown number of lines, so use a list
    with open(fname) as f:
        for line in f:
            values = [float(s) for s in line.split()]
            rows.append(np.array(values, dtype=np.double))
    return np.vstack(rows)  # convert the list of 1D vectors to a 2D array
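Used on the file from your question, it would look something like this (filename and shape taken from your post):

data = load_big_file('myhugefile0')
print(data.shape, data.dtype)  # expect roughly (200000, 5181) float64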

      

An alternative solution, if the number of rows and columns is known, could be:

def load_known_size(fname, nrow, ncol):
    '''Preallocate the full array when the number of rows and columns is known in advance.'''
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            for icol, s in enumerate(line.split()):
                x[irow, icol] = float(s)
    return x

      



This way, you do not need to allocate all the intermediate lists.

EDIT: It seems the second solution is a bit slower; the list comprehension is probably faster than the explicit for loop. Combining the two solutions and using the trick that NumPy does an implicit conversion from string to float (which I only just discovered), this might be faster:

def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            # NumPy converts the list of strings to floats during the assignment
            x[irow, :] = line.split()
    return x
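Both versions are called with the dimensions from your edit, e.g.:

x = load_known_size('myhugefile0', 200000, 5181)
print(x.shape)  # (200000, 5181)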

      

To get a further speedup, you would probably have to use code written in C or Cython. I would be interested to know how long these functions take to open your files.
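Another thing that might be worth timing, without dropping to C, is letting NumPy parse the whole file in a single call. This is only a sketch (untested, and assuming the file really is nothing but whitespace-separated numbers that fit in memory):

def load_fromfile(fname, nrow, ncol):
    # np.fromfile with a text separator parses the entire file in one pass
    data = np.fromfile(fname, dtype=np.double, sep=' ')
    return data.reshape(nrow, ncol)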

+3








