Fastest way to load huge .dat into an array
I have searched Stack Exchange thoroughly for a neat solution to load a huge (~2 GB) .dat file into a numpy array, but couldn't find one. So far I have managed to load it as a list very quickly (< 1 minute):
lines = []
f = open('myhugefile0')
for line in f:
    lines.append(line)
f.close()
Using np.loadtxt freezes my computer and takes several minutes to load (~10 minutes). How can I open the file as an array while avoiding whatever is the bottleneck in np.loadtxt?
EDIT:
The input data is a floating-point array with shape (200000, 5181). One example line:
2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15 1.92038e-15 1.54218e-15 1.30739e-15 1.09205e-15 8.53416e-16 7.71566e-16 7.58353e-16 7.58362e-16 8.81664e-16 1.09204e-15 1.27305e-15 1.58008e-15
etc.
Thanks.
import numpy as np

def load_big_file(fname):
    '''Only works for a well-formed text file of space-separated doubles.'''
    rows = []  # unknown number of lines, so use a list
    with open(fname) as f:
        for line in f:
            line = [float(s) for s in line.split()]
            rows.append(np.array(line, dtype=np.double))
    return np.vstack(rows)  # convert the list of vectors to a 2D array
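A minimal usage sketch, assuming the file name 'myhugefile0' from the question:

    data = load_big_file('myhugefile0')   # 'myhugefile0' is the file name from the question
    print(data.shape, data.dtype)         # lines x values per line, float64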
An alternative solution, if the number of rows and columns is known, could be:
def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            for icol, s in enumerate(line.split()):
                x[irow, icol] = float(s)
    return x
This way, you don't need to allocate all the intermediate lists.
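A usage sketch for this version; the row and column counts below are placeholders, so substitute the actual dimensions of your file:

    nrow, ncol = 200000, 5181   # placeholder dimensions, replace with your file's
    data = load_known_size('myhugefile0', nrow, ncol)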
EDIT: It looks like the second solution is a bit slower; the list comprehension is probably faster than the explicit for loop. Combining the two solutions, and using the trick that NumPy does an implicit conversion from string to float (which I only just discovered), this might be faster:
def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            x[irow, :] = line.split()
    return x
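To see the implicit conversion trick in isolation, here is a small sketch using the first four values of the example line from the question; assigning a list of strings into a float64 array makes NumPy parse each string as a double:

    import numpy as np

    # Assigning strings into a float64 array parses each string as a double;
    # this is the same trick as x[irow, :] = line.split() above.
    row = np.empty(4, dtype=np.double)
    row[:] = '2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15'.split()
    print(row)   # the four values, now stored as float64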
To get a further speedup, you would probably have to use code written in C or Cython. I would be interested to know how long these functions take to open your files.
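If you want to compare them on your data, a rough timing sketch (the file name and dimensions are placeholders):

    import time

    start = time.perf_counter()
    a = load_big_file('myhugefile0')                     # placeholder file name
    print('load_big_file:   %.1f s' % (time.perf_counter() - start))

    start = time.perf_counter()
    b = load_known_size('myhugefile0', 200000, 5181)     # placeholder dimensions
    print('load_known_size: %.1f s' % (time.perf_counter() - start))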