Fastest way to read and split binary data files in Python

I have a processing script that is designed to load binary data files as uint16 values and process them in chunks of 6400 at a time. The code was originally written in Matlab, but since the analysis codes are written in Python, we wanted to streamline the process by doing everything in Python. The problem is that I've noticed my Python code is substantially slower than Matlab's fread function.

Simply put, the Matlab code looks like this:

fid = fopen(filename);
frame = reshape(fread(fid,80*80,'uint16'),80,80);

While my Python code is simply:

import numpy as np
from struct import unpack

with open(filename, 'rb') as f:
    frame = np.array(unpack("H"*6400, f.read(12800))).reshape(80, 80).astype('float64')

The file size varies widely, from 500 MB to 400 GB, so I believe a faster way to parse the data in Python could pay dividends on the larger files. A 500 MB file typically has ~50,000 chunks, and this number grows linearly with file size. The difference in speed I see is roughly:

Python = 4 x 10^-4 seconds / chunk

Matlab = 6.5 x 10^-5 seconds / chunk

In the processing I am seeing, Matlab is ~5x faster than the Python method I implemented. I have looked into methods like numpy.fromfile and numpy.memmap, but as far as I can tell these methods require opening the entire file in memory at some point, which limits their use since my binary files are quite large. Is there some pythonic method for doing this that I am missing? I would have thought Python would be exceptionally fast at opening and reading binary files. Any advice is much appreciated.
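For reference, the surrounding loop is essentially the following (a trimmed sketch; process_frame is a placeholder for the actual analysis code, which I haven't shown):

import numpy as np
from struct import unpack

CHUNK = 6400  # values per frame (80 x 80)

with open(filename, 'rb') as f:
    while True:
        raw = f.read(CHUNK * 2)           # 2 bytes per uint16 value
        if len(raw) < CHUNK * 2:          # stop at end of file
            break
        frame = np.array(unpack("H" * CHUNK, raw)).reshape(80, 80).astype('float64')
        process_frame(frame)              # placeholder for the per-chunk processing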

1 answer


Write a sample file:

In [117]: dat = np.random.randint(0,1028,80*80).astype(np.uint16)
In [118]: dat.tofile('test.dat')
In [119]: dat
Out[119]: array([266, 776, 458, ..., 519,  38, 840], dtype=uint16)

Load it your way:

In [120]: import struct
In [121]: with open('test.dat','rb') as f:
     ...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
     ...:
In [122]: frame
Out[122]: array([266, 776, 458, ..., 519,  38, 840])

Load it with fromfile:

In [124]: np.fromfile('test.dat',count=6400,dtype=np.uint16)
Out[124]: array([266, 776, 458, ..., 519,  38, 840], dtype=uint16)
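Worth noting: fromfile also accepts an open file object and advances the file position, so you can read one chunk at a time without ever holding more than a frame in memory (a sketch, assuming the file is a plain stream of 80x80 uint16 frames):

import numpy as np

with open('test.dat', 'rb') as f:
    while True:
        # read just the next 6400 values; the file position advances automatically
        chunk = np.fromfile(f, count=6400, dtype=np.uint16)
        if chunk.size < 6400:   # partial or empty read means end of file
            break
        frame = chunk.reshape(80, 80)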

Time comparison:

In [125]: %%timeit
     ...: with open('test.dat','rb') as f:
     ...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
     ...:
1000 loops, best of 3: 898 µs per loop

In [126]: timeit np.fromfile('test.dat',count=6400,dtype=np.uint16)
The slowest run took 5.41 times longer...
10000 loops, best of 3: 36.6 µs per loop

fromfile is much faster.

The time for the struct.unpack alone, without np.array, is 266 µs; for the f.read alone, just 23 µs. So it is unpack plus the more general and robust np.array that take most of the time; reading the file itself is not the problem. (np.array can handle many kinds of input: lists of lists, lists of objects, etc., so it has to spend more time parsing and evaluating its inputs.)
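To make that breakdown concrete, here is the same call split into its stages; the per-stage times in the comments are the figures quoted above, not new measurements:

import struct
import numpy as np

with open('test.dat', 'rb') as f:
    raw = f.read(12800)                  # ~23 µs: the raw read is cheap
vals = struct.unpack("H" * 6400, raw)    # ~266 µs: parsing bytes into a Python tuple
frame = np.array(vals)                   # the remaining bulk of the ~898 µs total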

A slightly faster option than fromfile is your f.read plus np.frombuffer:

In [133]: with open('test.dat','rb') as f:
     ...:     frame3 = np.frombuffer(f.read(12800),dtype=np.uint16)
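One caveat with frombuffer: it returns a read-only view of the bytes object, so copy it (for example with astype, as in the question) before modifying values in place. Also, despite the concern in the question, np.memmap does not load the whole file up front; it maps the file lazily, and only the slices you actually touch are read from disk. A minimal sketch, assuming the file is an exact multiple of 80x80 frames:

mm = np.memmap('test.dat', dtype=np.uint16, mode='r')
n_frames = mm.size // 6400
for i in range(n_frames):
    # slicing the memmap touches only this frame's pages on disk
    frame = mm[i*6400:(i+1)*6400].reshape(80, 80).astype('float64')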

