Fastest way to read and split binary data files in Python
I have a processing script that loads binary data files containing uint16 data and does various processing in chunks of 6400 values at a time. The code was originally written in Matlab, but since the analysis codes are written in Python, we wanted to streamline the process by doing everything in Python. The problem is that I've noticed my Python code is quite a bit slower than Matlab's fread function.
Simply put, the Matlab code looks like this:
fid = fopen(filename);
frame = reshape(fread(fid,80*80,'uint16'),80,80);
While my Python code is simply:
import numpy as np
from struct import unpack

with open(filename, 'rb') as f:
    frame = np.array(unpack("H"*6400, f.read(12800))).reshape(80, 80).astype('float64')
File sizes vary widely, from 500 MB up to 400 GB, so I believe finding a faster way to parse the data in Python could pay dividends on the larger files. A 500 MB file typically has ~50,000 chunks, and this number grows linearly with file size. The difference in speed I see is roughly:
Python = 4 x 10^-4 seconds / chunk
Matlab = 6.5 x 10^-5 seconds / chunk
With the processing being the same, the data-loading step in Matlab is ~5x faster than the Python method I implemented. I've looked into methods like numpy.fromfile and numpy.memmap, but since these methods require the entire file to be opened in memory at some point, they limit the use case, as my binary files are quite large. Is there some pythonic way to do this that I'm missing? I would have thought Python would be exceptionally fast at opening and reading binary files. Any advice is much appreciated.
Write a sample file:
In [117]: dat = np.random.randint(0,1028,80*80).astype(np.uint16)
In [118]: dat.tofile('test.dat')
In [119]: dat
Out[119]: array([266, 776, 458, ..., 519, 38, 840], dtype=uint16)
Read it your way:
In [120]: import struct
In [121]: with open('test.dat','rb') as f:
...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
...:
In [122]: frame
Out[122]: array([266, 776, 458, ..., 519, 38, 840])
Read it with fromfile:
In [124]: np.fromfile('test.dat',count=6400,dtype=np.uint16)
Out[124]: array([266, 776, 458, ..., 519, 38, 840], dtype=uint16)
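As an aside, fromfile does not need the whole file in memory either: it also accepts an already-open file object together with count, and successive calls continue from the current file position. A minimal sketch of that chunked pattern (reusing test.dat; the names chunk and frame and the placeholder processing are just illustrative):

import numpy as np

# Read 6400 uint16 values per call; each np.fromfile call resumes where
# the previous one stopped, so only one chunk is held in memory at a time.
with open('test.dat', 'rb') as f:
    while True:
        chunk = np.fromfile(f, dtype=np.uint16, count=6400)
        if chunk.size < 6400:  # end of file (or trailing partial chunk)
            break
        frame = chunk.reshape(80, 80)
        # ... per-chunk processing would go here ...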
Time comparison:
In [125]: %%timeit
...: with open('test.dat','rb') as f:
...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
...:
1000 loops, best of 3: 898 µs per loop
In [126]: timeit np.fromfile('test.dat',count=6400,dtype=np.uint16)
The slowest run took 5.41 times longer...
10000 loops, best of 3: 36.6 µs per loop
fromfile is much faster.
The time for struct.unpack on its own, without np.array, is 266 µs; f.read alone takes only 23. So it is the unpack plus the more general and robust np.array that takes most of the time. Reading the file itself is not the problem. (np.array can handle many kinds of input, lists of lists, lists of objects, etc., so it has to spend more time parsing and evaluating the inputs.)
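To see where the time goes, you can time the pieces separately; a rough sketch with the standard-library timeit (the helper names are just for illustration, and absolute numbers will differ by machine):

import struct
import timeit
import numpy as np

def read_only():
    # Just the raw file read.
    with open('test.dat', 'rb') as f:
        return f.read(12800)

def unpack_only():
    # Read plus struct.unpack, no np.array wrapper.
    with open('test.dat', 'rb') as f:
        return struct.unpack("H"*6400, f.read(12800))

def unpack_plus_array():
    # The full original pipeline: read, unpack, wrap in np.array.
    with open('test.dat', 'rb') as f:
        return np.array(struct.unpack("H"*6400, f.read(12800)))

# Print the average per-call time of each variant in microseconds.
for fn in (read_only, unpack_only, unpack_plus_array):
    per_call = timeit.timeit(fn, number=1000) / 1000
    print(f"{fn.__name__}: {per_call * 1e6:.0f} us per call")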
What might be a bit faster than fromfile is your read plus frombuffer:
In [133]: with open('test.dat','rb') as f:
...:     frame3 = np.frombuffer(f.read(12800),dtype=np.uint16)
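For the large files in the question, the same read-plus-frombuffer idea extends naturally to a loop that only ever holds one 12800-byte chunk in memory. A sketch (frame shape and dtype taken from the question; iter_frames is just an illustrative helper name):

import numpy as np

CHUNK_BYTES = 80 * 80 * 2  # 6400 uint16 values per frame

def iter_frames(filename):
    # Yield one 80x80 frame at a time; the whole file is never loaded at once.
    with open(filename, 'rb') as f:
        while True:
            buf = f.read(CHUNK_BYTES)
            if len(buf) < CHUNK_BYTES:  # end of file (or trailing partial chunk)
                break
            # frombuffer returns a read-only view of the bytes; copy or cast
            # (e.g. .astype('float64')) if the processing needs to modify it.
            yield np.frombuffer(buf, dtype=np.uint16).reshape(80, 80)

for frame in iter_frames('test.dat'):
    pass  # per-chunk processing goes here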