How do I read a portion of a binary file with numpy?

I am converting a MATLAB script to numpy but am having some problems reading data from a binary file. Is there an equivalent to fseek when using fromfile to skip the beginning of a file? These are the kinds of extractions I need to do:

fid = fopen(fname);
fseek(fid, 8, 'bof');
second = fread(fid, 1, 'schar');
fseek(fid, 100, 'bof');
total_cycles = fread(fid, 1, 'uint32', 0, 'l');
start_cycle = fread(fid, 1, 'uint32', 0, 'l');

      

Thanks!

+11




3 answers


You can seek with a file object in the usual way and then pass that file object to fromfile. Here's a complete example:



import numpy as np
import os

# int32 is 4 bytes per element, so byte offset 256 is element 64
data = np.arange(100, dtype=np.int32)
data.tofile("temp")  # save the data

with open("temp", "rb") as f:  # reopen the file
    f.seek(256, os.SEEK_SET)  # seek
    x = np.fromfile(f, dtype=np.int32)  # read the data into numpy

print(x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
# 89 90 91 92 93 94 95 96 97 98 99]
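Applied to the MATLAB snippet in the question, the same seek-then-fromfile pattern might look like the sketch below. The sample file, its path, and the stored values are invented for illustration. Passing count=1 limits each read to a single element, and '<u4' spells out little-endian uint32 to match the 'l' flag in the fread calls.

```python
import os
import struct
import tempfile

import numpy as np

# Build a small sample file (hypothetical contents): 108 bytes with
# known values at byte offsets 8 and 100.
buf = bytearray(108)
struct.pack_into("<b", buf, 8, -5)          # signed char at byte 8
struct.pack_into("<II", buf, 100, 1234, 7)  # two little-endian uint32 at byte 100

path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(buf)

with open(path, "rb") as f:
    f.seek(8, os.SEEK_SET)                                  # fseek(fid, 8, 'bof')
    second = np.fromfile(f, dtype="i1", count=1)[0]         # fread(fid, 1, 'schar')
    f.seek(100, os.SEEK_SET)                                # fseek(fid, 100, 'bof')
    total_cycles = np.fromfile(f, dtype="<u4", count=1)[0]  # first uint32
    start_cycle = np.fromfile(f, dtype="<u4", count=1)[0]   # next uint32, read back to back

print(second, total_cycles, start_cycle)
```

Recent NumPy versions (1.17 and later, as far as I know) also accept an offset argument in bytes on np.fromfile, which can replace the explicit seek for binary files.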

      

+23




There is probably a better answer, but when I faced this problem, I had a file whose different parts I already wanted to access separately, which gave me an easy solution.

For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use fromstring (or frombuffer, its replacement in newer NumPy) instead:

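import mmap
from contextlib import closing
import numpy as np

# np.frombuffer replaces the deprecated binary mode of np.fromstring
with open('chunkyfoo.bin', 'rb') as f:
    with closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
        a1 = np.frombuffer(m[6:1030])
        a2 = np.frombuffer(m[1030:])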

      

This is similar to what you want to do. Except, of course, that in real life the offset and length of a1 and a2 will probably depend on the header, rather than being fixed constants.



The header is just m[:6], and you can parse it by explicitly pulling it apart, using the struct module, or whatever else you would do once you had read the data. But, if you prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work without affecting a1 and a2.
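As a rough illustration of parsing the header with struct and letting it drive the array offsets, here is a sketch. The header layout (a uint16 version followed by a uint32 element count, both little-endian) is purely an assumption made up for this example, as are the array contents; an in-memory bytes object stands in for the mapped file.

```python
import struct

import numpy as np

# Hypothetical layout: 6-byte header (uint16 version + uint32 element count,
# little-endian), then two float64 arrays with that many elements each.
n = 128  # 128 * 8 bytes = 1024 bytes per array
a1_src = np.arange(n, dtype="<f8")
a2_src = a1_src * 2
blob = struct.pack("<HI", 1, n) + a1_src.tobytes() + a2_src.tobytes()

# Parse the header, then use it to locate the two arrays.
header_size = struct.calcsize("<HI")  # 6 bytes
version, count = struct.unpack_from("<HI", blob, 0)

a1 = np.frombuffer(blob, dtype="<f8", count=count, offset=header_size)
a2 = np.frombuffer(blob, dtype="<f8", count=count, offset=header_size + a1.nbytes)

print(version, count, a1[:3], a2[:3])
```

The offset and count arguments of np.frombuffer avoid slicing the buffer by hand, so the same calls should work on an mmap object as well as on bytes.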

An alternative, which I have used for a different, non-numpy-related project, is to create a wrapper file object, like this:

class SeekedFileWrapper(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()
    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        return self.fileobj.seek(offset, whence)
    # ... delegate everything else unchanged

      

I did the "delegate everything else unchanged" part by building a list of attributes at construction time and using that in __getattr__, but you probably want something less hacky. numpy only relies on a handful of methods of the file object, and I think they are properly documented, so just delegate those explicitly. But I think the mmap solution makes more sense here, unless you are trying to mechanically port a bunch of explicit seek-based code. (You would think that mmap would also give you the option of leaving the result as a numpy.memmap instead of a numpy.array, which lets numpy have more control over, and feedback from, the paging, etc. But it is actually quite tricky to get a numpy.memmap and an mmap to work together.)
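If the goal is only a memory-mapped array that starts past the header, numpy.memmap takes an offset argument of its own, which may be enough without involving a separate mmap object. A minimal sketch, assuming a file shaped like the chunkyfoo.bin described above and float64 data; the header bytes, the values, and the temporary path are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Write a file shaped like chunkyfoo.bin: a 6-byte header followed by
# two 1024-byte arrays (128 float64 values each).
first = np.arange(128, dtype=np.float64)
second = first + 1000.0
path = os.path.join(tempfile.mkdtemp(), "chunkyfoo.bin")
with open(path, "wb") as f:
    f.write(b"HEADER")         # placeholder 6-byte header
    f.write(first.tobytes())
    f.write(second.tobytes())

# Map each array read-only, skipping the header (and the first array) by byte offset.
a1 = np.memmap(path, dtype=np.float64, mode="r", offset=6, shape=(128,))
a2 = np.memmap(path, dtype=np.float64, mode="r", offset=6 + 1024, shape=(128,))

print(a1[:3], a2[:3])
```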

+3




This is what I do when I have to read arbitrary data from a heterogeneous binary file.
Numpy allows you to interpret a bit pattern in arbitrary ways by changing the dtype of the array. The MATLAB code in the question reads a char and two uint values.

Read this article (easy reading at the user level, not written for scientists) about what can be achieved by changing the dtype, strides, and dimensions of an array.

import numpy as np

data = np.arange(10, dtype=np.int32)  # 4-byte ints, 40 bytes in total
data.tofile('f')

x = np.fromfile('f', dtype='u1')  # read the file as raw bytes
print(x.size)
# 40

second = x[8]
print('second', second)
# second 2

total_cycles = x[8:12]
print('total_cycles', total_cycles)
total_cycles = total_cycles.view('u4')  # reinterpret the same 4 bytes as one uint32
print('total_cycles', total_cycles)
# total_cycles [2 0 0 0]       !endianness
# total_cycles [2]

start_cycle = x[12:16].view('u4')
print('start_cycle', start_cycle)
# start_cycle [3]

x = x.view('u4')  # the whole buffer as uint32
print('x', x)
# x [0 1 2 3 4 5 6 7 8 9]

x[3] = 423  # the views share memory, so start_cycle changes too
print('start_cycle', start_cycle)
# start_cycle [423]
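Taking the dtype idea one step further, a structured dtype with explicit byte offsets can describe the three fields from the question in one go. This is only a sketch: the field names come from the question, but the 108-byte record size and the sample values are assumptions made for the example.

```python
import struct

import numpy as np

# Sample bytes (hypothetical contents): a signed char at byte 8 and two
# little-endian uint32 values at bytes 100 and 104.
buf = bytearray(108)
struct.pack_into("<b", buf, 8, -5)
struct.pack_into("<II", buf, 100, 1234, 7)

# One record covering the whole 108-byte region, with each field pinned
# to its byte offset; the gaps between fields are simply ignored.
record = np.dtype({
    "names": ["second", "total_cycles", "start_cycle"],
    "formats": ["i1", "<u4", "<u4"],
    "offsets": [8, 100, 104],
    "itemsize": 108,
})

rec = np.frombuffer(bytes(buf), dtype=record, count=1)[0]
print(rec["second"], rec["total_cycles"], rec["start_cycle"])
```

Writing '<u4' rather than 'u4' fixes the byte order, so the result does not depend on the endianness of the machine running the code.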

      

+1

