Fast and reliable technique for changing buffer bytes in Ruby?

I want to read the contents of a binary file, apply a bitwise NOT to every byte in the file buffer, and then write the modified buffer to another file on disk. I am currently using something like the following:

data = nil

::File.open( 'somefile.bin', 'rb' ) do | f |
    data = f.read( f.stat.size )
end

# unpack can sometimes throw an out of memory exception
raw_bytes = data.unpack( 'C*' )

raw_bytes.map! do | byte |
    ~byte
end

::File.open( 'somefile.bin.not', 'wb' ) do | f |
    f.write( raw_bytes.pack( 'C*' ) )
end

      

This works, but unpacking sometimes throws an out-of-memory exception. Is it possible to edit the buffer data directly, without unpacking it into an array? (I only unpacked it so that I could use map! to change the bytes.)

This matters because it needs to be done on 100 thousand files (all files are 30 MB). The above solution works but is not reliable because of the memory issue. I believe that avoiding unpacking and modifying the data buffer directly can avoid this.

Can anyone improve my existing solution? Many thanks.

2 answers


"I believe that avoiding unpacking and modifying the data buffer directly can avoid this."

Your data buffer is a binary string, that is, a sequence of bytes in the range 0x00 to 0xFF. You can flip the bits of each byte by mapping it to the corresponding value in the reversed range, 0xFF down to 0x00:

0x00 (00000000) -> 0xFF (11111111)
0x01 (00000001) -> 0xFE (11111110)
0x02 (00000010) -> 0xFD (11111101)
0x03 (00000011) -> 0xFC (11111100)
...
0x7E (01111110) -> 0x81 (10000001)
0x7F (01111111) -> 0x80 (10000000)
0x80 (10000000) -> 0x7F (01111111)
0x81 (10000001) -> 0x7E (01111110)
...
0xFC (11111100) -> 0x03 (00000011)
0xFD (11111101) -> 0x02 (00000010)
0xFE (11111110) -> 0x01 (00000001)
0xFF (11111111) -> 0x00 (00000000)
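As a quick sanity check on the table above (a minimal sketch, with an illustrative variable name that is not from the answer), the bitwise NOT of a byte, masked to 8 bits, should equal 255 minus that byte, which is exactly its mirror position in the reversed range:

```ruby
# For every byte value, NOT masked to 8 bits should equal 255 - byte,
# i.e. the byte at the mirrored position in the reversed range.
mismatches = (0..255).reject { |byte| (~byte & 0xFF) == 255 - byte }

puts mismatches.empty?  # prints true
```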

      

The fastest way to apply a character mapping to a string is probably String#tr. You just pass two strings, a and b, and tr replaces every character from a with the corresponding character in b.

a = (0..255).map(&:chr).join #=> "\x00\x01\x02...\xFD\xFE\xFF"
b = a.reverse                #=> "\xFF\xFE\xFD...\x02\x01\x00"

      

Since "-" and "\\" have a special meaning in tr, they must be escaped:



a.gsub!(/[\\-]/, '\\\\\0')
b.gsub!(/[\\-]/, '\\\\\0')
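Before relying on the tables, it may be worth checking them against the unpack/map/pack approach for all 256 byte values. The following is only a sketch; `all_bytes`, `via_tr` and `via_pack` are illustrative names, and the comparison should print true if the tables are built and escaped correctly:

```ruby
# Build the translation tables as described above.
a = (0..255).map(&:chr).join
b = a.reverse
a.gsub!(/[\\-]/, '\\\\\0')
b.gsub!(/[\\-]/, '\\\\\0')

# Every possible byte value, as one binary string.
all_bytes = (0..255).to_a.pack('C*')

# Invert the bytes both ways and compare the results.
via_tr   = all_bytes.tr(a, b)
via_pack = all_bytes.unpack('C*').map { |byte| ~byte & 0xFF }.pack('C*')

puts via_tr == via_pack
```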

      

Let's see how it performs:

require 'benchmark'

@data = IO.read('/dev/random', 30_000_000)

@a = (0..255).map(&:chr).join
@b = @a.reverse

@a.gsub!(/[\\-]/, '\\\\\0')
@b.gsub!(/[\\-]/, '\\\\\0')

Benchmark.bm(5) do |x|
  x.report("pack:") { @data.unpack('C*').map(&:~).pack('C*') }
  x.report("tr:")   { @data.tr(@a, @b) }
end

      

Results:

            user     system      total        real
pack:   4.780000   0.150000   4.930000 (  5.082274)
tr:     0.070000   0.000000   0.070000 (  0.078761)
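Applied to the original problem, the tr approach might look like the sketch below. The helpers `not_tables` and `invert_file` are hypothetical names introduced here, and the demo runs on a tiny temporary file rather than a real 30 MB input:

```ruby
require 'tmpdir'

# Hypothetical helper: builds the escaped from/to tables for String#tr.
def not_tables
  from = (0..255).map(&:chr).join
  to   = from.reverse
  [from, to].map { |s| s.gsub(/[\\-]/, '\\\\\0') }
end

# Hypothetical helper: writes the bitwise NOT of src_path to dst_path.
def invert_file(src_path, dst_path, tables = not_tables)
  data = File.binread(src_path)  # whole file as one binary string
  data.tr!(*tables)              # modifies the buffer in place, no array
  File.binwrite(dst_path, data)
end

# Demo on a three-byte file in a temporary directory.
inverted = Dir.mktmpdir do |dir|
  src = File.join(dir, 'somefile.bin')
  dst = "#{src}.not"
  File.binwrite(src, [0x00, 0x0F, 0xFF].pack('C*'))
  invert_file(src, dst)
  File.binread(dst)
end

p inverted.bytes  # prints [255, 240, 0]
```

When processing many files, the tables can be built once and passed in as the third argument instead of being rebuilt on every call.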

      


I tried reading 1 MB at a time instead of keeping everything in memory. In my tests neither version crashed, so I can't be sure this fixes the problem, but chances are it does. As a bonus, I also got a modest 5% increase in performance (don't ask me how xD), according to my tests. Here it is:

File.open( 'somefile.bin', 'rb' ) do | file |
    File.open( 'somefile.bin.not', 'wb' ) do | out |
        until file.eof?
            buffer = file.read( 1024*1024 ).unpack( 'C*' ).map do | byte |
                ~byte
            end

            out.write( buffer.pack( 'C*' ) )
        end
    end
end
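The chunked loop can also be combined with String#tr from the other answer, which avoids building a large integer array for each chunk. This is a sketch under that assumption, not something either answer tested; `invert_file_chunked` and its `chunk_size` parameter are hypothetical names, and the demo uses a deliberately tiny chunk size to exercise several reads:

```ruby
require 'tmpdir'

# Hypothetical helper: streams src_path to dst_path, inverting every byte.
def invert_file_chunked(src_path, dst_path, chunk_size = 1024 * 1024)
  # Escaped from/to tables for String#tr, built as in the tr answer.
  from = (0..255).map(&:chr).join
  to   = from.reverse
  from, to = [from, to].map { |s| s.gsub(/[\\-]/, '\\\\\0') }

  File.open(src_path, 'rb') do |input|
    File.open(dst_path, 'wb') do |output|
      # read(n) returns nil at end of file, which ends the loop.
      while (chunk = input.read(chunk_size))
        output.write(chunk.tr(from, to))
      end
    end
  end
end

# Demo: a five-byte file processed two bytes at a time.
inverted = Dir.mktmpdir do |dir|
  src = File.join(dir, 'somefile.bin')
  dst = "#{src}.not"
  File.binwrite(src, [0x00, 0x01, 0x7F, 0x80, 0xFF].pack('C*'))
  invert_file_chunked(src, dst, 2)
  File.binread(dst)
end

p inverted.bytes  # prints [255, 254, 128, 127, 0]
```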

      



It would be nice if you could test it in your environment and tell me how it turned out afterwards.
