Fast and reliable technique for changing buffer bytes in Ruby?

Question

I want to read the contents of a binary file, apply a bitwise NOT to every byte in the file buffer, and then write the modified buffer to another file on disk. I am currently using something like the following:

data = nil

::File.open( 'somefile.bin', 'rb' ) do | f |
    data = f.read( f.stat.size )
end

# unpack can sometimes throw an out of memory exception
raw_bytes = data.unpack( 'C*' )

raw_bytes.map! do | byte |
    ~byte
end

::File.open( 'somefile.bin.not', 'wb' ) do | f |
    f.write( raw_bytes.pack( 'C*' ) )
end

This works, but unpacking sometimes throws an out-of-memory exception. Is it possible to edit the buffer data directly, without unpacking it into an array? (I chose to unpack so that I could use map! to change the bytes.)

This matters because it has to be done on 100 thousand files, all 30 MB in size. The above solution works but is not reliable because of the memory issue. I believe that avoiding unpacking and modifying the data buffer directly can avoid this.

Can anyone improve my existing solution? Many thanks.

+3

ruby

mrmax 26 nov. '14 at 9:38

2 answers

I tried reading 1 MB at a time instead of keeping the whole file in memory. In my tests neither version crashed, so I can't be sure this fixes the problem, but chances are it does. As a bonus, I also measured a modest 5% increase in performance (don't ask me how xD). Here it is:

File.open( 'somefile.bin', 'rb' ) do | file |
    File.open( 'somefile.bin.not', 'wb' ) do | out |
        until file.eof?
            buffer = file.read( 1024*1024 ).unpack( 'C*' ).map do | byte |
                ~byte
            end

            out.write( buffer.pack( 'C*' ) )
        end
    end
end

It would be nice if you could test it in your environment and tell me how it turned out afterwards.

+1

SlySherZ 26 nov. 14 at 12:01

Stefan · Accepted Answer · 2014-11-26T13:15:59+0000

I believe that avoiding unpacking and modifying the data buffer directly can avoid this.

Your data buffer is a binary string, that is, a sequence of characters in the range 0x00 to 0xFF. You can flip the bits of each character by mapping it to the character at the same position in the reversed range, 0xFF down to 0x00:

0x00 (00000000) -> 0xFF (11111111)
0x01 (00000001) -> 0xFE (11111110)
0x02 (00000010) -> 0xFD (11111101)
0x03 (00000011) -> 0xFC (11111100)
...
0x7E (01111110) -> 0x81 (10000001)
0x7F (01111111) -> 0x80 (10000000)
0x80 (10000000) -> 0x7F (01111111)
0x81 (10000001) -> 0x7E (01111110)
...
0xFC (11111100) -> 0x03 (00000011)
0xFD (11111101) -> 0x02 (00000010)
0xFE (11111110) -> 0x01 (00000001)
0xFF (11111111) -> 0x00 (00000000)
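
The table can be checked with a few lines of Ruby. This is only a sketch (the variable name inverted is made up for illustration): ~byte on a Ruby Integer is negative, so the result is masked with 0xFF to keep the low 8 bits.

```ruby
# 8-bit NOT of every possible byte value. ~byte alone is negative in Ruby,
# so mask with 0xFF to keep only the low 8 bits.
inverted = (0..255).map { |byte| ~byte & 0xFF }

# The result is the same range in reverse order: 0xFF, 0xFE, ..., 0x01, 0x00.
inverted == (0..255).to_a.reverse                 #=> true
format('0x%02X -> 0x%02X', 0x7E, inverted[0x7E])  #=> "0x7E -> 0x81"
```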

The fastest way to apply a character mapping to a string is probably String#tr. You just pass two strings a and b, and tr replaces each character from a with the corresponding character in b.

a = (0..255).map(&:chr).join #=> "\x00\x01\x02...\xFD\xFE\xFF"
b = a.reverse                #=> "\xFF\xFE\xFD...\x02\x01\x00"

Since "-" and "\\" have a special meaning in tr, they must be escaped:

a.gsub!(/[\\-]/, '\\\\\0')
b.gsub!(/[\\-]/, '\\\\\0')
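
Put together, the two escaped tables can be wrapped in a small helper and checked against the unpack/map/pack approach from the question. This is a minimal sketch, not a definitive implementation: the names escape, from, to and invert_bytes are hypothetical, and String#b is used on the assumption that the input should always be treated as raw bytes.

```ruby
# All 256 byte values as one binary (ASCII-8BIT) string.
all_bytes = (0..255).map(&:chr).join.b

# Escape "-" and "\" so that tr treats them as literal characters.
escape = ->(s) { s.gsub(/[\\-]/, '\\\\\0') }
from   = escape.call(all_bytes)          # "\x00\x01...\xFE\xFF"
to     = escape.call(all_bytes.reverse)  # "\xFF\xFE...\x01\x00"

# Hypothetical helper: returns a new binary string with every byte inverted.
invert_bytes = ->(data) { data.b.tr(from, to) }

# The sample includes the two characters that needed escaping.
sample = "\x00\x01\x7F\x80\xFF-\\".b
invert_bytes.call(sample) == sample.unpack('C*').map(&:~).pack('C*')  #=> true
```

Building the tables once and reusing them matters when the same mapping is applied to many files.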

Let's see how it performs:

require 'benchmark'

@data = IO.read('/dev/random', 30_000_000)

@a = (0..255).map(&:chr).join
@b = @a.reverse

@a.gsub!(/[\\-]/, '\\\\\0')
@b.gsub!(/[\\-]/, '\\\\\0')

Benchmark.bm(5) do |x|
  x.report("pack:") { @data.unpack('C*').map(&:~).pack('C*') }
  x.report("tr:")   { @data.tr(@a, @b) }
end

Results:

            user     system      total        real
pack:   4.780000   0.150000   4.930000 (  5.082274)
tr:     0.070000   0.000000   0.070000 (  0.078761)
