Recursively search a directory of binaries for a hex sequence?

The command I currently use to find a hex sequence (for example 0A 8b 02) is:

find . -type f -not -name "*.png" -exec xxd -p {} \; | grep "0a8b02" || xargs -0 -P 4

Can this be improved by considering the following goals:

  • find files recursively
  • display offset and filename
  • exclude files with certain extensions (the example above skips .png files)
  • speed: search should handle 200,000 files (50KB to 1MB) for a total of ~ 2GB.

I'm not sure whether xargs works correctly for 4 CPUs. I also find it hard to print the filename when grep finds a match, since its input is piped from xxd. Any suggestions?


1 answer


If:

  • you have GNU grep, and
  • the hex bytes you are searching for never contain newlines (0x0a) [1]
    • if they contain NUL (0x0), you must pass the search string to grep via a file (-f) rather than as a direct argument

the following command will get you there, using the example of searching for 0e 8b 02:

LC_ALL=C find . -type f -not -name "*.png" -exec grep -FHoab $'\x{0e}\x{8b}\x{02}' {} + |
  LC_ALL=C cut -d: -f1-2

      

The grep command produces output lines of the form:

<filename>:<byte-offset>:<matched-bytes>

which LC_ALL=C cut -d: -f1-2 then reduces to <filename>:<byte-offset>
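
As a quick sanity check of the output format, here is a minimal sketch that runs the same search against a throwaway directory. The file names (sample.bin, skipped.png) are made up for illustration, and the pattern is built with printf so that the snippet does not depend on ANSI C quoting; it holds the same three bytes as the search string above. GNU grep is assumed.

```shell
# Throwaway directory with one binary to search and one .png to skip
# (both file names are hypothetical).
demo=$(mktemp -d)
printf 'AB\016\213\002CD' > "$demo/sample.bin"   # bytes 0e 8b 02 start at offset 2
printf '\016\213\002' > "$demo/skipped.png"      # excluded by -not -name "*.png"

# Same three bytes as the ANSI C-quoted search string, written as octal escapes.
pattern=$(printf '\016\213\002')

# Run from inside the directory so reported paths start with ./
result=$(cd "$demo" && LC_ALL=C find . -type f -not -name "*.png" \
  -exec grep -FHoab "$pattern" {} + | LC_ALL=C cut -d: -f1-2)
printf '%s\n' "$result"

rm -rf "$demo"
```

With GNU grep this should print ./sample.bin:2, because the reported offset is zero-based and two bytes precede the match.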



The command almost works with BSD grep, except that the reported byte offset is invariably that of the start of the line on which the pattern was matched.
In other words: the byte offset will only be correct if no newline characters precede the match in the file.
In addition, BSD grep does not support NUL (0x0) bytes as part of the search string, even when it is provided via a file using -f.
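
For GNU grep, the NUL case from the preconditions can be sketched as follows. A NUL byte cannot be passed as a command-line argument, since arguments are NUL-terminated strings, so the pattern is written to a file and handed to grep with -f. The byte sequence 0e 00 02 and the file names are made up for illustration, and the behavior relies on GNU grep accepting NUL bytes in a pattern file, as stated above.

```shell
# Throwaway directory with one binary that contains 0e 00 02 at offset 1
# (file name is hypothetical).
demo=$(mktemp -d)
printf 'A\016\000\002B' > "$demo/nul.bin"

# The pattern file lives outside the searched directory so it does not match itself.
pat=$(mktemp)
printf '\016\000\002' > "$pat"

# -f reads the search string from the file; -F still treats it as a literal.
nul_result=$(cd "$demo" && LC_ALL=C find . -type f \
  -exec grep -FHoab -f "$pat" {} + | LC_ALL=C cut -d: -f1-2)
printf '%s\n' "$nul_result"

rm -rf "$demo" "$pat"
```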

  • Note that there will be no parallel processing, only a few grep invocations, thanks to using find with -exec ... +, which, like xargs, passes as many filenames as will fit on one command line to grep at once.
  • By having grep search for the byte sequence directly, there is no need for xxd:
    • The sequence is specified as an ANSI C-quoted string, which means the escape sequences are expanded to literal bytes by the shell; grep can then search for the resulting string as a literal (via -F), which is faster.
      ANSI C quoting is documented in the bash manual, but it works in zsh (and ksh) too.
      • A GNU grep alternative is to use -P (support for PCRE, Perl-compatible regular expressions) with the escape sequences left unexpanded in a single-quoted string, but this will be slower: grep -PHoab '\x{0e}\x{8b}\x{02}'

    • LC_ALL=C ensures that grep treats each byte as its own character without applying any encoding rules.
    • -F treats the search string as a literal (not a regex)
    • -H prepends the relevant input filename to each output line; note that grep does this implicitly if more than 1 filename argument is given
    • -o outputs only the matching strings (byte sequences), not the entire line (the concept of a line has no meaning in binary files) [2]
    • -a treats binary files as if they were text files (without this, grep would only print the text Binary file <filename> matches for binary input files with matches)
    • -b reports the byte offsets of matches

If it is enough to find at most 1 match per input file, add -m 1.
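
If parallel processing is wanted, one possible variation (not part of the command above) is to have find emit NUL-separated filenames and let xargs run several grep processes at once. This is only a sketch: the batch size (-n 100) and job count (-P 4) are arbitrary choices, the file names are made up, -P is assumed to be available (it is in GNU and BSD xargs), and the pattern is again built with printf rather than ANSI C quoting. Output order is not deterministic across jobs, so the result is sorted here.

```shell
# Throwaway directory with two binaries to search and one .png to skip
# (all file names are hypothetical).
demo=$(mktemp -d)
printf 'AB\016\213\002' > "$demo/a.bin"   # match at offset 2
printf '\016\213\002' > "$demo/b.bin"     # match at offset 0
printf '\016\213\002' > "$demo/c.png"     # excluded

pattern=$(printf '\016\213\002')

# -print0 / -0 keep unusual filenames intact; -P 4 runs up to 4 greps at once;
# -n 100 hands each grep at most 100 files; -H keeps the filename in every line.
parallel_result=$(cd "$demo" && LC_ALL=C find . -type f -not -name "*.png" -print0 |
  LC_ALL=C xargs -0 -n 100 -P 4 grep -FHoab "$pattern" |
  LC_ALL=C cut -d: -f1-2 | sort)
printf '%s\n' "$parallel_result"

rm -rf "$demo"
```

Whether this is actually faster than a few large grep invocations depends on whether the workload is CPU-bound or disk-bound, so it is worth timing both on the real data.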


[1] Newlines cannot be used because grep invariably treats newlines in a search pattern string as separating multiple search patterns. Also, grep is line-based, so you cannot match across lines. GNU grep's --null-data option, which splits the input by NUL bytes instead, might help, but only if your search byte sequence does not also contain NUL bytes; you would also have to represent your byte values as escape sequences in a regex combined with -P, because you need to use the escape sequence \n instead of actual newlines.

[2] -o is needed to make -b report the byte offset of the match as opposed to that of the start of the line (as stated, BSD grep always does the latter, unfortunately). It is also useful to report only the matches themselves here, as trying to print the entire line would result in unpredictably long output lines, given that there is no concept of lines in binary files. Either way, outputting bytes from a binary file can cause strange rendering behavior in the terminal.
