Recursively search a directory of binaries for a hex sequence?
The current commands I use to find some hex values (for example 0A 8b 02) include:
find . -type f -not -name "*.png" -exec xxd -p {} \; | grep "0a8b02" || xargs -0 -P 4
Can this be improved by considering the following goals:
- find files recursively
- display offset and filename
- exclude files with certain extensions (the example above skips .png files)
- speed: the search should handle 200,000 files (50KB to 1MB each), about 2GB in total.
I'm not too sure whether xargs works correctly for 4 CPUs. I also find it hard to print the filename when grep finds a match, since its input is piped from xxd. Any suggestions?
IF:
- you have GNU grep
- AND the hex bytes you are searching for NEVER contain newlines (0xa) [1]
  - If they contain NUL (0x0), you must provide the grep search string via a file (-f) rather than as a direct argument.
the following command will get you there, using the example of searching for 0e 8b 02:
LC_ALL=C find . -type f -not -name "*.png" -exec grep -FHoab $'\x0e\x8b\x02' {} + |
LC_ALL=C cut -d: -f1-2
The grep command produces output lines of the form:
<filename>:<byte-offset>:<matched-bytes>
which LC_ALL=C cut -d: -f1-2 then reduces to:
<filename>:<byte-offset>
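To see this output format on known data, here is a minimal sketch that runs the same pipeline against a throwaway tree. The directory layout and file names are made up for illustration, and the pattern is built with printf octal escapes (the same three bytes as the ANSI C-quoted form) so that the script does not depend on the shell supporting that quoting style. It assumes GNU grep.

```shell
# Build a throwaway tree (hypothetical file names) with the bytes 0e 8b 02 at a known offset.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
printf 'AAAA\016\213\002BBBB' > "$tmpdir/sub/sample.bin"   # match starts at byte offset 4
printf '\016\213\002' > "$tmpdir/skipped.png"              # excluded by -not -name "*.png"

# The same three bytes as the ANSI C-quoted string, written as octal escapes for printf.
pattern=$(printf '\016\213\002')

# Requires GNU grep; cut drops the raw matched bytes from each output line.
result=$(cd "$tmpdir" &&
  LC_ALL=C find . -type f -not -name "*.png" -exec grep -FHoab "$pattern" {} + |
  LC_ALL=C cut -d: -f1-2)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With GNU grep this should report ./sub/sample.bin:4 and nothing for the .png file.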
The command almost works with BSD grep, except that the byte offset reported is invariably that of the start of the line on which the pattern matched.
In other words: the byte offset is only correct if no newline characters precede the match in the file.
In addition, BSD grep does not support NUL (0x0) bytes as part of the search string, even when it is provided via a file using -f.
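For GNU grep, the NUL case mentioned above can be sketched as follows: the search bytes are written to a pattern file and passed with -f, because a NUL byte cannot be part of a command-line argument. The file names pattern.bin and sample.bin are hypothetical, and this assumes a GNU grep build that accepts NUL bytes in a -f pattern file.

```shell
tmpdir=$(mktemp -d)
printf 'AA\000\213\002BB' > "$tmpdir/sample.bin"   # 00 8b 02 starts at byte offset 2
printf '\000\213\002' > "$tmpdir/pattern.bin"      # raw search bytes, no trailing newline

# -f reads the search string from the file; the remaining options are as in the main command.
nul_result=$(LC_ALL=C grep -FHoab -f "$tmpdir/pattern.bin" "$tmpdir/sample.bin" |
  LC_ALL=C cut -d: -f1-2)

printf '%s\n' "${nul_result##*/}"   # strip the temporary directory prefix for display
rm -rf "$tmpdir"
```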
- Note that there is no parallel processing, only a few grep invocations: find's -exec ... +, like xargs, passes as many filenames as will fit on a command line to grep at once.
- By letting grep search for the byte sequence directly, there is no need for xxd:
  - The sequence is specified as an ANSI C-quoted string, which means the shell expands the escape sequences to literal bytes, which allows grep to search for the resulting string as a literal (via -F), which is faster. ANSI C quoting is documented in the bash manual, but it works in zsh (and ksh) too.
  - A GNU grep alternative is to use -P (support for PCRE, Perl-compatible regular expressions) with unexpanded escape sequences, but this will be slower: grep -PHoab '\x0e\x8b\x02'
- LC_ALL=C ensures that grep treats each byte as its own character without applying any encoding rules.
- -F treats the search string as a literal (not a regex).
- -H prepends the relevant input filename to each output line; note that grep does this implicitly when given more than 1 filename argument.
- -o reports only the matching strings (byte sequences), not the entire line (the concept of a line is meaningless in binary files). [2]
- -a treats binary files as if they were text files (without this, grep would only print the text "Binary file <filename> matches" for binary input files with matches).
- -b reports the byte offsets of matches.
If it is sufficient to find at most 1 match for a given input file, add -m 1 (grep then stops reading a file after its first matching line).
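Since the question asked about using 4 CPUs, here is a hedged sketch of a parallel variant built on find -print0 and xargs -0 -P 4. This is not part of the command above; it assumes GNU find, xargs, and grep. The batch size passed to -n is arbitrary (1 here so that the tiny demo really starts several greps; something like 500 is more sensible on real data), and whether it is actually faster depends on whether the search is limited by CPU or by disk I/O.

```shell
tmpdir=$(mktemp -d)
printf 'AAAA\016\213\002' > "$tmpdir/a.bin"   # match at byte offset 4
printf '\016\213\002BBBB' > "$tmpdir/b.bin"   # match at byte offset 0
printf '\016\213\002' > "$tmpdir/c.png"       # excluded

pattern=$(printf '\016\213\002')

# -print0 / -0 keep unusual file names intact; -P 4 runs up to four greps at once.
# --line-buffered makes each grep write whole lines, which limits interleaved output.
# The output order is not deterministic across runs, hence the final sort.
parallel_result=$(cd "$tmpdir" &&
  LC_ALL=C find . -type f -not -name "*.png" -print0 |
  LC_ALL=C xargs -0 -P 4 -n 1 grep -FHoab --line-buffered "$pattern" |
  LC_ALL=C cut -d: -f1-2 | sort)

printf '%s\n' "$parallel_result"
rm -rf "$tmpdir"
```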
[1] Newlines cannot be used because grep invariably treats newlines in a search pattern string as separating multiple search patterns. Also, grep is line-based, so you cannot match across lines. GNU grep's --null-data option, which splits the input by NUL bytes, might help, but only if your search byte sequence does not also contain NUL bytes; you would also need to represent your byte values as escape sequences in a regex combined with -P, because you need to use the escape sequence \n instead of actual newlines.
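As a rough sketch of that workaround, the following searches for 0a 8b 02 (the sequence from the question, which contains a newline). It assumes GNU grep built with PCRE support and GNU cut (for -z), and an input file without NUL bytes; sample.bin is a hypothetical file name.

```shell
tmpdir=$(mktemp -d)
printf 'AAAA\012\213\002BBBB' > "$tmpdir/sample.bin"   # 0a 8b 02 starts at byte offset 4

# --null-data makes NUL the record separator for input and output, so the newline
# byte is ordinary data. The single-quoted escapes are interpreted by PCRE (-P).
# cut -z reads the NUL-terminated records; tr turns the terminators back into newlines.
newline_result=$(LC_ALL=C grep -PHoab --null-data '\x0a\x8b\x02' "$tmpdir/sample.bin" |
  LC_ALL=C cut -z -d: -f1-2 | tr '\0' '\n')

printf '%s\n' "${newline_result##*/}"   # strip the temporary directory prefix for display
rm -rf "$tmpdir"
```

Because the matched bytes themselves contain a newline, ordinary line-based post-processing would split each record in two, which is why cut runs with -z here.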
[2] -o is needed to make -b report the byte offset of the match as opposed to that of the start of the line (as noted, BSD grep always does the latter, unfortunately). It is also useful to report only the matches themselves here, because printing the entire line would produce unpredictably long output lines, given that there is no concept of lines in binary files. Either way, outputting bytes from a binary file can cause strange rendering behavior in the terminal.
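The effect of -o on the reported offset can be checked with a small experiment. This is only an illustration with a made-up file; it assumes GNU grep.

```shell
tmpdir=$(mktemp -d)
# One newline at offset 2, so the second "line" starts at offset 3 and the match at offset 5.
printf 'AB\nCD\016\213\002' > "$tmpdir/sample.bin"
pattern=$(printf '\016\213\002')

# Only the offset field is kept; the raw bytes after the first colon are discarded.
with_o=$(LC_ALL=C grep -Foab "$pattern" "$tmpdir/sample.bin" | LC_ALL=C cut -d: -f1)
without_o=$(LC_ALL=C grep -Fab "$pattern" "$tmpdir/sample.bin" | LC_ALL=C cut -d: -f1)

printf 'with -o: %s, without -o: %s\n' "$with_o" "$without_o"
rm -rf "$tmpdir"
```

With -o the offset is that of the match (5); without it, grep reports where the containing line starts (3).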