Grep for multiple patterns in large files
I have a large number of large log files (each log file is about 200MB and I have 200GB of data).
Every 10 minutes the server writes about 10 thousand parameters (each with a time stamp) to the log file. Out of those 10K parameters, I want to extract 100 into a new file.
First I used grep with 1 parameter; then LC_ALL=C made it a little faster; then fgrep, which was also a little faster. Then I used parallel:
parallel -j 2 --pipe --block 20M
and finally I was able to extract 1 parameter from every 200 MB in about 5 seconds.
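Putting those pieces together, the single-parameter run looked roughly like this (param1, log.txt and param1.txt are placeholder names):
parallel -j 2 --pipe --block 20M 'LC_ALL=C fgrep param1' < log.txt > param1.txt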
BUT ... when I bundle multiple parameters into one grep call
parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt
then the grep time increases roughly linearly with the number of patterns (it now takes a few minutes to grep a single file). (Note that I had to use egrep for multiple patterns, since plain grep didn't accept the alternation.)
Is there a faster / better way to solve this problem?
Please note that I don't need a regex, because the patterns I'm looking for are fixed strings. I just want to extract the lines that contain specific strings.
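In other words, the fixed strings could just as well be read from a file with fgrep -f instead of being glued into one egrep alternation; a sketch, where patterns.txt is a hypothetical file holding one parameter name per line:
LC_ALL=C fgrep -f patterns.txt log.txt > extracted.txt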
In light of the comments above, I made another test. I generated a file with the command md5deep -rZ (size: 319 MB) and randomly selected 100 md5 checksums from it (each 32 chars long).
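One way such a pattern file can be built (a sketch; it assumes the md5deep output is in a file called md5 and that the checksum occupies the first 32 characters of each line):
shuf -n 100 md5 | cut -c1-32 > 100_lines_patt_file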
time egrep '100|fixed|strings' md5 >/dev/null

real 0m16.888s
user 0m16.714s
sys 0m0.172s

versus

time fgrep -f 100_lines_patt_file md5 >/dev/null

real 0m1.379s
user 0m1.220s
sys 0m0.158s
Roughly 12 times faster than egrep.
So if you only get about a 0.3s improvement between egrep and fgrep, IMHO that means:
- your I/O is what slows you down
The egrep run time is then limited not by CPU or memory but by I/O, and that (IMHO) is why you don't get any speed improvement with fgrep.
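If I/O is not the limit on your side, the pattern file plugs straight into the parallel pipeline from the question; a sketch (patterns.txt being a hypothetical file with the 100 fixed strings, one per line):
parallel -j 2 --pipe --block 20M 'LC_ALL=C fgrep -f patterns.txt' < log.txt > extracted.txt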