Grep for multiple patterns in large files
I have a large number of large log files (each log file is about 200MB and I have 200GB of data).
Every 10 minutes the server writes about 10 thousand parameters (each with a time stamp) to the log file. Out of those 10K parameters, I want to extract 100 into a new file.
First I used grep with 1 parameter; then LC_ALL=C made it a little faster; then fgrep, which was also a little faster. Then I used parallel:
parallel -j 2 --pipe --block 20M
and finally I was able to extract 1 parameter from every 200 MB in about 5 seconds.
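Putting those pieces together, the single-parameter run looked roughly like this (param1, log.txt and param1.txt are placeholder names):
parallel -j 2 --pipe --block 20M 'LC_ALL=C fgrep param1' < log.txt > param1.txt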
BUT ... when I bundle multiple parameters into one grep call
parallel -j 2 --pipe --block 20M "egrep -n 'param1|param2|...|param100'" < log.txt
then the grep time increases roughly linearly with the number of patterns (it now takes a few minutes to grep a single file). (Note that I had to use egrep for multiple patterns, since plain grep didn't accept the alternation.)
Is there a faster / better way to solve this problem?
Please note that I don't need a regex, because the patterns I'm looking for are fixed strings. I just want to extract the lines that contain specific strings.
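In other words, the fixed strings could just as well be read from a file with fgrep -f instead of being glued into one egrep alternation; a sketch, where patterns.txt is a hypothetical file holding one parameter name per line:
LC_ALL=C fgrep -f patterns.txt log.txt > extracted.txt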
In light of the comments above, I made another test. I generated a file with the command md5deep -rZ (size: 319 MB) and randomly selected 100 md5 checksums from it (each 32 chars long).
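One way such a pattern file can be built (a sketch; it assumes the md5deep output is in a file called md5 and that the checksum occupies the first 32 characters of each line):
shuf -n 100 md5 | cut -c1-32 > 100_lines_patt_file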
time egrep '100|fixed|strings' md5 >/dev/null

real 0m16.888s
user 0m16.714s
sys 0m0.172s

versus

time fgrep -f 100_lines_patt_file md5 >/dev/null

real 0m1.379s
user 0m1.220s
sys 0m0.158s
Roughly 12 times faster than egrep.
So if you only get about a 0.3s improvement between egrep and fgrep, IMHO that means:
- your I/O is what slows you down
The egrep run time is then limited not by CPU or memory but by I/O, and that (IMHO) is why you don't get any speed improvement with fgrep.
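If I/O is not the limit on your side, the pattern file plugs straight into the parallel pipeline from the question; a sketch (patterns.txt being a hypothetical file with the 100 fixed strings, one per line):
parallel -j 2 --pipe --block 20M 'LC_ALL=C fgrep -f patterns.txt' < log.txt > extracted.txt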