AWK / GAWK performance

I have 84 million lines of XML that I am processing with gawk on Red Hat Linux. (OK, some people recommend using tools other than GAWK, but my XML doesn't have multi-line tags or any other feature that makes GAWK not the best choice for the job.)

My concern is performance.

My initial AWK script looks something like this:

# Test_1.awk
BEGIN {FS = "<|:|=";}
{
if ($3 == "SubNetwork id")
    {
    # do something
    }
}
END {
# print something
}


This performs a string comparison against each of the 84 million lines.

I noticed that "SubNetwork id" only appears on lines with 4 fields (NF == 4), so I modified the script to do fewer string comparisons:

# Test_2.awk
BEGIN {FS = "<|:|=";}
{
if (NF == 4)
    {
    if ($3 == "SubNetwork id")
        {
        # do something
        }
    }
}
END {
# print something
}


I ran it and confirmed that the NF == 4 check runs 84 million times (obviously) and the $3 == "SubNetwork id" check only 3 million times. Great: I had reduced the number of string comparisons, which I assumed were more expensive than simple integer comparisons (NF is an integer, right?).

My surprise came when I benchmarked both scenarios: in most runs Test_1 was faster than Test_2. I ran them many times to account for other processes that might be using CPU time, and in general the tests ran while the CPU was more or less idle.

My brain tells me that 84 million integer comparisons plus 3 million string comparisons should be faster than 84 million string comparisons, but there is obviously something wrong with my reasoning.
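One plausible explanation (a sketch over a made-up two-line sample, not the real file): evaluating NF requires awk to split the entire record into fields, whereas $3 only needs the record parsed up to the third field. If gawk splits fields lazily, the NF == 4 guard saves no splitting work at all; it only adds a comparison. A minimal illustration that both variants select the same lines:

```shell
# Two made-up lines standing in for the real 84M-line file.
# Test_1 style: compare $3 directly
printf '<xn:SubNetwork id="A">\n<other>\n' |
  awk 'BEGIN{FS="<|:|="} $3=="SubNetwork id"{n++} END{print n+0}'
# Test_2 style: guard with NF first -- same answer, but evaluating NF
# already forces the whole record to be split into fields
printf '<xn:SubNetwork id="A">\n<other>\n' |
  awk 'BEGIN{FS="<|:|="} NF==4 && $3=="SubNetwork id"{n++} END{print n+0}'
```

Both commands print 1, so the guard changes the cost profile, not the result.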

My XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<ConfigDataFile xmlns:un="specific.xsd" xmlns:xn="generic.xsd">
    <configData dnPrefix="Undefined">
        <xn:SubNetwork id="ROOT_1">
            <xn:SubNetwork id="ROOT_2">
                <xn:attributes>
                ...
                </xn:attributes>
            </xn:SubNetwork>
            <xn:SubNetwork id="ID_1">
            ....
            </xn:SubNetwork>
            <xn:SubNetwork id="ID_2">
            .....
            </xn:SubNetwork>
        </xn:SubNetwork>
    </configData>
</ConfigDataFile>


Any help to understand this performance issue would be appreciated.

Thanks in advance.



3 answers


I did more tests:

1- Create a large file with some data

yes 'SomeSampleText SomeOtherText 33 1970 YetAnotherText 777 abc 1 AndSomeMore' | head -12000000 > SomeData.txt


The separator is a space!

2- Run these 6 tests several times and average the time for each test. I did this on 3 different machines (with Red Hat Enterprise Linux 4)

time gawk 'BEGIN {a = 0;} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if (NF == 9) a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($1 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($9 == "AndSomeMore") a ++;} END {print "a: " a;}' SomeData.txt


3- I got these results (numbers are seconds)

-- Machine 1
10.35
39.39
38.87
10.40
7.72
12.26

-- Machine 2
8.50
32.43
31.83
9.10
6.54
9.91

-- Machine 3
12.35
13.55
12.90
14.40
9.43
14.93


It looks like searching for the /YetAnotherText/ pattern in tests 2 and 3 was very slow, except on machine 3...



4- Create another big file with some data with different delimiters

yes "<SomeSampleText:SomeOtherText=33>1970<YetAnotherText:777=abc>1<AndSomeMore>" | head -12000000 > SomeData2.txt


5- Run the same 6 tests, changing FS

time gawk 'BEGIN {FS = "<|:|=";} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if (NF == 8) a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($2 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($8 == "AndSomeMore>") a ++;} END {print "a: " a;}' SomeData2.txt


6- I got these results (I only did this for Machine 3, sorry)

66.17
33.11
32.16
76.77
37.17
77.20


My findings (see also the comments from user @user31264):

  • Parsing and field splitting seem to be faster with a single simple delimiter than with multiple delimiters.
  • Usually getting $N is faster than getting $M, where N < M.
  • In some cases, searching for /pattern/ in the whole line is faster than comparing $N == "pattern", especially if N is not one of the first fields in the line.
  • Getting NF can be slow, because the whole line has to be parsed and the fields counted; more so when there are multiple delimiters.
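The findings above can be combined: a /pattern/ pre-filter is checked against the raw line, so the fields only need to be consulted on candidate lines. A minimal sketch on a made-up two-line sample:

```shell
# The regex runs against the raw line first; $5 is only examined
# (and the line only field-split, if splitting is lazy) when the
# pre-filter matches.
printf 'SomeSampleText x 33 1970 YetAnotherText 777 abc 1 AndSomeMore\nno match here\n' |
  awk '/YetAnotherText/ && $5=="YetAnotherText"{a++} END{print a+0}'
```

This prints 1: only the first line passes both the pre-filter and the field comparison.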


Below is a simple test. The first command writes 10,000,000 lines of "a b c d" to file a. The awk used here is GNU Awk 4.1.3.

[~] yes 'a b c d' | head -10000000 > a
[~] time awk '{if(NF==5)print("a")}' a
2.344u 0.012s 0:02.36 99.5%     0+0k 0+0io 0pf+0w
[~] time awk '{if(NF==5)print("a")}' a
2.364u 0.008s 0:02.37 99.5%     0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.876u 0.024s 0:02.90 99.6%     0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.880u 0.020s 0:02.90 100.0%    0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.540u 0.020s 0:02.56 100.0%    0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.404u 0.004s 0:02.41 99.5%     0+0k 0+0io 0pf+0w


As you can see, checking $1 is faster than checking $4, because in the first case AWK only needs to parse the line up to the first word. If you only check NF, AWK just counts the words, which was even faster in my case; in your case, however, counting all the words would be slower than parsing the line up to the third field.

Finally, we can speed up AWK like this:

[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.376u 0.020s 0:01.40 99.2%     0+0k 0+0io 0pf+0w
[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.372u 0.028s 0:01.40 99.2%     0+0k 0+0io 0pf+0w


because /Hahaha/ on its own doesn't require any field parsing.



If you add /SubNetwork id/ before the { in your program, it can speed up your job.

If you only process lines containing "SubNetwork id" and ignore all others, you may want to do

grep 'SubNetwork id' your_input_file | awk -f prog.awk


This will speed things up a lot since grep is much faster than awk.

Finally, another way to speed up awk is to use mawk, which is much faster than gawk. Unfortunately, it sometimes gives different results than gawk, so you should always check it.



Another simple test.

The file is 3,000,000 lines built from your whole sample. The result shown is a representative time taken after the 3rd run (to account for caching and other OS effects).

# time awk 'BEGIN{FS="[<:=]"}NF>=4{a++}END{print a+0}' YourFile
780100
real    0m1.89s user    0m1.74s sys     0m0.01s

# time awk 'BEGIN{FS="<|:|="}NF>=4{a++}END{print a+0}' YourFile
780100
real    0m2.00s user    0m1.91s sys     0m0.02s

# time awk 'BEGIN{FS="<|:|="}NF>=4&&/:SubNetwork/{a++}END{print a+0}' YourFile
780100
real    0m3.09s user    0m2.93s sys     0m0.02s

# time awk 'BEGIN{FS=":SubNetwork"}NF>=2{a++}END{print a+0}' YourFile
1560200
real    0m1.32s user    0m1.27s sys     0m0.02s

# time awk '/:SubNetwork/{a++}END{print a}' YourFile
1560200
real    0m3.23s user    0m3.06s sys     0m0.02s


This shows that using the field separator ":SubNetwork" is the fastest.

For the next step, you might need to re-split the fields, with something like: FS="<|:|=";$1=$1"";$0=$0""; ... your action ...; FS=":SubNetwork"
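That re-split trick can be sketched like this (a toy one-line example, not benchmarked): scan cheaply with FS=":SubNetwork", and only on matching lines switch FS and reassign $0, which forces awk to re-split the record with the new separator set:

```shell
echo '<xn:SubNetwork id="ROOT_1">' |
  awk '
    BEGIN { FS = ":SubNetwork" }  # cheap separator for the first pass
    NF >= 2 {                     # the keyword is present on this line
      FS = "<|:|="                # switch to the full separator set
      $0 = $0 ""                  # reassigning $0 re-splits with the new FS
      print $3                    # -> SubNetwork id
      FS = ":SubNetwork"          # restore the cheap FS for the next record
    }'
```

The expensive multi-delimiter split only runs on the lines that survive the cheap NF >= 2 test.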

An additional test, using the first field as a pre-filter:

# time awk '$1 == "<xn:SubNetwork" || $1 == "<xn:Attributes" {a++}END{print a+0}' YourFile
780100
real    0m1.29s user    0m1.20s sys     0m0.03s








