awk: summing across multiple files drops lines that do not appear in every file
I am using awk to summarize multiple files of server-log parsing results. It really does speed up the final total count, but I ran into a minor issue, and the typical examples I found on the net did not help.
Here's an example:
cat file1
aa 1
bb 2
cc 3
ee 4
cat file2
aa 1
bb 2
cc 3
dd 4
cat file3
aa 1
bb 2
cc 3
ff 4
And the script:
cat test.sh
#!/bin/bash
files="file1 file2 file3"
i=0
oldname=""
for names in $files; do
    ((i++))
    if [ $i == 1 ]; then
        oldname=$names
        #echo "-- $i $names"
        shift
    else
        oldname1=$names.$$
        awk 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] != "") nn=0; nn=($2+_[$1]); print $1" "nn }' $names $oldname > $oldname1
        if [ $i -gt 2 ]; then
            rm $oldname
        fi
        oldname=$oldname1
    fi
done
echo "------------------------------ $i"
cat $oldname
When I run this, keys that appear in every file are added up, but the ones that appear in only one of the files are dropped:
./test.sh
------------------------------ 3
aa 3
bb 6
cc 9
ee 4
ff and dd don't show up in the list; from what I've seen, this comes from the NR == FNR logic.
I ran into this:
http://dbaspot.com/shell/246751-awk-comparing-two-files-problem.html
If you want all the lines in file1 that are not in file2:
awk 'NR == FNR { a[$0]; next } !($0 in a)' file2 file1
If you want only unique lines in file1 that are not in file2:
awk 'NR == FNR { a[$0]; next } !($0 in a) { print; a[$0] }' file2 file1
but this only complicates the current problem, since many other fields are duplicated.
Updates and tests after posting the question ...
I wanted to stick with awk as it seems like a much shorter way to achieve the result. The problem has not been resolved yet.
awk '{a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3
aa 3
bb 6
cc 9
ee 4
ff 4
gg 4
RESULT_SET_4 0
RESULT_SET_3 0
RESULT_SET_2 0
RESULT_SET_1 0
$ cat file1
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ff 4
$ cat file2
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ee 4
The file content is not kept in its original shape, i.e. the results are no longer under their headers; my original method kept the layout completely.
Updated expected result - headers in correct context
cat file1
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ff 4
cat file2
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
ee 4
cat file3
RESULT_SET_1
aa 1
RESULT_SET_2
bb 2
RESULT_SET_3
cc 3
RESULT_SET_4
gg 4
The test.sh awk line that produces the above is:
awk -v i=$i 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] != "") { if ($2 ~ /[0-9]/) { nn=($2+_[$1]); print $1" "nn; } else { print;} }else { print; } }' $names $oldname> $oldname1
./test.sh
------------------------------ 3
RESULT_SET_1
aa 3
RESULT_SET_2
bb 6
RESULT_SET_3
cc 9
RESULT_SET_4
ff 4
The following also adds up the sums, but destroys the required formatting:
awk '($2 != "") {a[$1]+=$2}; ($2 == "") { a[$1]=$2 } END {for (k in a) print k,a[k]} ' file1 file2 file3
aa 3
bb 6
cc 9
ee 4
ff 4
gg 4
RESULT_SET_4
RESULT_SET_3
RESULT_SET_2
RESULT_SET_1
$ awk '{a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3 | sort
aa 3
bb 6
cc 9
dd 4
ee 4
ff 4
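A self-contained sketch of this answer, recreating the three sample files in a temporary directory (the paths are just for the demo), shows that a key seen in only one file still lands in the array, so nothing is dropped:

```shell
tmp=$(mktemp -d)
printf 'aa 1\nbb 2\ncc 3\nee 4\n' > "$tmp/file1"
printf 'aa 1\nbb 2\ncc 3\ndd 4\n' > "$tmp/file2"
printf 'aa 1\nbb 2\ncc 3\nff 4\n' > "$tmp/file3"

# Sum column 2 per key across all files; every key ends up in the
# array whether it appears once or in all three files.
result=$(awk '{a[$1]+=$2} END{for (k in a) print k, a[k]}' \
    "$tmp/file1" "$tmp/file2" "$tmp/file3" | sort)
echo "$result"

rm -rf "$tmp"
```

dd, ee, and ff each appear in a single file and still come out with their value of 4.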
Edit:
A bit of a hack, but it does the job:
$ awk 'FNR==NR&&!/RESULT/{a[$1]=$2;next}($1 in a){a[$1]+=$2}END{for (k in a) print k,a[k]}' file1 file2 file3 | sort | awk '$1="RESULTS_SET_"NR"\n"$1'
RESULTS_SET_1
aa 3
RESULTS_SET_2
bb 6
RESULTS_SET_3
cc 9
RESULTS_SET_4
ff 4
You can do it in awk as sudo_O suggested, but you can also do it in pure bash.
#!/bin/bash
# We'll use an associative array, where the indexes are strings.
declare -A a
# Our list of files, in an array (not associative)
files=(file1 file2 file3)
# Walk through the array of files...
for file in "${files[@]}"; do
    # And for each file, increment the array index with the value.
    while read index value; do
        ((a[$index]+=$value))
    done < "$file"
done
# Walk through the array. ${!...} returns a list of indexes.
for i in "${!a[@]}"; do
    echo "$i ${a[$i]}"
done
And the result:
$ ./doit
dd 4
aa 3
ee 4
bb 6
ff 4
cc 9
And if you want a sorted result ... you can pipe it through sort. :)
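As a minimal sketch of the same associative-array idea, with the sample data inlined in a here-document (instead of the real files) and the final listing piped through sort:

```shell
#!/bin/bash
# Associative array keyed by the first column.
declare -A a
# Each line increments the key's running total by the value.
while read -r index value; do
    ((a[$index]+=value))
done <<'EOF'
aa 1
bb 2
ee 4
aa 1
bb 2
dd 4
EOF
# List all keys and totals, sorted by key.
result=$(for i in "${!a[@]}"; do echo "$i ${a[$i]}"; done | sort)
echo "$result"
```

With this input, aa and bb are summed across both "files" while dd and ee keep their single values.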
Here is one way using GNU awk. Run it as:
awk -f script.awk File1 File2 File3
Contents of script.awk:
sub(/RESULT_SET_/,"") {
    i = $1
    next
}

{
    a[i][$1] += $2
}

END {
    for (j=1; j<=length(a); j++) {
        print "RESULT_SET_" j
        for (k in a[j]) {
            print k, a[j][k]
        }
    }
}
Results:
RESULT_SET_1
aa 3
RESULT_SET_2
bb 6
RESULT_SET_3
cc 9
RESULT_SET_4
ee 4
ff 4
gg 4
Alternatively, here's a one-liner:
awk 'sub(/RESULT_SET_/,"") { i = $1; next } { a[i][$1]+=$2 } END { for (j=1;j<=length(a);j++) { print "RESULT_SET_" j; for (k in a[j]) print k, a[j][k] } }' File1 File2 File3
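The a[i][$1] arrays of arrays above need GNU awk 4.0 or later. On a plain POSIX awk, a sketch of the same idea uses a flat array keyed with SUBSEP instead (the two small sample files here are created only for the demo):

```shell
tmp=$(mktemp -d)
printf 'RESULT_SET_1\naa 1\nRESULT_SET_2\nbb 2\n' > "$tmp/File1"
printf 'RESULT_SET_1\naa 1\nRESULT_SET_2\nbb 2\n' > "$tmp/File2"

# Flat array keyed "section SUBSEP key" instead of a[i][k].
result=$(awk '
    sub(/RESULT_SET_/,"") { i = $1; if (i > n) n = i; next }
    { a[i, $1] += $2 }
    END {
        for (j = 1; j <= n; j++) {
            print "RESULT_SET_" j
            for (key in a) {
                split(key, p, SUBSEP)
                if (p[1] == j) print p[2], a[key]
            }
        }
    }' "$tmp/File1" "$tmp/File2")
echo "$result"

rm -rf "$tmp"
```

The trade-off is the inner loop over all keys per section; for log-summary sizes that is negligible.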
Fixed with this. Basically, it goes through every file; if an entry exists in one file but not the other, it inserts that entry with a 0 value at the matching line number, so that the summing pass can pick it up. Tested this on my current output and it seems to be working really well.
#!/bin/bash
files="file1 file2 file3 file4 file5 file6 file7 file8"
RAND="$$"
i=0
oldname=""
for names in $files; do
    ((i++))
    if [ $i == 1 ]; then
        oldname=$names
        shift
    else
        oldname1=$names.$RAND
        # Find keys in $names that are missing from $oldname; emit "line-key%0"
        for entries in $(awk 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] == "" && $2 ~ /[0-9]/) print FNR"-"$1"%0" }' $oldname $names); do
            line=${entries%%-*}
            content=${entries#*-}
            content=$(echo $content | tr "%" " ")
            # Insert the missing entry (with a 0 value) after that line
            ed -s $oldname >/dev/null 2>&1 << EOF
$line
a
$content
.
w
q
EOF
        done
        awk 'NR==FNR { _[$1]=$2 } NR!=FNR { if (_[$1] != "") { if ($2 ~ /[0-9]/) { nn=($2+_[$1]); print $1" "nn } else { print $1 } } else { print } }' $names $oldname > $oldname1
        oldname=$oldname1
    fi
done
cat $oldname
#rm file?.*