Joining log file records that share a timestamp
I have a file of records sorted by timestamp, but with multiple records per timestamp, each on a separate line. I would like to concatenate all records with the same timestamp onto one line. The timestamp is column 1.
The input file might look like this:
Time,Tag,Value
1,ABC,3
2,ABC,2.7
2,DEF,3.4
3,ABC,2.8
3,DEF,3.6
3,GHI,2.99
3,JKL,3.01
4,ABC,3.42
4,DEF,3.62
4,JKL,3.82
The desired result would be similar to this (option 1):
Time,Tag,Value
1,ABC,3
2,ABC,2.7,DEF,3.4
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01
4,ABC,3.42,DEF,3.62,JKL,3.82
And this would be even better (option 2):
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
I am guessing that I can get to option 1 by writing a script with a loop. First, I would need to get a unique list of all "Tag" values to determine how many iterations to perform.
But I also assume that:
1) even in bash this could be costly for long files, and 2) there is perhaps an even more elegant way to do this.
Newbie question. All help is appreciated. Thanks!
Assuming your data is in chronological order, you can use this awk solution:
parse.awk
# Use comma as input and output field separator
BEGIN { FS = OFS = "," }

# Print the header and skip to the next line
NR == 1 { print; next }

# If the previous timestamp matches the current one, append tag and value
pt == $1 {
    tag = tag "|" $2
    val = val "|" $3
}

# If this is not the first data line and the timestamp changed, print the
# accumulated record
NR != 2 && pt != $1 { print pt, tag, val }

# On a new timestamp, save it and reset the accumulator variables
pt != $1 {
    pt = $1
    tag = $2
    val = $3
}

# Flush the last record
END { print pt, tag, val }
Run it like this:
awk -f parse.awk infile
Output:
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
Or as a one-liner:
<infile awk 'BEGIN {FS=OFS=","} NR==1{print;next} pt==$1 {tag=tag"|"$2;val=val"|"$3} NR!=2&&pt!=$1 {print pt,tag,val} pt!=$1 {pt=$1;tag=$2;val=$3} END {print pt,tag,val}'
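A quick way to sanity-check the one-liner is to recreate the sample file from the question and run it (a sketch; the file name infile is the same one used above):

```shell
# Recreate the sample input from the question.
cat > infile <<'EOF'
Time,Tag,Value
1,ABC,3
2,ABC,2.7
2,DEF,3.4
3,ABC,2.8
3,DEF,3.6
3,GHI,2.99
3,JKL,3.01
4,ABC,3.42
4,DEF,3.62
4,JKL,3.82
EOF

# Run the one-liner; the output should match option 2 from the question.
<infile awk 'BEGIN {FS=OFS=","} NR==1{print;next} pt==$1 {tag=tag"|"$2;val=val"|"$3} NR!=2&&pt!=$1 {print pt,tag,val} pt!=$1 {pt=$1;tag=$2;val=$3} END {print pt,tag,val}'
```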
This would work:
awk -F, '{if($1 in a){split(a[$1],t,","); a[$1]=t[1]"|"$2","t[2]"|"$3}
else a[$1]=$2","$3}END{for(x in a)print x","a[x]}' file | sort -n
(The trailing sort -n already restores chronological order, so gawk's asort is not needed here; it would also discard the timestamp keys.) With your example:
kent$ awk -F, '{if($1 in a){split(a[$1],t,","); a[$1]=t[1]"|"$2","t[2]"|"$3}
else a[$1]=$2","$3}END{for(x in a)print x","a[x]}' file | sort -n
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
NEW ANSWER:
I realize my previous answer may be difficult to read and understand, especially for beginners. It does, however, make good use of gawk's array-sorting functions, which would be very helpful when working with the unique "Tag" values you mention in your question. After reading some of the comments, though, I believe I may have slightly misunderstood your question. Here's a method that doesn't care about the uniqueness of the tags or their values; it simply joins them all. It should also be very readable and scalable. Run it like this:
awk -f script.awk file
Contents of script.awk:
BEGIN {
    FS = OFS = ","
}

NR == 1 {
    print
    next
}

{
    tag[$1] = (tag[$1] ? tag[$1] "|" : "") $2
    val[$1] = (val[$1] ? val[$1] "|" : "") $3
}

END {
    for (i in tag) {
        print i, tag[i], val[i] | "sort -n"
    }
}
Results:
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
Alternatively, here's a one-liner:
awk -F, 'NR==1 { print; next } { tag[$1]=(tag[$1] ? tag[$1] "|" : "") $2; val[$1]=(val[$1] ? val[$1] "|" : "") $3 } END { for (i in tag) print i, tag[i], val[i] | "sort -n" }' OFS=, file
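One nice property of this array-based version is that it does not require timestamps to be grouped in the input: rows are accumulated per timestamp and ordered only at the end by sort -n. A sketch, with the timestamp-3 block deliberately moved ahead of the timestamp-2 block:

```shell
# Sample data with the timestamp blocks out of order.
cat > scrambled <<'EOF'
Time,Tag,Value
1,ABC,3
3,ABC,2.8
3,DEF,3.6
3,GHI,2.99
3,JKL,3.01
2,ABC,2.7
2,DEF,3.4
4,ABC,3.42
4,DEF,3.62
4,JKL,3.82
EOF

# Same one-liner as above; the output still comes out in timestamp order.
awk -F, 'NR==1 { print; next } { tag[$1]=(tag[$1] ? tag[$1] "|" : "") $2; val[$1]=(val[$1] ? val[$1] "|" : "") $3 } END { for (i in tag) print i, tag[i], val[i] | "sort -n" }' OFS=, scrambled
```

Within one timestamp, tags still appear in input order, so the records themselves must stay in order inside each block.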
PREVIOUS ANSWER:
Here is one way using GNU awk. Run it like this:
awk -f script.awk file
Contents of script.awk:
BEGIN {
    FS = OFS = ","
}

NR == 1 {
    print
    next
}

{
    a[$1][$2] = $3
}

END {
    for (i in a) {
        b[x++] = i
    }
    n = asort(b)
    for (j=1; j<=n; j++) {
        m = asorti(a[b[j]], c)
        for (k=1; k<=m; k++) {
            s = (s ? s "|" : "") c[k]
            r = (r ? r "|" : "") a[b[j]][c[k]]
        }
        print b[j], s, r
        s = r = ""
    }
}
Results:
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
Alternatively, here's a one-liner:
awk -F, 'NR==1 { print; next } { a[$1][$2]=$3 } END { for (i in a) b[x++] = i; n = asort(b); for (j=1;j<=n;j++) { m = asorti(a[b[j]],c); for (k=1;k<=m;k++) { s = (s ? s "|" : "") c[k]; r = (r ? r "|" : "") a[b[j]][c[k]] } print b[j], s, r; s = r = "" } }' OFS=, file
bash is probably the wrong tool for this. Try Python:
import fileinput
import sys

oldTime = None
for line in fileinput.input():
    line = line.strip()
    pos = line.find(',')
    time = line[0:pos]
    if oldTime == time:
        sys.stdout.write(',')
        sys.stdout.write(line[pos+1:])
    else:
        if oldTime is not None:
            sys.stdout.write('\n')
        sys.stdout.write(line)
        oldTime = time
sys.stdout.write('\n')
If this is an operation you would like to repeat often, I would go with a utility script written in a more full-featured scripting language. You can then call the script from your own bash script, or use it directly on the command line as needed.
Here's a Python example:
#!/usr/bin/env python
# --- merge_groups.py ----
import fileinput, operator, itertools
lines = (line.strip() for line in fileinput.input())
data = (line.split(",") for line in lines if line)
for key, group in itertools.groupby(data, operator.itemgetter(0)):
_, label, value = zip(*group)
    print("%s,%s,%s" % (key, "|".join(label), "|".join(value)))
Note that the script assumes that entries with the same timestamp are already grouped.
You can use a script to process existing data files or pipe data directly to it, for example:
[me@home]$ ./merge_groups.py data.txt # parse existing data file
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
[me@home]$ cat data.txt | ./merge_groups.py # post-process command output
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
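Since the script assumes that entries with the same timestamp are already grouped, ungrouped data needs a pre-sort that leaves the header line in place. A minimal sketch of that idiom (the file name unsorted.txt is my own):

```shell
# Sample data whose body is out of order.
cat > unsorted.txt <<'EOF'
Time,Tag,Value
3,ABC,2.8
1,ABC,3
2,ABC,2.7
EOF

# Pass the header through untouched; numerically sort the body on field 1.
{ head -n 1 unsorted.txt; tail -n +2 unsorted.txt | sort -t, -k1,1n; }
```

The sorted stream can then be piped straight into the script.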
@AaronDigulla and @Kent have some good solutions, but if you want to go with bash, here is one:
for ts in $(cut --delimiter="," --fields=1 inputfile | uniq)
do
    p1=""
    p2=""
    for line in $(grep "^${ts}," inputfile | cut --delimiter="," --fields=2-)
    do
        f1=$(echo ${line} | cut --delimiter="," --fields=1)
        f2=$(echo ${line} | cut --delimiter="," --fields=2)
        p1=${p1}"|"${f1}
        p2=${p2}"|"${f2}
    done
    echo ${ts}","${p1#?}","${p2#?}
done
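The loop above re-runs grep and cut for every timestamp and every line, which gets expensive on long files. If you want to stay in the shell, a single-pass sketch with read avoids the subprocesses entirely (it assumes the input is sorted by timestamp, as in the question; inputfile and merged.txt are my own file names):

```shell
# Recreate a small sample input for the demo.
cat > inputfile <<'EOF'
Time,Tag,Value
1,ABC,3
2,ABC,2.7
2,DEF,3.4
EOF

# One pass over the file: accumulate tags/values until the timestamp changes.
{
  IFS= read -r header && printf '%s\n' "$header"
  prev='' tags='' vals=''
  while IFS=, read -r ts tag val; do
    if [ "$ts" = "$prev" ]; then
      tags="$tags|$tag"
      vals="$vals|$val"
    else
      [ -n "$prev" ] && printf '%s,%s,%s\n' "$prev" "$tags" "$vals"
      prev=$ts
      tags=$tag
      vals=$val
    fi
  done
  # Flush the last accumulated record.
  [ -n "$prev" ] && printf '%s,%s,%s\n' "$prev" "$tags" "$vals"
} < inputfile > merged.txt
cat merged.txt
```

This produces the option-2 format directly, with one fork per run instead of several per input line.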
Just in case you want another awk solution:
function read() {
    split($0, buf, ",")
}

function write() {
    for (i = 1; i < length(buf); i++) {
        printf "%s,", buf[i]
    }
    print buf[length(buf)]
}

BEGIN {
    FS = ","
}

NR == 1 {
    print
    next
}

NR == 2 {
    time = $1    # remember the first timestamp, or a repeat would be missed
    read()
    next
}

{
    if ($1 != time) {    # new time
        time = $1
        write()
        read()
    } else {             # repeated time
        for (i = 2; i <= NF; i++) {
            buf[i] = buf[i] "|" $i
        }
    }
}

END {
    write()
}
I'm not that good with awk, so I had to emphasize readability!
Perl is not yet represented:
use strict;
use warnings;

my $skip_header = <>;
my %d;
while (<>) {
    s/\s+$//;
    my ($no, $k, $v) = split ",";
    push @{ $d{ int($no) } }, [ $k, $v ];
}

foreach my $no (sort { $a <=> $b } keys %d) {
    print $no, ",";
    print join("|", map { $_->[0] } @{ $d{$no} });
    print ",";
    print join("|", map { $_->[1] } @{ $d{$no} });
    print "\n";
}
gives:
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
The first gives option 1 (sort -n keeps timestamps of 10 and above in order):
> awk -F, '{a[$1]=a[$1]","$2","$3}END{for(i in a)print i","substr(a[i],2)}' temp | sort -n
1,ABC,3
2,ABC,2.7,DEF,3.4
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01
4,ABC,3.42,DEF,3.62,JKL,3.82
The second gives option 2:
> awk -F, '{a[$1]=a[$1]"|"$2;b[$1]=b[$1]"|"$3}END{for(i in a)print i","substr(a[i],2)","substr(b[i],2)}' temp | sort -n
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82