Merging log file records that share a timestamp

I have a file containing records sorted by timestamp, with multiple records per timestamp, each carrying a different tag. I would like to concatenate all records that share a timestamp onto one line. The timestamp is column 1.

The input file looks like this:

Time,Tag,Value  
1,ABC,3  
2,ABC,2.7  
2,DEF,3.4  
3,ABC,2.8  
3,DEF,3.6  
3,GHI,2.99  
3,JKL,3.01  
4,ABC,3.42  
4,DEF,3.62  
4,JKL,3.82  


The desired result would look like this (option 1):

Time,Tag,Value  
1,ABC,3  
2,ABC,2.7,DEF,3.4  
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01  
4,ABC,3.42,DEF,3.62,JKL,3.82  


Even better would be this (option 2):

1,ABC,3  
2,ABC|DEF,2.7|3.4  
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01  
4,ABC|DEF|JKL,3.42|3.62|3.82  


I am guessing that I can get to option 1 by writing a script with a loop. First, I would need a unique list of all "Tag" values to determine how many iterations to perform.

But I also assume that:

1) even in bash this could be costly for long files, and
2) there is perhaps a more elegant way to do it.

Newbie question; all help is appreciated. Thanks!

12 replies


Assuming your data is in chronological order, you can use this awk solution:

parse.awk

# Use comma as input and output field separators
BEGIN { FS = OFS = "," }

# Print header and skip to next line
NR == 1 { print; next }

# If previous timestamp is the same as current append tag and value
pt == $1 {
  tag = tag "|" $2
  val = val "|" $3
}

# If not the first data line and timestamps are not equal then print
NR != 2 && pt != $1 { print pt, tag, val }

# Save previous timestamp and reset accumulator variables    
pt != $1 {
  pt  = $1
  tag = $2
  val = $3
}

END { print pt, tag, val }


Run it like this:

awk -f parse.awk infile




Output:

Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82


Or as a one-liner:

<infile awk 'BEGIN {FS=OFS=","} NR==1{print;next} pt==$1 {tag=tag"|"$2;val=val"|"$3} NR!=2&&pt!=$1 {print pt,tag,val} pt!=$1 {pt=$1;tag=$2;val=$3} END {print pt,tag,val}'


This would work (note that gawk's `asort()` would replace the timestamp keys with integer indices and break the `x","a[x]` pairing, so the ordering is left to `sort -n` instead):

awk -F, '{if($1 in a){ split(a[$1],t,","); a[$1]=t[1]"|"$2","t[2]"|"$3
}else a[$1]=$2","$3;}END{for(x in a)print x","a[x]}' file|sort -n




with your example:

kent$  awk -F, '{if($1 in a){split(a[$1],t,","); a[$1]=t[1]"|"$2","t[2]"|"$3
}else a[$1]=$2","$3;}END{for(x in a)print x","a[x]}' file|sort -n
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82


NEW ANSWER:

I understand that my previous answer can be difficult to read and understand, especially for beginners. It does, however, make good use of gawk's array-sorting functions, which would be very helpful for working with the unique "Tag" values you mention in your question. After reading some of the comments, though, I believe I may have misunderstood your question slightly. Here's a method that doesn't care about the uniqueness of the tags or their values; it simply joins them all. It should also be very readable and scalable. Run it like this:

awk -f script.awk file


Contents of script.awk:

BEGIN {
    FS=OFS=","
}

NR==1 {
    print
    next
}

{
    tag[$1]=(tag[$1] ? tag[$1] "|" : "") $2
    val[$1]=(val[$1] ? val[$1] "|" : "") $3
}

END {
    for (i in tag) {
        print i, tag[i], val[i] | "sort -n"
    }
}


Results:

Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82


Alternatively, here's a one-liner:

awk -F, 'NR==1 { print; next } { tag[$1]=(tag[$1] ? tag[$1] "|" : "") $2; val[$1]=(val[$1] ? val[$1] "|" : "") $3 } END { for (i in tag) print i, tag[i], val[i] | "sort -n" }' OFS=, file



PREVIOUS ANSWER:

Here is one way using GNU awk. Run it like this:

awk -f script.awk file


Contents of script.awk:

BEGIN {
    FS=OFS=","
}

NR==1 {
    print
    next
}

{
    a[$1][$2]=$3
}

END {

    for (i in a) {
        b[x++] = i
    }

    n = asort(b)

    for (j=1;j<=n;j++) {

        m = asorti(a[b[j]],c)

        for (k=1;k<=m;k++) {

            s = (s ? s "|" : "") c[k]
            r = (r ? r "|" : "") a[b[j]][c[k]]
        }

        print b[j], s, r
        s = r = ""
    }
}


Results:

Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82


Alternatively, here's a one-liner:

awk -F, 'NR==1 { print; next } { a[$1][$2]=$3 } END { for (i in a) b[x++] = i; n = asort(b); for (j=1;j<=n;j++) { m = asorti(a[b[j]],c); for (k=1;k<=m;k++) { s = (s ? s "|" : "") c[k]; r = (r ? r "|" : "") a[b[j]][c[k]] } print b[j], s, r; s = r = "" } }' OFS=, file


bash is probably the wrong tool for this. Try Python:

import fileinput
import sys

oldTime = None
for line in fileinput.input():
    line = line.strip()
    pos = line.find(',')
    time = line[0:pos]
    if oldTime == time:
        sys.stdout.write(',')
        sys.stdout.write(line[pos+1:])
    else:
        if oldTime is not None:
            sys.stdout.write('\n')
        sys.stdout.write(line)

    oldTime = time

sys.stdout.write('\n')


If this is an operation you'll want to repeat often, I would opt for a utility script written in a more full-featured scripting language. You can then call that script from your own bash scripts, or use it directly on the command line as needed.

Here's a Python example:

#!/usr/bin/env python
# --- merge_groups.py ----
import fileinput, operator, itertools
lines = (line.strip() for line in fileinput.input())
data = (line.split(",") for line in lines if line)
for key, group in itertools.groupby(data, operator.itemgetter(0)):
  _, label, value = zip(*group)
  print("%s,%s,%s" % (key, "|".join(label), "|".join(value)))


Note that the script assumes that entries with the same timestamp are already grouped.
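If that assumption doesn't hold, a dictionary-based variant can collect the groups regardless of input order. This is a rough sketch; the function name and the inline sample data are just for illustration:

```python
import csv
import io

def merge_groups_unsorted(lines):
    """Join tags and values per timestamp without requiring grouped input."""
    reader = csv.reader(lines)
    header = next(reader)                  # keep "Time,Tag,Value"
    tags, vals = {}, {}
    for time, tag, value in reader:
        tags.setdefault(time, []).append(tag)
        vals.setdefault(time, []).append(value)
    yield ",".join(header)
    for time in sorted(tags, key=float):   # numeric order on timestamps
        yield "%s,%s,%s" % (time, "|".join(tags[time]), "|".join(vals[time]))

# Deliberately ungrouped sample input
sample = "Time,Tag,Value\n2,DEF,3.4\n1,ABC,3\n2,ABC,2.7\n"
print("\n".join(merge_groups_unsorted(io.StringIO(sample))))
# Time,Tag,Value
# 1,ABC,3
# 2,DEF|ABC,3.4|2.7
```

Within a timestamp, tags stay in the order they were first seen; sort the input beforehand if you need them ordered.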

You can use a script to process existing data files or pipe data directly to it, for example:

[me@home]$ ./merge_groups.py data.txt  # parse existing data file
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82

[me@home]$ cat data.txt | ./merge_groups.py  # post-process command output
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82


@AaronDigulla and @Kent have some good solutions, but if you have to (or would prefer to) stay in bash, here is one:

for ts in `cat inputfile | cut --delimiter="," --fields=1 | uniq`
do
  p1="";
  p2="";
  for line in `grep "^${ts}," inputfile | cut --delimiter="," --fields=2-`
  do
    f1=`echo ${line} | cut --delimiter="," --fields=1`;
    f2=`echo ${line} | cut --delimiter="," --fields=2`;
    p1=${p1}"|"$f1;
    p2=${p2}"|"$f2;
  done
  echo ${ts}","${p1#?}","${p2#?};
done


Just in case you want another awk solution:

function read() {
  split($0, buf, ",")
}

function write() {
  for (i = 1; i < length(buf); i++) {
    printf "%s,", buf[i]
  }
  print buf[length(buf)]
}

BEGIN {
  FS = ","
}

NR == 1 {
  print
  next
}

NR == 2 {
  read()
  time = $1
  next
}

{
  if ($1 != time) { # new time                                                                                                                                                                   
    time = $1
    write()
    read()
  } else { # repeated time                                                                                                                                                                       
    for (i = 2; i <= NF; i++) {
      buf[i] = buf[i] "|" $i
    }
  }
}

END {
  write()
}


I'm not that good with awk, so I had to emphasize readability!


Hmm, you did say "all help", so presumably that includes a Ruby solution?

require 'csv'

puts(CSV.read('f.csv').group_by(&:first).map do |k, v|
  t = v.transpose
  [k, t[1].join('|'), t[2].join('|')].join(',')
end.drop(1))


Perl isn't represented yet, so:

use strict;
my $skip_header = <>;
my %d;
while(<>) {
    s/\s+$//;
    my ($no, $k, $v )  = split ",";
    push @{$d{int($no)}}, [ $k,  $v ];
}
END {
    foreach my $no (sort { $a <=> $b } keys %d  )  {
        print $no, ",";
        print join("|", map { $_->[0] } @{$d{$no}});
        print ",";
        print join("|", map { $_->[1] } @{$d{$no}});
        print "\n";
    }
}


gives:

1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82


For the first option, you can try:

awk -F, 'p x!=$1{if(p x)print s; p=s=$1} {sub($1,x); s=s $0} END{print s}' file

      

The first option:

> awk -F, '{a[$1]=a[$1]","$2","$3}END{for(i in a)print i","substr(a[i],2)}' temp | sort
1,ABC,3
2,ABC,2.7,DEF,3.4
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01
4,ABC,3.42,DEF,3.62,JKL,3.82


The second option:

> awk -F, '{a[$1]=a[$1]"|"$2;b[$1]=b[$1]"|"$3}END{for(i in a)print i","substr(a[i],2)","substr(b[i],2)}' temp | sort
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82


In short, the sed way (this produces option 1):
sed -ne ':a;$!N;/^\([0-9]\+\),.*\n\1,/s/\n[0-9]*//;ta;P;D'
Time,Tag,Value
1,ABC,3
2,ABC,2.7,DEF,3.4
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01
4,ABC,3.42,DEF,3.62,JKL,3.82
