Joining log file records that share a timestamp
I have a file of records sorted by timestamp, but with multiple records per timestamp, each on a separate line. I would like to concatenate all records with the same timestamp onto one line. The timestamp is column 1.
The input file might look like this:
Time,Tag,Value
1,ABC,3
2,ABC,2.7
2,DEF,3.4
3,ABC,2.8
3,DEF,3.6
3,GHI,2.99
3,JKL,3.01
4,ABC,3.42
4,DEF,3.62
4,JKL,3.82
The desired result would be similar to this (option 1):
Time,Tag,Value
1,ABC,3
2,ABC,2.7,DEF,3.4
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01
4,ABC,3.42,DEF,3.62,JKL,3.82
And this would be even better (option 2):
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
I am guessing that I can get to option 1 by writing a script with a loop. First, I would need to get a unique list of all "Tag" values to determine how many iterations to perform.
But I also assume that:
1) even in bash this could be costly for long files, and 2) there is perhaps an even more elegant way to do this.
Newbie question. All help is appreciated. Thanks!
Assuming your data is in chronological order, you can use this awk solution:
parse.awk
# Use comma as input and output field separator
BEGIN { FS = OFS = "," }

# Print the header and skip to the next line
NR == 1 { print; next }

# If the previous timestamp matches the current one, append tag and value
pt == $1 {
    tag = tag "|" $2
    val = val "|" $3
}

# If this is not the first data line and the timestamp changed, print the
# accumulated record
NR != 2 && pt != $1 { print pt, tag, val }

# On a new timestamp, save it and reset the accumulator variables
pt != $1 {
    pt = $1
    tag = $2
    val = $3
}

# Flush the last record
END { print pt, tag, val }
Run it like this:
awk -f parse.awk infile
Output:
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
Or as a one-liner:
<infile awk 'BEGIN {FS=OFS=","} NR==1{print;next} pt==$1 {tag=tag"|"$2;val=val"|"$3} NR!=2&&pt!=$1 {print pt,tag,val} pt!=$1 {pt=$1;tag=$2;val=$3} END {print pt,tag,val}'
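A quick way to sanity-check the one-liner is to recreate the sample file from the question and run it (a sketch; the file name infile is the same one used above):

```shell
# Recreate the sample input from the question.
cat > infile <<'EOF'
Time,Tag,Value
1,ABC,3
2,ABC,2.7
2,DEF,3.4
3,ABC,2.8
3,DEF,3.6
3,GHI,2.99
3,JKL,3.01
4,ABC,3.42
4,DEF,3.62
4,JKL,3.82
EOF

# Run the one-liner; the output should match option 2 from the question.
<infile awk 'BEGIN {FS=OFS=","} NR==1{print;next} pt==$1 {tag=tag"|"$2;val=val"|"$3} NR!=2&&pt!=$1 {print pt,tag,val} pt!=$1 {pt=$1;tag=$2;val=$3} END {print pt,tag,val}'
```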
This would work:
awk -F, '{if($1 in a){split(a[$1],t,","); a[$1]=t[1]"|"$2","t[2]"|"$3}
else a[$1]=$2","$3}END{for(x in a)print x","a[x]}' file | sort -n
(The trailing sort -n already restores chronological order, so gawk's asort is not needed here; it would also discard the timestamp keys.) With your example:
kent$ awk -F, '{if($1 in a){split(a[$1],t,","); a[$1]=t[1]"|"$2","t[2]"|"$3}
else a[$1]=$2","$3}END{for(x in a)print x","a[x]}' file | sort -n
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
NEW ANSWER:
I realize my previous answer may be difficult to read and understand, especially for beginners. It does, however, make good use of gawk's array-sorting functions, which would be very helpful when working with the unique "Tag" values you mention in your question. After reading some of the comments, though, I believe I may have slightly misunderstood your question. Here's a method that doesn't care about the uniqueness of the tags or their values; it simply joins them all. It should also be very readable and scalable. Run it like this:
awk -f script.awk file
Contents of script.awk:
BEGIN {
    FS = OFS = ","
}

NR == 1 {
    print
    next
}

{
    tag[$1] = (tag[$1] ? tag[$1] "|" : "") $2
    val[$1] = (val[$1] ? val[$1] "|" : "") $3
}

END {
    for (i in tag) {
        print i, tag[i], val[i] | "sort -n"
    }
}
Results:
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
Alternatively, here's a one-liner:
awk -F, 'NR==1 { print; next } { tag[$1]=(tag[$1] ? tag[$1] "|" : "") $2; val[$1]=(val[$1] ? val[$1] "|" : "") $3 } END { for (i in tag) print i, tag[i], val[i] | "sort -n" }' OFS=, file
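One nice property of this array-based version is that it does not require timestamps to be grouped in the input: rows are accumulated per timestamp and ordered only at the end by sort -n. A sketch, with the timestamp-3 block deliberately moved ahead of the timestamp-2 block:

```shell
# Sample data with the timestamp blocks out of order.
cat > scrambled <<'EOF'
Time,Tag,Value
1,ABC,3
3,ABC,2.8
3,DEF,3.6
3,GHI,2.99
3,JKL,3.01
2,ABC,2.7
2,DEF,3.4
4,ABC,3.42
4,DEF,3.62
4,JKL,3.82
EOF

# Same one-liner as above; the output still comes out in timestamp order.
awk -F, 'NR==1 { print; next } { tag[$1]=(tag[$1] ? tag[$1] "|" : "") $2; val[$1]=(val[$1] ? val[$1] "|" : "") $3 } END { for (i in tag) print i, tag[i], val[i] | "sort -n" }' OFS=, scrambled
```

Within one timestamp, tags still appear in input order, so the records themselves must stay in order inside each block.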
PREVIOUS ANSWER:
Here is one way using GNU awk. Run it like this:
awk -f script.awk file
Contents of script.awk:
BEGIN {
    FS = OFS = ","
}

NR == 1 {
    print
    next
}

{
    a[$1][$2] = $3
}

END {
    for (i in a) {
        b[x++] = i
    }
    n = asort(b)
    for (j=1; j<=n; j++) {
        m = asorti(a[b[j]], c)
        for (k=1; k<=m; k++) {
            s = (s ? s "|" : "") c[k]
            r = (r ? r "|" : "") a[b[j]][c[k]]
        }
        print b[j], s, r
        s = r = ""
    }
}
Results:
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
Alternatively, here's a one-liner:
awk -F, 'NR==1 { print; next } { a[$1][$2]=$3 } END { for (i in a) b[x++] = i; n = asort(b); for (j=1;j<=n;j++) { m = asorti(a[b[j]],c); for (k=1;k<=m;k++) { s = (s ? s "|" : "") c[k]; r = (r ? r "|" : "") a[b[j]][c[k]] } print b[j], s, r; s = r = "" } }' OFS=, file
bash is probably the wrong tool for this. Try Python:
import fileinput
import sys

oldTime = None
for line in fileinput.input():
    line = line.strip()
    pos = line.find(',')
    time = line[0:pos]
    if oldTime == time:
        sys.stdout.write(',')
        sys.stdout.write(line[pos+1:])
    else:
        if oldTime is not None:
            sys.stdout.write('\n')
        sys.stdout.write(line)
        oldTime = time
sys.stdout.write('\n')
If this is an operation you would like to repeat often, I would go with a utility script written in a more full-featured scripting language. You can then call the script from your own bash script, or use it directly on the command line as needed.
Here's a Python example:
#!/usr/bin/env python
# --- merge_groups.py ----
import fileinput, operator, itertools
lines = (line.strip() for line in fileinput.input())
data = (line.split(",") for line in lines if line)
for key, group in itertools.groupby(data, operator.itemgetter(0)):
_, label, value = zip(*group)
    print("%s,%s,%s" % (key, "|".join(label), "|".join(value)))
Note that the script assumes that entries with the same timestamp are already grouped.
You can use a script to process existing data files or pipe data directly to it, for example:
[me@home]$ ./merge_groups.py data.txt # parse existing data file
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
[me@home]$ cat data.txt | ./merge_groups.py # post-process command output
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
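Since the script assumes that entries with the same timestamp are already grouped, ungrouped data needs a pre-sort that leaves the header line in place. A minimal sketch of that idiom (the file name unsorted.txt is my own):

```shell
# Sample data whose body is out of order.
cat > unsorted.txt <<'EOF'
Time,Tag,Value
3,ABC,2.8
1,ABC,3
2,ABC,2.7
EOF

# Pass the header through untouched; numerically sort the body on field 1.
{ head -n 1 unsorted.txt; tail -n +2 unsorted.txt | sort -t, -k1,1n; }
```

The sorted stream can then be piped straight into the script.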
@AaronDigulla and @Kent have some good solutions, but if you want to go with bash, here is one:
for ts in $(cut --delimiter="," --fields=1 inputfile | uniq)
do
    p1=""
    p2=""
    for line in $(grep "^${ts}," inputfile | cut --delimiter="," --fields=2-)
    do
        f1=$(echo ${line} | cut --delimiter="," --fields=1)
        f2=$(echo ${line} | cut --delimiter="," --fields=2)
        p1=${p1}"|"${f1}
        p2=${p2}"|"${f2}
    done
    echo ${ts}","${p1#?}","${p2#?}
done
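The loop above re-runs grep and cut for every timestamp and every line, which gets expensive on long files. If you want to stay in the shell, a single-pass sketch with read avoids the subprocesses entirely (it assumes the input is sorted by timestamp, as in the question; inputfile and merged.txt are my own file names):

```shell
# Recreate a small sample input for the demo.
cat > inputfile <<'EOF'
Time,Tag,Value
1,ABC,3
2,ABC,2.7
2,DEF,3.4
EOF

# One pass over the file: accumulate tags/values until the timestamp changes.
{
  IFS= read -r header && printf '%s\n' "$header"
  prev='' tags='' vals=''
  while IFS=, read -r ts tag val; do
    if [ "$ts" = "$prev" ]; then
      tags="$tags|$tag"
      vals="$vals|$val"
    else
      [ -n "$prev" ] && printf '%s,%s,%s\n' "$prev" "$tags" "$vals"
      prev=$ts
      tags=$tag
      vals=$val
    fi
  done
  # Flush the last accumulated record.
  [ -n "$prev" ] && printf '%s,%s,%s\n' "$prev" "$tags" "$vals"
} < inputfile > merged.txt
cat merged.txt
```

This produces the option-2 format directly, with one fork per run instead of several per input line.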
Just in case you want another awk solution:
function read() {
    split($0, buf, ",")
}

function write() {
    for (i = 1; i < length(buf); i++) {
        printf "%s,", buf[i]
    }
    print buf[length(buf)]
}

BEGIN {
    FS = ","
}

NR == 1 {
    print
    next
}

NR == 2 {
    time = $1    # remember the first timestamp, or a repeat would be missed
    read()
    next
}

{
    if ($1 != time) {    # new time
        time = $1
        write()
        read()
    } else {             # repeated time
        for (i = 2; i <= NF; i++) {
            buf[i] = buf[i] "|" $i
        }
    }
}

END {
    write()
}
I'm not that good with awk, so I had to emphasize readability!
Perl is not yet represented:
use strict;
use warnings;

my $skip_header = <>;
my %d;
while (<>) {
    s/\s+$//;
    my ($no, $k, $v) = split ",";
    push @{ $d{ int($no) } }, [ $k, $v ];
}

foreach my $no (sort { $a <=> $b } keys %d) {
    print $no, ",";
    print join("|", map { $_->[0] } @{ $d{$no} });
    print ",";
    print join("|", map { $_->[1] } @{ $d{$no} });
    print "\n";
}
gives:
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
The first gives option 1 (sort -n keeps timestamps of 10 and above in order):
> awk -F, '{a[$1]=a[$1]","$2","$3}END{for(i in a)print i","substr(a[i],2)}' temp | sort -n
1,ABC,3
2,ABC,2.7,DEF,3.4
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01
4,ABC,3.42,DEF,3.62,JKL,3.82
The second gives option 2:
> awk -F, '{a[$1]=a[$1]"|"$2;b[$1]=b[$1]"|"$3}END{for(i in a)print i","substr(a[i],2)","substr(b[i],2)}' temp | sort -n
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82