Finding specific fields in lines in a file

Question

I have a file that contains data like:

0000380000000101
0000650000000201
0000650000000301
0000650000000401
0001000000000101
0001000000000201

... and so on. I want to process this data so that the output is:

000065 0000000201 0000000301 0000000401  
000100 0000000101 0000000201

As 000065 repeats three times, I want 000065 to appear only once in the output, followed by the trailing bytes of each entry in which it appears. Since 000038 occurs only once, I don't want it in the output. In this example the leading data (i.e. 000065 or 000038) is 3 bytes, although it can be of any length, whereas the bytes after it (such as 0000000401) have a fixed length of 5 bytes. I would prefer to do this with shell scripts or C. Please let me know how I can do this. Can awk be useful here? Any help would be appreciated. Below is the data from the actual file that I want to process:

0000000000000101
0000000000000201
0000000000000301
0000000000000401
0000380000000101
0000650000000201
0000650000000301
0000650000000401
0001000000000101
0001000000000201
0001000000000301
0001000000000401
0038d30000000101
00652e0000000201
00652e0000000301
00652e0000000401
008d750000000101
008d750000000201
008d750000000301
008d750000000401
0100010000000101
0100010000000201
0100010000000301
0100010000000401
01008d0000000101
01008d0000000201
01008d0000000301
01008d0000000401
01a8c00000000101
01a8c00000000201
01a8c00000000301
01a8c00000000401
0264010000000101
0264010000000201
0264010000000301
0264010000000401
0615df0000000101
0615df0000000201
0615df0000000301
0615df0000000401
07dd940000000101
07dd940000000201
07dd940000000301
07dd940000000401
0900000000000101
0900000000000201
0900000000000301
0900000000000401
15dfc70000000101
15dfc70000000201
15dfc70000000301
15dfc70000000401
1ecf090000000101

+3

bash shell

mezda 13 Mar 12 at 12:31

5 answers

Your data is of a fixed width, so you can use gawk:

$ gawk -v FIELDWIDTHS='6 10' 'NR!=1 && x==$1""{printf(" %s", $2); next}; {x=$1""; printf("%s%s %s", NR==1?"":"\n", $1, $2)}; END{print ""}' input.txt | sed '/^[0-9a-f]* [0-9a-f]*$/d'
000000 0000000101 0000000201 0000000301 0000000401
000065 0000000201 0000000301 0000000401
000100 0000000101 0000000201 0000000301 0000000401
00652e 0000000201 0000000301 0000000401
008d75 0000000101 0000000201 0000000301 0000000401
010001 0000000101 0000000201 0000000301 0000000401
01008d 0000000101 0000000201 0000000301 0000000401
01a8c0 0000000101 0000000201 0000000301 0000000401
026401 0000000101 0000000201 0000000301 0000000401
0615df 0000000101 0000000201 0000000301 0000000401
07dd94 0000000101 0000000201 0000000301 0000000401
090000 0000000101 0000000201 0000000301 0000000401
15dfc7 0000000101 0000000201 0000000301 0000000401

FIELDWIDTHS    A white-space separated list of fieldwidths.  When set, gawk parses the input into fields of fixed width, instead of using  the  value
               of the FS variable as the field separator.
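FIELDWIDTHS is a gawk extension. Where only a POSIX awk is available, a similar result can be sketched with substr(). This is a rough sketch, not a drop-in replacement: it assumes (as in the sample) that the key is the first 6 characters, that lines sharing a key are adjacent, and that keys seen only once should be dropped. The names group_adjacent and emit are made up for this illustration.

```shell
# Hypothetical helper: group adjacent lines by their first 6 characters.
# Assumes equal keys are adjacent (the sample file is sorted).
group_adjacent() {
  awk '
    function emit() { if (n > 1) print key vals }   # skip keys seen only once
    {
      k = substr($0, 1, 6)                  # key: first 6 characters
      if (NR > 1 && k != key) { emit(); vals = ""; n = 0 }
      key = k
      vals = vals " " substr($0, 7)         # value: everything after the key
      n++
    }
    END { if (NR > 0) emit() }              # flush the final group
  ' "$@"
}
```

It can be called as group_adjacent input.txt, or fed from a pipe. Filtering inside awk avoids the extra sed pass, at the cost of hard-coding the key width.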

+4

kev 13 Mar 12 at 12:46

You can run this awk command (tested on Linux and Mac):

awk '{key=substr($0, 1, 6); val=substr($0, 7); arr[key]=(key in arr) ? arr[key] " " val : val; cnt[key]++}
END{for (a in arr) if (cnt[a]>1) print a, arr[a]}' file

OUTPUT (the order in which for (a in arr) visits keys is not guaranteed):

000065 0000000201 0000000301 0000000401
000100 0000000101 0000000201

+2

anubhava 13 Mar 12 at 13:45

First, run your data file through this:

awk '{suffixLen = 10; print substr($0, 1, length($0) - suffixLen)" "substr($0, length($0) - suffixLen + 1, length($0))}'

The suffixLen variable is the (fixed) number of characters in the trailing field: 5 bytes at 2 hex characters per byte = 10. This splits each input line into two space-separated fields.

Then run this through:

awk '{if ($1 in values) {values[$1] = values[$1]" "$2} else {values[$1] = $1" "$2}}END{for (v in values) print values[v]}'

Sorting the result correctly is left as an exercise for the reader.
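For what it's worth, the two steps above can be folded into one pass, with sort restoring key order afterwards. This is only a sketch: it assumes the value is always the last 10 characters of a line, and that keys appearing once should be skipped (as the question asks). The function name group_by_prefix is invented here.

```shell
# Hypothetical helper combining both steps in a single awk pass.
# Assumes every line ends in a 10-character value; the key is whatever precedes it.
group_by_prefix() {
  awk -v suffixLen=10 '
    {
      cut = length($0) - suffixLen          # length of the variable-width key
      key = substr($0, 1, cut)
      vals[key] = vals[key] " " substr($0, cut + 1)
      count[key]++
    }
    END {
      for (k in vals)
        if (count[k] > 1) print k vals[k]   # drop keys seen only once
    }
  ' "$@" | sort                             # for-in order is unspecified, so sort
}
```

Because the values are collected in an array, the input does not have to be sorted; values for each key stay in input order.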

+2

Jan 13 Mar 12 at 14:10

awk with FIELDWIDTHS is one way, as shown in another answer.

Here is another way (a one-liner) using awk only:

awk 'BEGIN{FS=""} 
  {for(i=1;i<=6;i++) x=x$i; y=$0; gsub("^"x,"",y);a[x]=a[x]?a[x]" "y:y;  x="";}
   END{for(t in a)print t" "a[t]}' yourFile

Checked against your small data block:

kent$  echo "0000380000000101
0000650000000201
0000650000000301
0000650000000401
0001000000000101
0001000000000201"|awk 'BEGIN{FS=""} {for(i=1;i<=6;i++) x=x$i; y=$0; gsub("^"x,"",y);a[x]=a[x]?a[x]" "y:y;  x="";}END{for(t in a)print t" "a[t]}'

000100 0000000101 0000000201
000065 0000000201 0000000301 0000000401
000038 0000000101

+1

Kent 13 Mar 12 at 13:27

potong · Accepted Answer · 2012-03-13T13:22:08+0000

This might work for you (is sed OK?):

sed ':a;$!N;s/^\(.*\)\(\( *.\{10\}\)*\)\n\1/\1\2 /;ta;/ /!D;s/.\{10\} / &/;P;D' file
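000065 0000000201 0000000301 000000401
000100 0000000101 0000000201

Note that the last field of the first line comes out one digit short (000000401): from the third line of a group onward, the greedy \(.*\) can extend the key by a leading 0 of the value and the pattern still matches, so check the output for groups of three or more lines.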
