Finding specific fields in lines in a file
I have a file that contains data like:
0000380000000101
0000650000000201
0000650000000301
0000650000000401
0001000000000101
0001000000000201
.... etc. I want to process this data so that I get output as
000065 0000000201 0000000301 0000000401
000100 0000000101 0000000201
Since 000065 repeats three times, I want it to appear only once in the output, followed by the trailing bytes of every entry that starts with 000065. Since 000038 occurs only once, I don't want it in the output. In this example the leading data (i.e. 000065 or 000038) is 3 bytes, although it can be of any length, whereas the bytes after it (e.g. 0000000401) have a fixed length of 5 bytes. I would prefer to do this with shell scripts or C. Please let me know how I can do this. Can awk be useful here? Any help would be appreciated. Below is the data from the actual file that I want to process:
0000000000000101
0000000000000201
0000000000000301
0000000000000401
0000380000000101
0000650000000201
0000650000000301
0000650000000401
0001000000000101
0001000000000201
0001000000000301
0001000000000401
0038d30000000101
00652e0000000201
00652e0000000301
00652e0000000401
008d750000000101
008d750000000201
008d750000000301
008d750000000401
0100010000000101
0100010000000201
0100010000000301
0100010000000401
01008d0000000101
01008d0000000201
01008d0000000301
01008d0000000401
01a8c00000000101
01a8c00000000201
01a8c00000000301
01a8c00000000401
0264010000000101
0264010000000201
0264010000000301
0264010000000401
0615df0000000101
0615df0000000201
0615df0000000301
0615df0000000401
07dd940000000101
07dd940000000201
07dd940000000301
07dd940000000401
0900000000000101
0900000000000201
0900000000000301
0900000000000401
15dfc70000000101
15dfc70000000201
15dfc70000000301
15dfc70000000401
1ecf090000000101
Your data is fixed-width, so you can use gawk's FIELDWIDTHS:
$ gawk -v FIELDWIDTHS='6 10' 'NR!=1 && x==$1""{printf(" %s", $2); next}; {x=$1""; printf("%s%s %s", NR==1?"":"\n", $1, $2)}; END{print ""}' input.txt | sed '/^[0-9a-f]* [0-9a-f]*$/d'
000000 0000000101 0000000201 0000000301 0000000401
000065 0000000201 0000000301 0000000401
000100 0000000101 0000000201 0000000301 0000000401
00652e 0000000201 0000000301 0000000401
008d75 0000000101 0000000201 0000000301 0000000401
010001 0000000101 0000000201 0000000301 0000000401
01008d 0000000101 0000000201 0000000301 0000000401
01a8c0 0000000101 0000000201 0000000301 0000000401
026401 0000000101 0000000201 0000000301 0000000401
0615df 0000000101 0000000201 0000000301 0000000401
07dd94 0000000101 0000000201 0000000301 0000000401
090000 0000000101 0000000201 0000000301 0000000401
15dfc7 0000000101 0000000201 0000000301 0000000401
FIELDWIDTHS A white-space separated list of fieldwidths. When set, gawk parses the input into fields of fixed width, instead of using the value
of the FS variable as the field separator.
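FIELDWIDTHS is specific to gawk, and the command above fixes the key at 6 characters. For other awk implementations, or for keys of varying length, the same idea can be sketched with substr(). This is only a sketch: the function name group_adjacent is made up for illustration, and it assumes one record per line, a 10-character value at the end of each line, and that equal keys sit on adjacent lines (as in the sample data).

```shell
# group_adjacent: hypothetical helper, reads records from stdin or from files.
# Assumptions: one record per line, the last 10 characters are the value,
# and records with the same key are adjacent.
group_adjacent() {
  awk '
    # Print the current group only if its key was seen more than once.
    function emit() { if (n > 1) print key line }
    {
      k = substr($0, 1, length($0) - 10)   # everything before the value
      v = substr($0, length($0) - 9)       # last 10 characters
      if (NR > 1 && k != key) emit()
      if (NR == 1 || k != key) { key = k; line = ""; n = 0 }
      line = line " " v
      n++
    }
    END { if (NR > 0) emit() }
  ' "$@"
}

# Example with the short sample from the question.
printf '%s\n' 0000380000000101 0000650000000201 0000650000000301 \
  0000650000000401 0001000000000101 0001000000000201 | group_adjacent
```

Keys that occur only once are dropped inside awk, so the sed step is not needed with this variant.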
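You can run this awk command (tested on Linux and Mac). Note that awk's substr() is 1-indexed, so the key is characters 1 to 6 and the value starts at character 7:
awk '{key=substr($0, 1, 6); val=substr($0, 7); arr[key]=sprintf("%s %s",val,arr[key]);}
END{for (a in arr) {n=split(arr[a], el, " "); if (n>1) print a, arr[a]} }' file
OUTPUT (for the short sample; each new value is prepended, so the most recent one comes first):
000065 0000000401 0000000301 0000000201
000100 0000000201 0000000101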
First, pipe your data file through this:
awk '{suffixLen = 10; print substr($0, 1, length($0) - suffixLen)" "substr($0, length($0) - suffixLen + 1, length($0))}'
The suffixLen variable is the (fixed) number of characters in the trailing field: 2 hex characters for each of the 5 bytes = 10. This splits each input line into two space-separated fields.
Then pipe the result through:
awk '{if ($1 in values) {values[$1] = values[$1]" "$2} else {values[$1] = $1" "$2}}END{for (v in values) print values[v]}'
Sorting the result correctly is left as an exercise for the reader.
awk with FIELDWIDTHS is one way, as already shown.
Here is another way (a one-liner) using awk only:
awk 'BEGIN{FS=""}
{for(i=1;i<=6;i++) x=x$i; y=$0; gsub("^"x,"",y);a[x]=a[x]?a[x]" "y:y; x="";}
END{for(t in a)print t" "a[t]}' yourFile
Tested against your small data block:
kent$ echo "0000380000000101
0000650000000201
0000650000000301
0000650000000401
0001000000000101
0001000000000201"|awk 'BEGIN{FS=""} {for(i=1;i<=6;i++) x=x$i; y=$0; gsub("^"x,"",y);a[x]=a[x]?a[x]" "y:y; x="";}END{for(t in a)print t" "a[t]}'
000100 0000000101 0000000201
000065 0000000201 0000000301 0000000401
000038 0000000101
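The sample output above still contains 000038, which occurs only once and which the question wanted left out. One way to drop such lines, sketched here as a post-filter with the made-up name drop_singletons, is to keep only lines that have more than two fields (a key plus at least two values).

```shell
# drop_singletons: hypothetical post-filter for grouped output.
# Keeps a line only if it has a key and at least two values.
drop_singletons() {
  awk 'NF > 2' "$@"
}

# Example: filter the three grouped lines shown above.
printf '%s\n' '000100 0000000101 0000000201' \
  '000065 0000000201 0000000301 0000000401' \
  '000038 0000000101' | drop_singletons
```

The filter can be appended to the one-liner with a pipe, e.g. `... yourFile | awk 'NF > 2'`.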