Uniq-ing a delimited file based on a subset of fields

I have data like below:

1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

      

Due to the nature of the last two columns, their values change throughout the day and the same values recur regularly. By collapsing consecutive duplicate rows into my desired output (below), I can see every time their values changed (along with the epoch time in the first column). Is there a way to achieve the desired result shown below:

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

      

So I am consolidating the data by the last two columns. However, the consolidation is not unique across the whole file (as can be seen by 207.55,207.5 repeating).

I tried:

uniq -f 1

      

However, that only prints the first line and does not work through the rest of the list.

The awk solution below never re-prints a combination that has already appeared earlier, and therefore gives the result shown under the awk code:

awk -F, '!x[$2 $3]++' file

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55

      

I don't want to sort the data by the two columns. However, since the first column is an epoch timestamp, the data can be sorted by that column.

+3




5 answers


You cannot choose the separator with uniq; fields must be separated by blanks. With the help of tr you can work around that:



tr ',' ' ' <file | uniq -f1 | tr ' ' ','

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5 
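
To see what uniq is comparing, here is the intermediate stream after the first tr (a sketch, assuming file holds the sample data above); the final tr simply puts the commas back:

tr ',' ' ' <file

1493992429103289 207.55 207.5
1493992429103559 207.55 207.5
1493992429104353 207.55 207.5
1493992429104491 207.6 207.55
1493992429110551 207.55 207.5

Note that the round trip assumes the data itself contains no spaces, since any pre-existing space would become a comma on the way back.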

      

+2




You can use an Awk statement as below,

awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file

      

which produces output as needed.



1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

      

The idea is to store the values of the second and third columns in the variables s and t respectively, and to print a row only when it differs from the previous row in either of those columns.
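
The same comparison can be written with a single variable that holds both columns, which makes the "differs from the previous row" test explicit (a minimal sketch; key and prev are illustrative names, not part of the answer above):

awk 'BEGIN{FS=OFS=","} {key = $2 OFS $3} key != prev {print} {prev = key}' file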

+2




I found an answer that is not as elegant as Inian's, but it serves my purpose. Since my first column is always an epoch time in microseconds and never grows or shrinks in length, I can use the following uniq command:

uniq -s 17
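
A quick way to double-check the number of characters to skip (purely illustrative; it counts the 16-digit timestamp plus the trailing comma):

printf '%s' '1493992429103289,' | wc -c

17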

      

+1




You can also compare the current line with the previous line manually, using a loop.

previous_line=""
# start at first line
i=1

# suppress first column, that don't need to compare
sed 's@^[0-9][0-9]*,@@' ./data_file > ./transform_data_file

# for all line within file without first column
for current_line in $(cat ./transform_data_file)
do 
  # if previous record line are same than current line
  if [ "x$prev_line" == "x$current_line" ]
  then
    # record line number to supress after
    echo $i >> ./line_to_be_suppress
  fi

  # record current line as previous line
  prev_line=$current_line

  # increment current number line
  i=$(( i + 1 ))
done

# suppress lines
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i $line_to_suppress'd' ./data_file ; done

rm line_to_be_suppress
rm transform_data_file
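
The recorded line numbers are processed from the bottom up (via tac), so that deleting one line does not shift the numbers of the lines that are still to be deleted.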

      

0




Since your first field has a fixed length of 17 characters (including the separator ,), you can use the -s option of uniq, which will be more efficient for large files:

uniq -s 17 file

      

Gives this output:

1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5

      

From man uniq:

-f num

Ignore the first num fields in each input line when doing comparisons. A field is a string of non-blank characters separated from adjacent fields by blanks. Field numbers are one based, i.e., the first field is field one.

-s chars

Ignore the first chars characters in each input line when doing comparisons. If specified in conjunction with the -f option, the first chars characters after the first num fields will be ignored. Character numbers are one based, i.e., the first character is character one.
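
This also explains why uniq -f 1 only printed the first line: the sample lines contain no blanks, so each whole line is a single field; -f 1 skips that field entirely and every line then compares as equal. A minimal illustration with the first two sample lines:

printf '%s\n' '1493992429103289,207.55,207.5' '1493992429103559,207.55,207.5' | uniq -f 1

1493992429103289,207.55,207.5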

0








