Remove duplicate lines from file / grep
I want to delete all rows where the second column (05408736032 in the example below) is the same.
0009300 | 05408736032 | 89 | 01 | 001 | 0 | 0 | 0 | 1 | NNNNNNYNNNNNNNNN | ASDF |
0009367 | 05408736032 | 89 | 01 | 001 | 0 | 0 | 0 | 1 | NNNNNNYNNNNNNNNN | adff |
These lines are not sequential. It's fine to remove all of them; I don't need to keep one.
Sorry, my Unix fu is really weak from non-use :).
If all your input is formatted like the above, that is, the fields are fixed width and the order of the lines in the output doesn't matter, then
sort --key=8,19 --unique
should do the trick. If order matters, but the duplicate lines are always consecutive, then
uniq -s 8 -w 11
will work. If the fields are not fixed width but the duplicate lines are always consecutive, then Pax's awk script (the next answer) will work. In the most general case, we are probably looking at something too complex for a one-liner.
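If counting character positions is error-prone, an alternative is to key directly on the pipe-delimited second field. This is a sketch assuming GNU sort and an input file named input.txt; like --unique above, it keeps one line per value rather than dropping all copies:
sort -t'|' -k2,2 -u input.txt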
Assuming the duplicate lines are consecutive and you want to remove the subsequent ones, the following awk script will do it:
awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}'
It works by printing the first line and storing its second column. For each subsequent line, it skips the line if its second column matches the stored value; otherwise it prints the line and updates the stored value.
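For example, assuming the data is in a file named input.txt, the deduplicated output can be written to a new file like this:
awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}' input.txt > deduped.txt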
If they are not consecutive, I would go for a Perl solution where you maintain an associative array to detect and remove duplicates. I would code it, but my 3-year-old daughter just woke up at midnight and she has a cold - see you tomorrow if I survive the night :-)
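For reference, the same associative-array idea also fits in an awk one-liner; a minimal sketch assuming the file is named input.txt (note it keeps the first occurrence of each value rather than removing all copies, which is not quite what the asker wants):
awk -F'|' '!seen[$2]++' input.txt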
Most Unix systems include Python, so the following few lines may be just what you need:
f = open('input.txt', 'rt')
d = {}
for s in f:                      # iterate over lines directly; no need for readlines()
    l = s.split('|')
    key = l[1].strip()           # the second column is index 1 after splitting on '|'
    if key not in d:
        print s,                 # trailing comma: s already ends with a newline (Python 2)
        d[key] = True
f.close()
This works without requiring fixed-length fields, and even if the duplicate values are not adjacent.
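A quick way to try it, assuming the lines above are saved as dedup.py next to input.txt (and that a Python 2 interpreter is available, since the script uses the print statement):
python dedup.py > output.txt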
This makes two passes over the input file: 1) find the duplicated values, 2) remove every line that contains one of them.
awk -F\| '
    {count[$2]++}        # tally how many times each second-column value appears
    END {for (x in count) {if (count[x] > 1) {print x}}}    # emit the values seen more than once
' input.txt >input.txt.dups
awk -F\| '
    NR==FNR {dup[$1]++; next}    # first file: record each duplicated value
    !($2 in dup) {print}         # second file: print lines whose second column is not in that set
' input.txt.dups input.txt
If you are using bash, you can skip the temp file and combine it all into one line using process substitution: (deep breath)
awk -F\| 'NR==FNR {dup[$1]++; next} !($2 in dup) {print}' <(awk -F\| '{count[$2]++} END {for (x in count) {if (count[x] > 1) {print x}}}' input.txt) input.txt
(ugh!)
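For what it's worth, the same two-pass idea can also be written by naming the input file twice, which avoids both the temp file and the process substitution; a sketch assuming the file is called input.txt:
awk -F\| 'NR==FNR {count[$2]++; next} count[$2] == 1' input.txt input.txt
The first read of input.txt only builds the counts (NR==FNR), and the second read prints just the lines whose second-column value occurred exactly once, so every copy of a duplicated value is dropped.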