Remove duplicate lines from file / grep

I want to delete all rows where the second column (05408736032 in the sample below) is the same:

0009300 | 05408736032 | 89 | 01 | 001 | 0 | 0 | 0 | 1 | NNNNNNYNNNNNNNNN | ASDF |
0009367 | 05408736032 | 89 | 01 | 001 | 0 | 0 | 0 | 1 | NNNNNNYNNNNNNNNN | adff |

These lines are not sequential. It's fine to remove all of them; I don't need to keep one.

Sorry, my Unix fu is really weak from non-use :).


If the columns are not fixed width, you can still use sort:

sort -t '|' --key=2,2 -g FILENAME

  • The -t flag sets the field separator.
  • -g sorts in general numeric order.
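
Sorting only groups equal keys together; if keeping one line per value is acceptable, GNU sort can also collapse the duplicates in the same command (a sketch, assuming GNU coreutils):

sort -t '|' --key=2,2 --unique FILENAME

With --unique, sort emits only the first line of each run of equal keys.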

If all your input is formatted like the above - that is, the fields are fixed width - and the order of the lines in the output doesn't matter, then

sort --key=8,19 --unique

should do the trick. If order matters, but the repeated lines are always adjacent, then

uniq -s 8 -w 11

will work. If the fields are not fixed width but the repeated lines are always adjacent, Pax's awk script (below) will do the job. In the most general case, we are probably looking at something too complicated for a one-liner.
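
Note that --unique and plain uniq keep one copy of each repeated line. Since the asker doesn't need to keep any copy, uniq -u drops repeated lines entirely (a sketch reusing this answer's key offsets; -w requires GNU uniq, and the sort makes the duplicates adjacent first):

sort --key=8,19 FILENAME | uniq -u -s 8 -w 11

Here -u prints only the lines whose compared substring occurs exactly once.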





Assuming the duplicates are sequential and you only want to remove the subsequent ones, the following awk script will do it:

awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}'

It works by printing the first line and storing its second column. For each subsequent line, it skips those where the stored value and the second column are the same; if they differ, it prints the line and updates the stored value.

If the duplicates are not consecutive, I would go with a Perl solution where you maintain an associative array to detect and remove duplicates - I would code it up, but my 3-year-old daughter has just woken up at midnight and she has a cold - see you tomorrow if I survive the night :-)
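
In the meantime, here is a minimal awk sketch of that associative-array idea (assuming the same '|' separator and a file named input.txt): it counts every second-column value, buffers the lines, and in the END block prints only the lines whose value occurred once, so every copy of a duplicate is removed and the original order is kept:

awk -F'|' '
    {count[$2]++; line[NR] = $0; key[NR] = $2}
    END {
        for (i = 1; i <= NR; i++)
            if (count[key[i]] == 1)   # keep only values seen exactly once
                print line[i]
    }' input.txt

The trade-off is memory: the whole file is held in the line array until the end of the pass.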



This code removes duplicate words within a string (reading its input from a file named sent):

awk '{for (i=1; i<=NF; i++) {x=0; for (j=i-1; j>=1; j--) {if ($i == $j) {x=1}}; if (x != 1) {printf ("%s ", $i)}}; print ""}' sent
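
A quick check with a made-up sentence:

echo "one two one three two" | awk '{for (i=1; i<=NF; i++) {x=0; for (j=i-1; j>=1; j--) {if ($i == $j) {x=1}}; if (x != 1) {printf ("%s ", $i)}}; print ""}'

This prints one two three - each word is emitted only the first time it appears in the line.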



Unix systems usually include Python, so the following few lines might be just what you need:

f = open('input.txt', 'rt')
d = {}
for s in f.readlines():
    l = s.split('|')
    if l[1] not in d:    # l[1] is the second '|'-delimited column
        print s,         # trailing comma: s already ends with a newline
        d[l[1]] = True

This works without fixed-width fields, and even when the equal values are not adjacent. Note that it keeps the first line for each value rather than dropping all of them.
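
To run it, save those lines as a script - the name dedup.py here is just an example - and use a Python 2 interpreter, since the code uses the print statement:

python dedup.py > deduped.txt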



This awk prints only the lines where the second column is not 05408736032 (you would need to repeat it for every duplicated value):

awk -F'|' '{if ($2 !~ /05408736032/) print}' filename



This performs two passes over the input file: 1) find the duplicated values, 2) remove every line that carries one:

awk -F\| '
    {count[$2]++} 
    END {for (x in count) {if (count[x] > 1) {print x}}}
' input.txt >input.txt.dups

awk -F\| '
    NR==FNR {dup[$1]++; next}
    !($2 in dup) {print}
' input.txt.dups input.txt
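
The NR==FNR condition is true only while awk is reading its first argument (the duplicates file), so those values are loaded into the dup array; lines of input.txt are then printed only when their second column is absent from dup.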

If you are using bash, you can avoid the temp file by combining the two passes into a single line with process substitution: (deep breath)

awk -F\| 'NR==FNR {dup[$1]++; next} !($2 in dup) {print}' <(awk -F\| '{count[$2]++} END {for (x in count) {if (count[x] > 1) {print x}}}' input.txt) input.txt

(ugh!)



awk -F"|" '!_[$2]++' file



Put the lines in a hash, using the line as both key and value, then iterate over the hash and print the keys (this works in almost any language with associative arrays: awk, Perl, etc.).
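
A minimal awk sketch of that idea (note that iterating over the hash does not preserve the original line order):

awk '{seen[$0] = 1} END {for (line in seen) print line}' input.txt

Because the whole line is the key, this removes only rows that are identical from start to finish; keying on $2 with -F'|' would dedupe by the second column instead.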







