Remove duplicate lines from file / grep
I want to delete all rows where the second column (05408736032 in the example below) is the same.
0009300 | 05408736032 | 89 | 01 | 001 | 0 | 0 | 0 | 1 | NNNNNNYNNNNNNNNN | ASDF |
0009367 | 05408736032 | 89 | 01 | 001 | 0 | 0 | 0 | 1 | NNNNNNYNNNNNNNNN | adff |
These lines are not sequential. It's fine to remove all of them; I don't need to keep one.
Sorry, my Unix fu is really weak from non-use :).
If all your input is formatted like the above, that is, the fields are fixed width and the order of the lines in the output doesn't matter, then
sort --key=8,19 --unique
should do the trick. If order matters, but the duplicate lines are always consecutive, then
uniq -s 8 -w 11
will work. If the fields are not fixed width but the duplicate lines are always consecutive, then Pax's awk script (the next answer) will work. In the most general case, we are probably looking at something too complex for a one-liner.
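If counting character positions is error-prone, an alternative is to key directly on the pipe-delimited second field. This is a sketch assuming GNU sort and an input file named input.txt; like --unique above, it keeps one line per value rather than dropping all copies:
sort -t'|' -k2,2 -u input.txt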
Assuming the duplicate lines are consecutive and you want to remove the subsequent ones, the following awk script will do it:
awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}'
It works by printing the first line and storing its second column. For each subsequent line, it skips the line if its second column matches the stored value; otherwise it prints the line and updates the stored value.
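For example, assuming the data is in a file named input.txt, the deduplicated output can be written to a new file like this:
awk -F'|' 'NR==1 {print;x=$2} NR>1 {if ($2 != x) {print;x=$2}}' input.txt > deduped.txt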
If they are not consecutive, I would go for a Perl solution where you maintain an associative array to detect and remove duplicates. I would code it, but my 3-year-old daughter just woke up at midnight and she has a cold - see you tomorrow if I survive the night :-)
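For reference, the same associative-array idea also fits in an awk one-liner; a minimal sketch assuming the file is named input.txt (note it keeps the first occurrence of each value rather than removing all copies, which is not quite what the asker wants):
awk -F'|' '!seen[$2]++' input.txt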
Most Unix systems include Python, so the following few lines may be just what you need:
f = open('input.txt', 'rt')
d = {}
for s in f:                      # iterate over lines directly; no need for readlines()
    l = s.split('|')
    key = l[1].strip()           # the second column is index 1 after splitting on '|'
    if key not in d:
        print s,                 # trailing comma: s already ends with a newline (Python 2)
        d[key] = True
f.close()
This works without requiring fixed-length fields, and even if the duplicate values are not adjacent.
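A quick way to try it, assuming the lines above are saved as dedup.py next to input.txt (and that a Python 2 interpreter is available, since the script uses the print statement):
python dedup.py > output.txt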
This makes two passes over the input file: 1) find the duplicated values, 2) remove every line that contains one of them.
awk -F\| '
    {count[$2]++}        # tally how many times each second-column value appears
    END {for (x in count) {if (count[x] > 1) {print x}}}    # emit the values seen more than once
' input.txt >input.txt.dups
awk -F\| '
    NR==FNR {dup[$1]++; next}    # first file: record each duplicated value
    !($2 in dup) {print}         # second file: print lines whose second column is not in that set
' input.txt.dups input.txt
If you are using bash, you can skip the temp file and combine it all into one line using process substitution: (deep breath)
awk -F\| 'NR==FNR {dup[$1]++; next} !($2 in dup) {print}' <(awk -F\| '{count[$2]++} END {for (x in count) {if (count[x] > 1) {print x}}}' input.txt) input.txt
(ugh!)
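For what it's worth, the same two-pass idea can also be written by naming the input file twice, which avoids both the temp file and the process substitution; a sketch assuming the file is called input.txt:
awk -F\| 'NR==FNR {count[$2]++; next} count[$2] == 1' input.txt input.txt
The first read of input.txt only builds the counts (NR==FNR), and the second read prints just the lines whose second-column value occurred exactly once, so every copy of a duplicated value is dropped.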