How to remove partial duplicate lines using AWK?
I have files with duplicate lines like this, where only the last field is different:
OST,0202000070,01-AUG-09,002735,6,0,0202000068,4520688,-1,0,0,0,0,0,55
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,5
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,55
OST,0202000068,01-AUG-09,003019,6,0,0202000071,4520690,-1,0,0,0,0,0,55
I need to remove the first occurrence of such a line and keep the second one.
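For the sample above, that means the desired output would be:
OST,0202000070,01-AUG-09,002735,6,0,0202000068,4520688,-1,0,0,0,0,0,55
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,55
OST,0202000068,01-AUG-09,003019,6,0,0202000071,4520690,-1,0,0,0,0,0,55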
I tried:
awk '!x[$0]++ {getline; print $0}' file.csv
but it doesn't work as intended: it also removes non-duplicate rows.
3 answers
If your near-duplicates are always contiguous, you can simply compare against the previous entry instead of building a potentially huge associative array.
#!/bin/awk -f
{
    # key = everything before the last comma-separated field
    s = substr($0, 1, match($0, /,[^,]*$/) - 1)
    # key changed: flush the last line of the previous group
    if (NR > 1 && s != prev)
        print prev0
    prev = s
    prev0 = $0
}
END {
    # flush the final group
    if (NR) print prev0
}
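Saved as, say, dedupe.awk (the filename is just an example), you would run it as:
awk -f dedupe.awk file.csv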
Edit: Changed the script so that it prints the last line in a group of near-duplicates (so tac is not required).
As a general strategy (I'm not much of an AWK pro, despite the Aho connection), you can try:
- Concatenate all fields except the last one.
- Use that string as the key into a hash.
- Store the entire line as the value.
- Once all lines are processed, loop over the hash and print the values.
This isn't actual AWK code and I can't easily provide an example, but it's what I would try first.
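A minimal sketch of this strategy in AWK (illustrative only; the key construction and overwrite behavior are assumptions based on the steps above) might look like:
#!/bin/awk -f
{
    key = $0
    sub(/,[^,]*$/, "", key)   # strip the last field to form the key
    seen[key] = $0            # a later near-duplicate overwrites the earlier line
}
END {
    for (key in seen)         # caveat: for-in iteration order is unspecified in AWK
        print seen[key]
}
If the original line order matters, you would also need to record the order in which each key first appears, since for-in iteration order in AWK is unspecified.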