Extended `uniq` with a "unique regexp"
`uniq` is a tool that filters the lines of a file so that only unique lines are displayed. `uniq` has some support for specifying when two lines are "equivalent", but the options are limited.
I'm looking for a tool or an extension of `uniq` that accepts a regular expression as input. If the captured group is the same for two lines, the two lines are considered "equivalent". For each equivalence class, only the first match is returned.
Example, with `file.dat`:
foo!bar!baz
!baz!quix
!bar!foobar
ID!baz!
Using `grep -P '(!\w+!)' -o`, you can extract the "unique parts":
!bar!
!baz!
!bar!
!baz!
This means that the first line is considered "equivalent" to the third, and the second to the fourth. Thus, only the first and second are printed (the third and fourth are ignored).
Then `uniq '(!\w+!)' < file.dat` should return:
foo!bar!baz
!baz!quix
This doesn't use `uniq`, but with GNU awk (gawk) you can get the results you want:
awk -v re='![[:alnum:]]+!' 'match($0, re, a) && !(a[0] in p) {p[a[0]]; print}' file
foo!bar!baz
!baz!quix
- The required regex is passed in as a command-line variable with `-v re=...`.
- The `match` function tests the regular expression against each line and stores the matched text in the array `a` (the whole match is in `a[0]`).
- Each time `match` succeeds and the matched text is not yet in the associative array `p`, we store it in `p` and print the line.
- This effectively gives you a `uniq`-like function with regex support.
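The three-argument form of `match` used above is a gawk extension. Where gawk is not available, roughly the same idea can be sketched with the standard two-argument `match`, which sets `RSTART` and `RLENGTH`. The function name `uniq_re` is a hypothetical helper, not an existing tool, and the demo writes `[A-Za-z0-9]` instead of `[[:alnum:]]` on the assumption that some older awk builds lack POSIX character classes.

```shell
# uniq_re REGEX [FILE...]
# Hypothetical helper: print the first line seen for each distinct regex match.
# Lines that do not match REGEX are dropped, as in the gawk command above.
uniq_re() {
    re=$1
    shift
    awk -v re="$re" '
        # Two-argument match() sets RSTART and RLENGTH on success.
        match($0, re) {
            key = substr($0, RSTART, RLENGTH)   # text matched by the regex
            if (!(key in seen)) {
                seen[key] = 1                   # remember this match
                print
            }
        }
    ' "$@"
}

# Demo on the sample data from the question.
printf '%s\n' 'foo!bar!baz' '!baz!quix' '!bar!foobar' 'ID!baz!' |
    uniq_re '![A-Za-z0-9]+!'
```

Note that this compares the whole match rather than a capture group, and that awk processes backslash escapes in `-v` assignments, so patterns containing backslashes may need extra escaping.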
Here's a simple Perl script that will get the job done:
#!/usr/bin/env perl
use strict;
use warnings;
my $re = qr($ARGV[0]);
my %matches;
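while (<STDIN>) {
    next if $_ !~ $re;          # skip lines that do not match the regex
    print if !$matches{$1};     # print only the first line for each captured group
    $matches{$1} = 1;           # remember this captured group
}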
Usage:
$ ./uniq.pl '(!\w+!)' < file.dat
foo!bar!baz
!baz!quix
Here I used `$1` to match on the first captured group, but you can replace it with `$&` to use the whole match. This script filters out lines that don't match the regex, but you can tweak it if you need different behavior.