Extended `uniq` with "unique regexp"

uniq is a tool that filters the lines of a file so that only unique lines are displayed. uniq has some support for specifying when two lines are "equivalent", but the options are limited.

I'm looking for a tool, or an extension of uniq, that accepts a regular expression as input. If the captured group is the same for two lines, the two lines are considered "equivalent". For each equivalence class, only the first matching line is returned.

Example file.dat:

foo!bar!baz
!baz!quix
!bar!foobar
ID!baz!

      

Using grep -P '(!\w+!)' -o, you can extract the "unique parts":

!bar!
!baz!
!bar!
!baz!

      

This means that the first line is considered "equivalent" to the third, and the second to the fourth. Thus, only the first and second are printed (the third and fourth are ignored).

Then uniq '(!\w+!)' < file.dat should return:

foo!bar!baz
!baz!quix
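
To make the equivalence classes described above concrete, here is a minimal Python sketch. It is not part of the original question, and the names lines, pattern and classes are only illustrative. It groups the line numbers of file.dat by the text of the captured group:

```python
import re
from collections import defaultdict

# Contents of file.dat from the example above.
lines = ["foo!bar!baz", "!baz!quix", "!bar!foobar", "ID!baz!"]
pattern = re.compile(r"(!\w+!)")

# Map each captured key to the 1-based numbers of the lines that share it.
classes = defaultdict(list)
for number, line in enumerate(lines, start=1):
    m = pattern.search(line)
    if m:
        classes[m.group(1)].append(number)

for key, members in classes.items():
    print(key, members)
```

For the sample data this prints !bar! [1, 3] and !baz! [2, 4], so the requested tool would keep only the first member of each class (lines 1 and 2).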

      

+3




3 answers


This doesn't use uniq, but with GNU awk (gawk) you can get the result you want:

awk -v re='![[:alnum:]]+!' 'match($0, re, a) && !(a[0] in p) {p[a[0]]; print}' file
foo!bar!baz
!baz!quix

      



  • The regex is passed in through a command-line variable, -v re=...
  • The match function matches the regular expression against each line and returns the matched text in the array a (a[0] holds the whole match). The three-argument form of match is a gawk extension.
  • Each time match succeeds and the matched text is not yet in the associative array p, we store it in p and print the line.
  • This effectively gives you a uniq with regex support.
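
For readers without gawk, the same logic can be sketched in Python. This is an illustrative translation, not the answer's code: first_per_match is a made-up name, and [A-Za-z0-9] stands in for [[:alnum:]] because Python's re module has no POSIX character classes.

```python
import re

def first_per_match(lines, pattern):
    """Yield each line whose matched text has not been seen before."""
    regex = re.compile(pattern)
    seen = set()                      # plays the role of the awk array p
    for line in lines:
        m = regex.search(line)        # roughly match($0, re, a)
        if m and m.group(0) not in seen:   # group(0) corresponds to a[0]
            seen.add(m.group(0))
            yield line

data = ["foo!bar!baz", "!baz!quix", "!bar!foobar", "ID!baz!"]
result = list(first_per_match(data, r"![A-Za-z0-9]+!"))
print(result)
```

As in the awk one-liner, lines that do not match the pattern at all are dropped.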

+2




Here's a simple Perl script that will get the job done:

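#!/usr/bin/env perl
use strict;
use warnings;

# Compile the pattern given as the first command-line argument.
my $re = qr($ARGV[0]);

# Keys (text of the first captured group) that were already printed.
my %matches;
while(<STDIN>) {
    next if $_ !~ $re;         # skip lines that do not match
    print if !$matches{$1};    # first line with this key: print it
    $matches{$1} = 1;          # remember the key
}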

      

Usage:



$ ./uniq.pl '(!\w+!)' < file.dat
foo!bar!baz
!baz!quix

      

Here I used $1 to key on the first captured group, but you can replace it with $& to use the whole match. This script filters out lines that don't match the regex, but you can tweak it if you need different behavior.
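
The choice between $1 and $& matters as soon as the pattern matches more than the captured group. The Python sketch below illustrates this under assumed names (uniq_by, group, keep_unmatched are hypothetical, not from the script above). The pattern !(\w+)!\w* and the extra line without delimiters are chosen only to make the difference visible.

```python
import re

def uniq_by(lines, pattern, group=1, keep_unmatched=False):
    """Keep the first line for each distinct value of the chosen group.

    group=1 mirrors Perl's $1, group=0 mirrors $& (the whole match).
    keep_unmatched=True passes non-matching lines through unchanged.
    """
    regex = re.compile(pattern)
    seen = set()
    for line in lines:
        m = regex.search(line)
        if m is None:
            if keep_unmatched:
                yield line
            continue
        key = m.group(group)
        if key not in seen:
            seen.add(key)
            yield line

data = ["foo!bar!baz", "!baz!quix", "no delimiters here", "!bar!foobar", "ID!baz!"]
pattern = r"!(\w+)!\w*"

by_group = list(uniq_by(data, pattern, group=1))
by_whole = list(uniq_by(data, pattern, group=0))
with_unmatched = list(uniq_by(data, pattern, group=1, keep_unmatched=True))

print(by_group)
print(by_whole)
print(with_unmatched)
```

Keyed on the group, only the first two lines survive. Keyed on the whole match, all four matching lines have different keys and are all printed.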

+2




You can do this using only grep and sort:

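DATAFILE=file.dat

# Collect the distinct keys, then print the first line containing each one.
# -F makes the second grep treat the key as a fixed string, not a regex.
for match in $(grep -P '(!\w+!)' -o "$DATAFILE" | sort -u); do 
  grep -m1 -F -- "$match" "$DATAFILE";
done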

      

Outputs:

foo!bar!baz
!baz!quix
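
One thing to keep in mind with this approach: the loop walks the keys in sorted order, so the output follows the order of the keys rather than the order of the file. For file.dat the two happen to coincide. The Python sketch below only approximates the pipeline (a substring test instead of grep, Python's default string ordering instead of the locale-aware sort), and the function names and sample data are made up to show a case where the orders differ.

```python
import re

def first_seen_order(lines, regex):
    """First line per key, in the order the keys first appear in the input."""
    seen = set()
    out = []
    for line in lines:
        m = regex.search(line)
        if m and m.group(0) not in seen:
            seen.add(m.group(0))
            out.append(line)
    return out

def sorted_key_order(lines, regex):
    """Rough emulation of: grep -o ... | sort -u, then grep -m1 per key."""
    keys = sorted({m.group(0) for line in lines for m in regex.finditer(line)})
    return [next(line for line in lines if key in line) for key in keys]

regex = re.compile(r"!\w+!")
data = ["x!zed!1", "y!abc!2", "z!zed!3"]

in_file_order = first_seen_order(data, regex)
in_key_order = sorted_key_order(data, regex)
print(in_file_order)
print(in_key_order)
```

Both variants select the same lines here, but the second one lists y!abc!2 before x!zed!1.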

      

+1

