Extended `uniq` with "unique regexp"

Question

`uniq` is a tool that filters the lines of a file so that only unique lines are displayed. `uniq` has some support for specifying when two lines are "equivalent", but the options are limited.

I'm looking for a tool or an extension of `uniq` that accepts a regular expression as input. If the captured group is the same for two lines, the two lines are considered "equivalent". For each equivalence class, only the first matching line is returned.

Example:

`file.dat`:

foo!bar!baz
!baz!quix
!bar!foobar
ID!baz!

Using `grep -P '(!\w+!)' -o`, you can extract the "unique parts":

!bar!
!baz!
!bar!
!baz!

This means that the first line is considered "equivalent" to the third, and the second to the fourth. Thus, only the first and second are printed (the third and fourth are ignored).

Then `uniq '(!\w+!)' < file.dat` should return:

foo!bar!baz
!baz!quix

+3

linux regex shell awk uniq

Willem van onsem 29 oct. 14 at 14:47

3 answers

Here's a simple Perl script that will get the job done:

#!/usr/bin/env perl
use strict;
use warnings;

my $re = qr($ARGV[0]);

my %matches;
while(<STDIN>) {
    next if $_ !~ $re;
    print if !$matches{$1};
    $matches{$1} = 1;
}

Usage:

$ ./uniq.pl '(!\w+!)' < file.dat
foo!bar!baz
!baz!quix

Here I used `$1` to key on the first capture group, but you can replace it with `$&` to use the whole match instead. This script filters out lines that don't match the regex, but you can tweak it if you need different behavior.

+2

Lucas Trzesniewski 29 oct. '14 at 15:20

You can do this with only `grep` and `sort`:

DATAFILE=file.dat

# For each distinct match, print the first line that contains it
for match in $(grep -P '(!\w+!)' -o "$DATAFILE" | sort -u); do
  grep -m1 "$match" "$DATAFILE"
done

Outputs:

foo!bar!baz
!baz!quix
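
A slightly more defensive variant of the same idea, as a sketch: it reads the matches line by line so that matches containing whitespace survive, and it looks each one up as a fixed string with `-F` rather than as a regex. The function name `uniq_re_grep` is hypothetical, and `-E` is swapped in for `-P` so the sketch does not depend on PCRE support (so write `[[:alnum:]_]` instead of `\w`). As with the loop above, the output follows the sorted order of the matches, not the order of the lines in the file.

```shell
# Hypothetical helper: uniq_re_grep PATTERN FILE
uniq_re_grep() {
    pattern=$1
    file=$2
    # List every match, one per line, and reduce them to the distinct ones.
    grep -oE -- "$pattern" "$file" | sort -u |
    while IFS= read -r m; do
        # -F: treat the match as a literal string; -m1: stop at the first hit.
        grep -m1 -F -- "$m" "$file"
    done
}
```

Called as `uniq_re_grep '![[:alnum:]_]+!' file.dat`, this should give the same two lines for the sample file.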

+1

arco444 29 oct. 14 at 15:25

anubhava · Accepted Answer · 2014-10-29T15:19:21+0000

This doesn't use `uniq`, but with gnu-awk you can get the results you want:

awk -v re='![[:alnum:]]+!' 'match($0, re, a) && !(a[0] in p) {p[a[0]]; print}' file
foo!bar!baz
!baz!quix

The required regex is passed in as a command-line variable with `-v re=...`.
The `match` function matches the regular expression against each line and returns the matched text in the array `a`.
Each time `match` succeeds and the matched text has not been seen before, we store it in the associative array `p` and print the line.
This effectively gives a `uniq` function with regex support.
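
The three-argument form of `match` is a gawk extension. As a rough sketch for other awk implementations, the same logic can be written with the standard `RSTART` and `RLENGTH` variables. The function name `uniq_re` and the array name `seen` are hypothetical, and without capture groups the key is always the whole match:

```shell
# Hypothetical helper: uniq_re PATTERN [FILE...]
# Prints the first line for each distinct regex match, using only
# POSIX awk features (no three-argument match()).
uniq_re() {
    re=$1
    shift
    awk -v re="$re" '
        # match() sets RSTART and RLENGTH when the regex is found in the line.
        match($0, re) {
            key = substr($0, RSTART, RLENGTH)   # text of the leftmost match
            if (!(key in seen)) {
                seen[key] = 1   # remember this key
                print           # first line seen for this key
            }
        }
    ' "$@"
}
```

With the sample file, `uniq_re '![[:alnum:]]+!' file.dat` should print `foo!bar!baz` and `!baz!quix`. Note that `-v` processes backslash escapes in the value, so a pattern containing backslashes may need them doubled.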
