Grep - why are word boundaries needed around a backreference?

I'm just wondering why grep is matching things this way.

For example, let's say I'm trying to find a word that appears twice in a sentence (and not as part of other words). So I'm trying to find lines like:

hello everybody hello

      

and not like this:

hello everybody hellopeople 

      

Then why does the following grep expression work:

grep -E '(\<.*\>).*\<\1\>' file

      

not the following:

grep -E '(\<.*\>).*\1' file

      

I would have thought the second expression would work because the word boundaries (\< and \>) are already inside the parentheses of the captured group, but it doesn't. It seems rather confusing to have to put word boundaries around the backreference as well. Can anyone explain why grep matches lines this way, or go into more detail on this idea?


2 answers


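A zero-width (zero-length) match cannot be captured in a capture group. \b, \< and \> match with zero length, so they are not part of what the group captures: the group holds only the text between them, and \1 matches that text again without re-applying the boundaries. The same is true of other zero-width assertions, such as lookbehind and lookahead.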

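For example, the following pattern (PCRE syntax, so it needs grep -P rather than grep -E):

((?<=#)\w+(?=#)).*\1

matches the string

#hello# everybody hellofoo

even though the second "hello" has no # around it.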

PS: you can use \w+ instead of .* inside your word boundaries, so that the group captures a single word.
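
As a quick check of the explanation above, here is a minimal sketch that runs both expressions from the question, plus the \w+ variant, against the two sample lines. It assumes GNU grep, since \<, \>, \w and backreferences in -E patterns are GNU extensions; the variable names are only for illustration.

```shell
#!/bin/sh
# Two sample lines from the question (assumes GNU grep).
sample='hello everybody hello
hello everybody hellopeople'

# Boundaries around the backreference: the repeat must be a whole word.
with_boundaries=$(printf '%s\n' "$sample" | grep -E '(\<.*\>).*\<\1\>')

# No boundaries around \1: the captured text may reappear inside a longer word.
without_boundaries=$(printf '%s\n' "$sample" | grep -E '(\<.*\>).*\1')

# Same as the first pattern, but \w+ keeps the group to a single word.
single_word=$(printf '%s\n' "$sample" | grep -E '(\<\w+\>).*\<\1\>')

printf 'with boundaries:\n%s\n\n' "$with_boundaries"
printf 'without boundaries:\n%s\n\n' "$without_boundaries"
printf 'single word:\n%s\n' "$single_word"
```

With GNU grep, the first and third commands should print only the line hello everybody hello, while the second should print both lines, because the captured text hello is found again inside hellopeople.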



You can use awk to solve this problem if you don't find a good solution with grep.

awk '{for (i=1;i<=NF;i++) if (a[$i]++) print $i; delete a}' file
hello

If a word occurs more than once in a line, it is printed (hello for the first sample line in the question).
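
Since the question asks for the matching lines rather than the repeated word, a variant along these lines might be closer to what is wanted. This is only a sketch: the array name seen is arbitrary, and split("", seen) is used to empty the array because it works in awk implementations that do not accept delete on a whole array.

```shell
#!/bin/sh
# Two sample lines from the question.
sample='hello everybody hello
hello everybody hellopeople'

# Print each line in which some whitespace-separated field occurs twice.
# split("", seen) is a portable way to empty the array for every line.
repeated=$(printf '%s\n' "$sample" | awk '{
    split("", seen)
    for (i = 1; i <= NF; i++)
        if (seen[$i]++) { print; next }
}')

printf '%s\n' "$repeated"
```

Note that awk splits fields on whitespace here, so a word followed by punctuation (hello,) would count as a different word from hello.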
