Grep - why are word boundaries needed around backreferences?
I'm just wondering why grep is matching things this way.
For example, let's say I'm trying to find a word that appears twice in a sentence (and not as part of other words). So I'm trying to find lines like:
hello everybody hello
and not like this:
hello everybody hellopeople
Then why does the following grep expression work:
grep -E '(\<.*\>).*\<\1\>' file
but not the following:
grep -E '(\<.*\>).*\1' file
I would have thought the second would work, because the word boundaries (\< and \>) are inside the parentheses that the backreference refers to, but it doesn't. It seems rather confusing to have to put word boundaries around the backreference as well. Can anyone explain why grep matches lines this way, or go into more detail on the idea?
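A zero-width (zero-length) match cannot be captured in a capture group. \b, \< and \> match a position, not characters, so they contribute nothing to what the group stores: the group holds only the text matched between them, and \1 repeats that text with no boundary check attached. The same is true of other zero-width assertions, such as lookbehind and lookahead.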
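To see this on the two lines from the question, here is a minimal sketch. It assumes GNU grep (backreferences and \< \> in -E are GNU extensions), and the variable names are only for illustration.

```shell
# The two sample lines from the question.
lines='hello everybody hello
hello everybody hellopeople'

# Backreference without boundaries: \1 is only the captured text,
# so it can also match the "hello" at the start of "hellopeople".
unbounded=$(printf '%s\n' "$lines" | grep -E '(\<.*\>).*\1')

# Backreference wrapped in \< \>: the repeat must be a whole word.
bounded=$(printf '%s\n' "$lines" | grep -E '(\<.*\>).*\<\1\>')

printf 'without boundaries:\n%s\n\nwith boundaries:\n%s\n' "$unbounded" "$bounded"
```

The first command prints both lines, the second prints only "hello everybody hello".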
For example, with lookarounds (PCRE syntax, as used by grep -P):
((?<=#)\w+(?=#)).*\1
will match the following string, because \1 holds only hello and not the # checks around it:
#hello# everybody hellofoo
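If you want to try that pattern yourself, something like the following should work, assuming a GNU grep built with PCRE support (-P); the variable names are made up for the example.

```shell
line='#hello# everybody hellofoo'

# -P selects PCRE so the lookarounds are understood;
# -o prints only the part of the line that matched.
matched=$(printf '%s\n' "$line" | grep -oP '((?<=#)\w+(?=#)).*\1')

printf '%s\n' "$matched"
```

This prints "hello# everybody hello": the match starts after the first # and ends inside hellofoo, since the backreference does not re-apply the # checks.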
PS: you can use \w+ instead of .* inside your word boundaries.