Grep - why should there be word boundaries around backreferences?

Question

I'm just wondering why grep is matching things this way.

For example, let's say I'm trying to find a word that appears twice in a sentence (and not as part of other words). So I'm trying to find lines like:

hello everybody hello

and not like this:

hello everybody hellopeople

Then why does the following grep expression work:

grep -E '(\<.*\>).*\<\1\>' file

but not the following:

grep -E '(\<.*\>).*\1' file

I would have thought the second would work, because the word boundaries (\< and \>) are inside the parentheses that \1 refers back to, but it doesn't. It seems rather confusing to have to put word boundaries around the backreference. Can anyone explain why grep matches lines this way, or go into more detail on this idea?

+3

bash grep backreference word-boundary

Bolboa Dec 27 '14 at 20:19

2 answers

You can use awk to solve this problem if you don't find a good grep solution.

awk '{for (i=1;i<=NF;i++) if (a[$i]++) print $i;delete a}'
hello

If the word exists more than once in the line, print it.
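Since the question asks for the matching lines rather than the repeated word, here is a minimal sketch of a variant that prints the whole line instead. The function name repeated_word_lines is hypothetical, and "word" here means a whitespace-separated field, so attached punctuation counts as part of the word.

```shell
# Hypothetical helper: print each input line in which some
# whitespace-separated field occurs more than once.
repeated_word_lines() {
  awk '{
    split("", seen)                      # portable way to empty the array for each line
    for (i = 1; i <= NF; i++)
      if (seen[$i]++) { print; next }    # second sighting of a field: print the line once
  }' "$@"
}

# Try it on the two sample lines from the question.
printf '%s\n' 'hello everybody hello' 'hello everybody hellopeople' | repeated_word_lines
```

Because fields are compared as whole strings, hellopeople is not treated as a repeat of hello, so no word boundaries are needed here.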

-2

Jotne Dec 27 '14 at 20:24

Kent · Accepted Answer · 2014-12-27T21:01:16+0000

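A zero-width (zero-length) match cannot be captured in a capture group. \b, \< and \> match at a position without consuming any characters, so they contribute nothing to what the group captures. The same goes for other zero-width assertions, such as lookbehind and lookahead. A backreference repeats only the captured text, so \1 stands for the bare word with no boundary checks, and the boundaries have to be written again around it.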

For example, in a regex flavor that supports lookaround (such as PCRE, via grep -P):

((?<=#)\w+(?=#)).*\1

will match the string

#hello# everybody hellofoo

PS: you can use \w+ instead of .* inside your word boundaries, so that the group captures a single word.
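To see the difference on the question's own sample lines, here is a small sketch that runs both patterns with \w+ in place of .* as suggested. It assumes GNU grep, since \<, \>, \w and backreferences under -E are GNU extensions; the variable names are only for illustration.

```shell
# The two sample lines from the question.
sample='hello everybody hello
hello everybody hellopeople'

# Boundaries inside the group only: \1 repeats the captured text "hello"
# with no boundary check, so the "hello" inside "hellopeople" also satisfies it.
loose=$(printf '%s\n' "$sample" | grep -E '(\<\w+\>).*\1')

# Boundaries around the backreference: the repeat must itself be a whole word.
strict=$(printf '%s\n' "$sample" | grep -E '(\<\w+\>).*\<\1\>')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

With GNU grep, loose should contain both lines, while strict should contain only hello everybody hello.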
