Gsub awk problem (gawk)

I need to search a text file for a string and make a replacement that includes a number that increases with each match.

The string to be "found" can be a single character, word, or phrase.

The replacement expression will not always be the same (as in my examples below), but will always include a number (variable) that is incremented.

For example:

1) I have a test file named "data.txt". The file contains:

Now is the time
for all good men
to come to the
aid of their party.

      

2) I put the awk script in a file called "cmd.awk". The file contains:

/f/ {sub ("f","f(" ++j ")")}1

      

3) I am using awk like this:

awk -f cmd.awk data.txt

      

In this case, the output will be as expected:

Now is the time
f(1)or all good men
to come to the
aid of(2) their party.

      

The problem occurs when there is more than one match on a line. For example, if I were looking for the letter "i":

/i/ {sub ("i","i(" ++j ")")}1

      

Output:

Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.

      

which is wrong because it does not number the "i" in "time" or "their".

So, I tried "gsub" instead of "sub", for example:

/i/ {gsub ("i","i(" ++j ")")}1

      

Output:

Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.

      

It now replaces all occurrences of the letter "i", but the inserted number is the same for all matches on the same line.
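
This happens because awk builds the replacement string once, before gsub() starts, so `++j` runs once per call rather than once per match. A minimal sketch that shows this, assuming a made-up one-line input and a hypothetical helper name `demo_gsub_once`:

```shell
# The replacement argument is evaluated before gsub() runs,
# so the counter advances once per call, not once per match.
demo_gsub_once() {
    printf 'ii ii\n' |
        awk '{ n = gsub("i", "i(" (++j) ")"); print n, j, $0 }'
}
demo_gsub_once   # prints: 4 1 i(1)i(1) i(1)i(1)
```

gsub() reports four substitutions, yet `j` is only 1 afterwards.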

The required output should be:

Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.

      

Note: the number won't always start at 1, so I may invoke awk like this:

awk -f cmd.awk -v j=26 data.txt

      

To get the result:

Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.

      

And just to be clear, the replacement number will not always be inside the parentheses. And the replacement won't always include the matched string (in fact, that would be pretty rare).

Another problem I am facing is ...

I want to use an awk variable (not an environment variable) for the "search string" so I can specify it on the awk command line.

For example:

1) I put the awk script in a file called "cmd.awk". The file contains something like:

/??a??/ {gsub (a,a "(" ++j ")")}1

      

2) I would use awk like this:

awk -f cmd.awk -v a=i data.txt

      

To get the result:

Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.

      

The question here is: how do I use the variable "a" in the /search/ pattern?


3 answers


awk version:



awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i data.txt
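
With FS set to the search character, every match becomes a field boundary, so fields 2 through NF each begin right after a match. Prefixing them with the counter and rejoining with OFS puts the matched character back. A sketch of the same idea with the search character and the starting value passed as variables; the function name `number_by_fields` is hypothetical, and it assumes a single literal character other than a space, since a longer FS is treated as a regex and OFS could not reproduce whatever was matched:

```shell
number_by_fields() {
    # $1 = single search character, $2 = counter value before the first match
    awk -v a="$1" -v k="$2" '
        BEGIN { FS = a; OFS = a }            # split and rejoin on the search character
        {
            for (i = 2; i <= NF; i++)        # each field from 2 on follows one match
                $i = "(" (++k) ")" $i
            print
        }'
}

printf 'Now is the time\nfor all good men\nto come to the\naid of their party.\n' |
    number_by_fields i 26
```

On the sample data this prints i(27) through i(30), matching the output requested in the question. Lines without a match are printed untouched because no field is assigned.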

      


gensub() sounds perfect: it lets you replace the Nth match, so the obvious solution is to iterate over the string in a do {} while () loop, replacing one match at a time and incrementing j. This simple gensub() approach will not work if the replacement does not contain the original text (or worse, contains it multiple times); see below.

So in awk, which lacks Perl's "s///e" evaluation feature and its /g regex modifier (as used by Steve), the best remaining option is to split the lines into chunks (head, match, tail) and put them back together:

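BEGIN { 
    if (j=="") j=1      # default first number; override with -v j=N
    if (a=="") a="f"    # default search regex; override with -v a=REGEX
}
match($0,a) {           # match() sets RSTART and RLENGTH
    str=$0; newstr=""
    do {
         newstr=newstr substr(str,1,RSTART-1) # head
         mm=substr(str,RSTART,RLENGTH)        # extract match
         mm=mm "(" (j++) ")"                  # replace: matched text plus counter
         newstr=newstr mm 
         str=substr(str,RSTART+RLENGTH)       # tail
    } while (match(str,a))
    $0=newstr str     
}
{print}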

      

match() is used as an expression instead of a // pattern so that you can use a variable. (You can also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.)

You can define j and a on the command line as well.
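
Since the question notes that the replacement will rarely include the matched text, here is a sketch of the same head/match/tail loop with the replacement assembled from separate pieces. The function name `renumber` and the variables `re`, `pre`, and `post` are hypothetical, and it uses only POSIX awk features:

```shell
renumber() {
    # $1 = regex, $2 = text before the number, $3 = text after it, $4 = first number
    awk -v re="$1" -v pre="$2" -v post="$3" -v j="$4" '
        {
            rest = $0; out = ""
            while (rest != "" && match(rest, re)) {
                if (RLENGTH == 0) {              # empty match: copy one char and move on
                    out = out substr(rest, 1, 1)
                    rest = substr(rest, 2)
                    continue
                }
                # head, then the numbered replacement (matched text is dropped)
                out = out substr(rest, 1, RSTART - 1) pre (j++) post
                rest = substr(rest, RSTART + RLENGTH)   # tail still to be scanned
            }
            print out rest
        }'
}

printf 'Now is the time\nfor all good men\nto come to the\naid of their party.\n' |
    renumber i '[' ']' 27
```

Because the pieces are concatenated rather than passed through sub(), an `&` in `pre` or `post` stays literal. Anchors such as `^` are unreliable here, since each pass matches against the remaining tail rather than the whole line.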

gawk supports \y, which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the beginning and end of a word; just take care to add extra escapes on a Unix command line (I'm not entirely sure what Windows might require or allow).




Limited gensub() version

As mentioned above:

match($0,a) {
    idx=1; str=$0
    do {
        prev=str
        str=gensub(a,a"(" j ")",idx++,prev)
    } while (str!=prev && j++)
    $0=str
}

      

Problems here:

  • if you replace the substring "i" with the substring "k" or "k(1)", then the gensub() index for the next match will be off by 1. You can work around this if you either know it in advance or work backward through the string instead.
  • if you replace the substring "i" with the substring "ii" or "ii(i)", a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match).

Handling both conditions robustly is not worth the code.


I am not saying that this cannot be done with awk, but I would strongly suggest switching to a more powerful language. Use perl instead.

To number each occurrence of the letter i, with the counter initialized to 26, try:

perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt

      

The initial value can also come from a shell variable:

var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt

      

Results:

Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.

      


To count specific words, add word boundaries (i.e. \b) around the word; try:

perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt

      

Results:

Now is the(6) time
for all good men
to come to the(7)
aid of their party.

      
