Gsub awk problem (gawk)
I need to search a text file for a string and make a replacement that includes a number that increases with each match.
The string to be "found" can be a single character, word, or phrase.
The replacement expression will not always be the same (as in my examples below), but will always include a number (variable) that is incremented.
For example:
1) I have a test file named "data.txt". The file contains:
Now is the time
for all good men
to come to the
aid of their party.
2) I put the awk script in a file called "cmd.awk". The file contains:
/f/ {sub ("f","f(" ++j ")")}1
3) I am using awk like this:
awk -f cmd.awk data.txt
In this case, the output will be as expected:
Now is the time
f(1)or all good men
to come to the
aid of(2) their party.
The problem occurs when there is more than one match on a line. For example, if I were looking for the letter "i":
/i/ {sub ("i","i(" ++j ")")}1
Output:
Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.
which is wrong because it misses the "i" in "time" and in "their".
So, I tried "gsub" instead of "sub", for example:
/i/ {gsub ("i","i(" ++j ")")}1
Output:
Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.
It now replaces all occurrences of the letter "i", but the inserted number is the same for all matches on the same line.
The required output should be:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
Note: the number doesn't always start at "1", so I may run awk like this:
awk -f cmd.awk -v j=26 data.txt
To get the result:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
And just to be clear, the replacement number will not always be inside the parentheses. And the replacement won't always include the matched string (in fact, that would be pretty rare).
Another problem I am facing is ...
I want to use an awk variable (not an environment variable) for the "search string" so I can specify it on the awk command line.
For example:
1) I put the awk script in a file called "cmd.awk". The file contains something like:
/??a??/ {gsub (a,a "(" ++j ")")}1
2) I would use awk like this:
awk -f cmd.awk -v a=i data.txt
To get the result:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
The question here is: how do I represent the variable "a" in the /search/ pattern?
gensub() sounds perfect: it lets you replace the Nth match, so the apparent solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach will not work if the replacement does not contain the original text (or worse, contains it multiple times); see below.
So in awk, which lacks perl's "s///e" evaluation feature and its /g regex modifier (as used by Steve), the best remaining option is to split the lines into chunks (head, match, tail) and put them back together again:
BEGIN {
    if (j=="") j=1
    if (a=="") a="f"
}
match($0,a) {
    str=$0; newstr=""
    do {
        newstr=newstr substr(str,1,RSTART-1)   # head
        mm=substr(str,RSTART,RLENGTH)          # extract match
        sub(a,a"("j++")",mm)                   # replace
        newstr=newstr mm
        str=substr(str,RSTART+RLENGTH)         # tail
    } while (match(str,a))
    $0=newstr str
}
{print}
match() is used as an expression instead of a // pattern so that you can use a variable. (You could also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.)
You can define j and a on the command line as well.
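As a minimal illustration of the "($0 ~ a)" form mentioned above, the following sketch passes the search string with -v and uses it as a dynamic regexp. The sample text is the question's data.txt; the NR prefix in the output is only an illustrative choice.

```shell
data='Now is the time
for all good men
to come to the
aid of their party.'

# $0 ~ a treats the string held in the awk variable a as a regexp,
# so no /.../ literal is needed in the script.
out=$(printf '%s\n' "$data" | awk -v a=i '$0 ~ a { print NR ": " $0 }')
printf '%s\n' "$out"
```

Only the first and fourth lines contain an "i", so only those two are printed.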
gawk supports \y, which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the beginning and end of a word. Just take care to add the extra escaping needed on a unix command line (I'm not entirely sure what Windows might require or allow).
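A sketch of how the word-boundary operators might be combined with the head/match/tail idea, assuming gawk is installed (\< and \> are gawk extensions). The backslashes are doubled because -v processes escape sequences in the assigned value. The variable names res and rest are illustrative, and the counter is pre-incremented here so that j=5 yields 6 first.

```shell
if command -v gawk >/dev/null 2>&1; then
  out=$(printf '%s\n' 'Now is the time' 'to come to the' 'aid of their party.' |
    gawk -v a='\\<the\\>' -v j=5 '
      {
        res = ""; rest = $0
        while (match(rest, a)) {
          # head, then the matched text followed by the next number
          res = res substr(rest, 1, RSTART - 1) \
                    substr(rest, RSTART, RLENGTH) "(" ++j ")"
          rest = substr(rest, RSTART + RLENGTH)   # tail still to scan
        }
        print res rest
      }')
  printf '%s\n' "$out"
fi
```

Because the tail is rescanned as a fresh string, its first character always looks like the start of input, so a pattern that begins with \< but does not end with \> could match in the middle of a word; anchoring both ends avoids that.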
Limited gensub() version
As mentioned above:
match($0,a) {
    idx=1; str=$0
    do {
        prev=str
        str=gensub(a,a"(" j ")",idx++,prev)
    } while (str!=prev && j++)
    $0=str
}
Problems here:
- if you replace the substring "i" with the substring "k" or "k(1)", then the gensub() index for the next match will be off by 1. You can get around this if you either know it beforehand or work backwards through the string instead.
- if you replace the substring "i" with the substring "ii" or "ii(i)", a similar problem occurs (resulting in an infinite loop, as gensub() keeps finding a new match).
Dealing with both conditions reliably is not worth the code.
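For the first case, if you know beforehand that the replacement can never match the pattern again, one possible workaround is to keep replacing the first remaining match until none is left. This is a sketch under that assumption only, using plain sub() rather than gensub(); the replacement text "k(N)" is illustrative.

```shell
out=$(printf '%s\n' 'Now is the time' 'aid of their party.' |
  awk -v a=i -v j=1 '
    {
      # Safe only because "k(N)" contains nothing that matches a;
      # a replacement that does match a would loop forever here.
      while ($0 ~ a)
        sub(a, "k(" j++ ")")
      print
    }')
printf '%s\n' "$out"
```

The counter is incremented only when a match is known to exist, so it does not drift on lines without a match.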
I am not saying that this cannot be done with awk, but I would strongly suggest switching to a more powerful language. Use perl instead.
To number each "i" with a counter initialised to 26 (so the first number inserted is 27), try:
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt
The starting value can also come from a shell variable:
var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt
Results:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
To count specific words, add word boundaries (i.e. \b) around the word:
perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt
Results:
Now is the(6) time
for all good men
to come to the(7)
aid of their party.