Join each multi-line record into one line when records are not delimited
I need to process records that span multiple lines: I want to convert each multi-line record into a single string and then extract what I need from it. The records are not delimited, so I can't just set RS to \n\n.
cat input
constant_string bla bla1
bla bla bal
fooo foooooo baaar #End of record 1
constant_string bla1 bla2
abcd cdfe fghi jkhil
foo bar bar bar bar bar bar #End of record 2
constant_string bla bla3
random data is present #End of record 3
To achieve this, I converted the undelimited data into delimited data by inserting a blank line before each record, for example:
awk '{gsub(/^constant_string/,"\n&")}1' input

constant_string bla bla1
bla bla bal
fooo foooooo baaar

constant_string bla1 bla2
abcd cdfe fghi jkhil
foo bar bar bar bar bar bar

constant_string bla bla3
random data is present
Once the records are delimited, I can set RS to split on blank lines (RS= below, i.e. paragraph mode) and do whatever I want.
awk '{gsub(/^constant_string/,"\n&")}1' input |awk -v RS= '{$1=$1}1'
constant_string bla bla1 bla bla bal fooo foooooo baaar
constant_string bla1 bla2 abcd cdfe fghi jkhil foo bar bar bar bar bar bar
constant_string bla bla3 random data is present
Question:
I can achieve this in two steps; is it possible to do it in one step in awk?
I tried the following, but it didn't work:
awk -v RS="" '{gsub(/^constant_string/,"\n&")}1' input
awk -v RS="" '{$0=gensub(/^constant_string/,"\n&",$0)}1' input
How about buffering the lines in b and processing the buffer at the next constant_string and in END? Using a function:
$ awk '
function process(str) { if(str!="") print str }
/^constant_string/ { process(b); b=$0; next }
{ b=b OFS $0 }
END { process(b) }
' file
constant_string bla bla1 bla bla bal fooo foooooo baaar
constant_string bla1 bla2 abcd cdfe fghi jkhil foo bar bar bar bar bar bar
constant_string bla bla3 random data is present
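A streaming variant of the same idea, as a minimal sketch: instead of accumulating each record in b, it prints every line as it arrives and emits a newline only when the next constant_string line shows up. The helper name join_records is made up for this illustration, and the sketch assumes that the first input line starts with constant_string.

```shell
# join_records is a hypothetical helper name, not part of the original answer.
# Assumes the first line of input starts with constant_string.
join_records() {
  awk '
    /^constant_string/ {             # a new record starts here
      if (NR > 1) printf "\n"        # terminate the previous record
      printf "%s", $0
      next
    }
    { printf " %s", $0 }             # continuation line: append after a space
    END { if (NR > 0) printf "\n" }  # terminate the last record
  ' "$@"
}

# Sample data from the question, without the "#End of record" annotations.
join_records <<'EOF'
constant_string bla bla1
bla bla bal
fooo foooooo baaar
constant_string bla1 bla2
abcd cdfe fghi jkhil
foo bar bar bar bar bar bar
constant_string bla bla3
random data is present
EOF
```

Because nothing is buffered, this form suits very long records, but it only joins lines; if each record needs further processing as a whole, the buffered version above is the better fit.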
awk 'BEGIN { RS = "(^|\n)constant_string" }
# skip the "empty" record that precedes the first separator
/./ {
    # $1 is the first "word" (default FS) AFTER the constant string, which is
    # removed from $0 because it acts as the record separator.
    # Note that this is now a multi-line record.
    # ... process it however you want
    print " -- " NR " : [" $0 "]"
    for (i = 1; i <= NF; i++) print NR "." i " : " $i
}
' YourFile
Note:
- This depends on the awk version: POSIX leaves a multi-character RS unspecified, and many traditional awks use only its first character as the separator, whereas gawk treats the whole string as a regex (which is what this relies on).
- Check that your constant string contains no regex metacharacters, or escape them.
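If both caveats are a concern, one way around them is to match the marker as a literal prefix with index() and buffer the lines, which needs neither a multi-character RS nor a regex. This is only a sketch: the helper name join_on_marker and the marker rec[1].start are invented for the example.

```shell
# join_on_marker and the marker value are hypothetical, chosen for illustration.
join_on_marker() {
  # $1 is the literal marker. It must not contain backslashes, because
  # awk -v interprets escape sequences in the assigned value.
  awk -v marker="$1" '
    index($0, marker) == 1 {         # line starts with the literal marker
      if (buf != "") print buf       # flush the previous record
      buf = $0
      next
    }
    { buf = (buf == "" ? $0 : buf OFS $0) }
    END { if (buf != "") print buf }
  '
}

# "rec[1].start" contains regex metacharacters but is matched literally here.
join_on_marker 'rec[1].start' <<'EOF'
rec[1].start bla bla1
bla bla bal
rec[1].start bla2
rec1-start is ordinary data
EOF
```

The last input line would match /^rec[1].start/ if the marker were used as a regex; with index() it is treated as ordinary data and appended to the second record. Unlike the RS approach, the marker stays in $0.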