Awk script to match a pattern and then remove the rest of the line after the delimiter
I have a file with multiple lines containing alphanumeric strings like ZINC123345667_123, interspersed with other lines. I need to remove the "_" separator and the digits after it, but only on lines containing "ZINC"; all other lines should remain unchanged. I tried the following awk command, but the output contained only the lines with "ZINC" and none of the others.
My original data:
Name: ZINC00000036_1
Grid Score: -23.170839
Grid_vdw: -22.304409
Grid_es: -0.866430
Int_energy: 4.932559
@<TRIPOS>MOLECULE
ZINC00000036_1
18 18 1 0 0
Name: ZINC00000053_3
Grid Score: -23.739523
Grid_vdw: -22.876204
Grid_es: -0.863320
Int_energy: 9.981080
@<TRIPOS>MOLECULE
ZINC00000053_3
20 20 1 0 0
Name: ZINC00000351_12
Grid Score: -30.763229
Grid_vdw: -27.735493
Grid_es: -3.027738
Int_energy: 4.097543
@<TRIPOS>MOLECULE
ZINC00000351_12
31 31 1 0 0
I ran the following awk script:
awk -F'_' '/ZINC/ {print $1}' data.file > out.file
The result is:
Name: ZINC00000036
ZINC00000036
Name: ZINC00000053
ZINC00000053
Name: ZINC00000351
ZINC00000351
But I also need the other lines in the output file, as shown below:
Name: ZINC00000036
Grid Score: -23.170839
Grid_vdw: -22.304409
Grid_es: -0.866430
Int_energy: 4.932559
@<TRIPOS>MOLECULE ZINC00000036 18 18 1 0 0
Name: ZINC00000053
Grid Score: -23.739523
Grid_vdw: -22.876204
Grid_es: -0.863320
Int_energy: 9.981080
@<TRIPOS>MOLECULE ZINC00000053 20 20 1 0 0
Name: ZINC00000351
Grid Score: -30.763229
Grid_vdw: -27.735493
Grid_es: -3.027738
Int_energy: 4.097543
@<TRIPOS>MOLECULE ZINC00000351 31 31 1 0 0
Since my data file is huge and converting it by hand is not feasible, I would greatly appreciate any help with awk.
I would tackle this with sed:
sed -E '/ZINC[0-9]+_/s/_.*//' yourfile
That says: on any line containing "ZINC" followed by some digits followed by an underscore, replace the underscore and everything after it on that line with nothing (i.e. delete it). Lines in yourfile that do not match the address are printed unchanged.
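To see the address at work, here is a small sketch on a few lines taken from the question's data. The file name sample.txt is only a placeholder for illustration. Note that Grid_vdw also contains an underscore, but it is left alone because the line does not match /ZINC[0-9]+_/:

```shell
# Build a small sample from the question's data (sample.txt is a hypothetical name).
printf '%s\n' \
  'Name: ZINC00000036_1' \
  'Grid_vdw: -22.304409' \
  '@<TRIPOS>MOLECULE' \
  'ZINC00000036_1' \
  '18 18 1 0 0' > sample.txt

# The substitution is applied only to lines matching the address.
sed -E '/ZINC[0-9]+_/s/_.*//' sample.txt
# prints:
# Name: ZINC00000036
# Grid_vdw: -22.304409
# @<TRIPOS>MOLECULE
# ZINC00000036
# 18 18 1 0 0
```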
If you add -i right after the sed command name, the file is edited in place, so you do not have to create a second file.
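A hedged sketch of in-place editing: as far as I know, the form -i.bak (suffix attached directly to the option) is accepted by both GNU and BSD sed, whereas a bare -i behaves differently between the two. It also keeps a backup copy, which seems prudent for a huge file. The names yourfile.txt and .bak are placeholders, not anything from the question:

```shell
# Hypothetical input file for the demonstration.
printf '%s\n' 'Name: ZINC00000053_3' 'Grid_es: -0.863320' > yourfile.txt

# Edit in place; the original is kept as yourfile.txt.bak.
sed -i.bak -E '/ZINC[0-9]+_/s/_.*//' yourfile.txt

cat yourfile.txt      # edited file: "Name: ZINC00000053" then "Grid_es: -0.863320"
cat yourfile.txt.bak  # untouched original
```

Once the result looks right, the .bak file can be deleted.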
I don't think awk is the right tool for the job. A simple sed command will do it:
sed 's/\(ZINC[0-9]\{1,\}\)_[0-9]\{1,\}/\1/' file # most portable
sed 's/\(ZINC[0-9]\+\)_[0-9]\+/\1/' file # GNU sed
sed -E 's/(ZINC[0-9]+)_[0-9]+/\1/' file # extended regex mode
Grab the part before the underscore (ZINC, then some numbers) and discard the rest.
The same in Perl, which is marginally shorter thanks to the character class \d:
perl -pe 's/(ZINC\d+)_\d+/$1/' file
Come to think of it, if you do decide to use awk, this will work:
awk -F_ '/ZINC/{$0=$1}1' file
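When ZINC is matched, the line ($0) is replaced with the contents of the first field, i.e. everything before the first underscore. The 1 at the end is an always-true pattern with the default action, which ensures that every line is printed.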
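One caveat: with -F_ and /ZINC/, any line containing ZINC is cut at its first underscore, wherever that underscore is. That is fine for the sample data. If a stricter match is wanted, a possible variant (a sketch, not a required change) uses sub() to strip only a trailing _digits suffix that follows the ZINC number. The file name sample.txt is again just a placeholder:

```shell
# Small sample from the question's data (sample.txt is a hypothetical name).
printf '%s\n' \
  'Name: ZINC00000351_12' \
  'Grid_es: -3.027738' \
  'ZINC00000351_12' \
  '31 31 1 0 0' > sample.txt

# Strip "_<digits>" at the end of ZINC lines only; the trailing 1 prints every line.
awk '/ZINC[0-9]+_[0-9]+$/ { sub(/_[0-9]+$/, "") } 1' sample.txt
# prints:
# Name: ZINC00000351
# Grid_es: -3.027738
# ZINC00000351
# 31 31 1 0 0
```

This assumes the ID sits at the very end of the line; trailing spaces or carriage returns in the real file would prevent the match.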