Awk script to match a pattern and then remove the rest of the line after the delimiter

I have a file with multiple lines containing alphanumeric strings like ZINC123345667_123, followed by other lines. I need to remove the "_" separator and the digits after it, but only on lines containing "ZINC"; all other lines should remain unchanged. I tried the following awk command, but the output contained only the lines with "ZINC" and none of the others.

My original data:

 Name:      ZINC00000036_1
 Grid Score:          -23.170839
 Grid_vdw:          -22.304409
 Grid_es:           -0.866430
 Int_energy:            4.932559

@<TRIPOS>MOLECULE
ZINC00000036_1
 18 18 1 0 0

Name:       ZINC00000053_3
 Grid Score:          -23.739523
 Grid_vdw:          -22.876204
 Grid_es:           -0.863320
 Int_energy:            9.981080

@<TRIPOS>MOLECULE
ZINC00000053_3
 20 20 1 0 0

 Name:      ZINC00000351_12
 Grid Score:          -30.763229
 Grid_vdw:          -27.735493
 Grid_es:           -3.027738
 Int_energy:            4.097543

@<TRIPOS>MOLECULE
ZINC00000351_12
 31 31 1 0 0

      

I ran the following awk script:

awk -F'_' '/ZINC/ {print $1}' data.file > out.file

      

The result is:

Name:       ZINC00000036
ZINC00000036
Name:       ZINC00000053
ZINC00000053
Name:       ZINC00000351
ZINC00000351

      

But I need the other lines in the output file as well, like this:

 Name:      ZINC00000036
 Grid Score:          -23.170839
 Grid_vdw:          -22.304409
 Grid_es:           -0.866430
 Int_energy:            4.932559

@<TRIPOS>MOLECULE
ZINC00000036
 18 18 1 0 0

 Name:      ZINC00000053
 Grid Score:          -23.739523
 Grid_vdw:          -22.876204
 Grid_es:           -0.863320
 Int_energy:            9.981080

@<TRIPOS>MOLECULE
ZINC00000053
 20 20 1 0 0

 Name:      ZINC00000351
 Grid Score:          -30.763229
 Grid_vdw:          -27.735493
 Grid_es:           -3.027738
 Int_energy:            4.097543

@<TRIPOS>MOLECULE
ZINC00000351
 31 31 1 0 0

      

Since my data file is huge, converting it by hand is not feasible, so I would greatly appreciate any help with awk.

+3


5 answers


To keep only the portion before the first underscore (_) on lines containing ZINC and leave the rest of the lines intact, you can do:



awk -F'_' '/ZINC/{print $1;next}1' file
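
To see why the command in the question dropped the other lines, and how next plus the trailing 1 keeps them, here is a small sketch. The four sample lines and the variable names (sample, only_zinc, all_lines) are made up for illustration and only modelled on the data in the question.

```shell
# Build a small sample in a temporary file (hypothetical data modelled on the question).
sample=$(mktemp)
printf '%s\n' ' Name:      ZINC00000036_1' ' Grid_vdw:          -22.304409' \
    '@<TRIPOS>MOLECULE' 'ZINC00000036_1' > "$sample"

# Original attempt: the action runs only on ZINC lines, so nothing else is printed.
only_zinc=$(awk -F'_' '/ZINC/ {print $1}' "$sample")

# Fixed version: on ZINC lines print field 1 and skip to the next line;
# the trailing 1 is an always-true pattern whose default action prints every other line.
all_lines=$(awk -F'_' '/ZINC/{print $1;next}1' "$sample")

printf '%s\n' "$all_lines"
rm -f "$sample"
```

Note that the Grid_vdw line also contains an underscore, but it is printed whole because it does not match /ZINC/.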

      

0



sed '/ZINC/s/_.*//' file
awk '/ZINC/{sub(/_.*/,"")}1' file

      



+2



I would tackle this with sed:

sed -E '/ZINC[0-9]+_/s/_.*//' yourfile

      

That says: on any line in yourfile containing "ZINC" followed by some digits followed by an underscore, replace the underscore and everything after it on the line with nothing.

If you add -i after the sed command name, the file is edited in place, so you do not have to create a second file.
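
A minimal sketch of the in-place variant follows. It assumes a sed that supports -i, which is a non-POSIX extension; the attached-suffix form -i.bak is accepted by both GNU and BSD sed, whereas a bare -i behaves differently between the two. The file here is a throwaway created just for the example, and the variable name f is arbitrary.

```shell
# Work on a throwaway file so nothing real is modified (hypothetical data).
f=$(mktemp)
printf '%s\n' 'ZINC00000053_3' ' Grid_es:           -0.863320' > "$f"

# -i.bak edits the file in place and keeps the original as "$f.bak".
sed -i.bak -E '/ZINC[0-9]+_/s/_.*//' "$f"

cat "$f"        # edited content
cat "$f.bak"    # untouched original
```

Keeping a backup suffix is a cautious choice for a huge data file; once the result has been checked, the .bak copy can be deleted.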

+1



I don't think awk is the right tool for the job. A simple sed command will do it:

sed 's/\(ZINC[0-9]\{1,\}\)_[0-9]\{1,\}/\1/' file  # most portable
sed 's/\(ZINC[0-9]\+\)_[0-9]\+/\1/' file          # GNU sed
sed -E 's/(ZINC[0-9]+)_[0-9]+/\1/' file           # extended regex mode

      

Capture the part before the underscore (ZINC followed by digits) and drop the underscore and the digits after it.

The same in Perl, which is marginally shorter thanks to the character class \d:

perl -pe 's/(ZINC\d+)_\d+/$1/' file

      

Come to think of it, if you decide to use awk, this will work:

awk -F_ '/ZINC/{$0=$1}1' file

      

When ZINC is matched, rewrite the line with the content of the first field. The 1 at the end ensures that every line is printed.
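
On the data in the question the sed and awk versions give the same result, but they are not equivalent in general. The sketch below uses a made-up line with extra text after the ID (an assumption, nothing like it appears in the question) to show the difference.

```shell
# Hypothetical line with text after the ID, to show where the two approaches differ.
line='ZINC00000036_1 extra'

# sed removes only "_<digits>" directly after the ZINC ID; trailing text survives.
sed_out=$(printf '%s\n' "$line" | sed 's/\(ZINC[0-9]\{1,\}\)_[0-9]\{1,\}/\1/')

# awk replaces the whole record with field 1, so everything after the first "_" is dropped.
awk_out=$(printf '%s\n' "$line" | awk -F_ '/ZINC/{$0=$1}1')

printf 'sed: %s\nawk: %s\n' "$sed_out" "$awk_out"
```

If lines may carry anything after the ID that has to be kept, the sed form is the safer of the two.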

+1



Another approach using sed:

sed 's/\(ZINC[0-9]*\)\(_.*\)/\1/g' inputfile

      

This replaces the matched text with the first captured group (the part before the underscore). All other lines are printed unchanged.

0

