Awk script to match the pattern and then remove the whole line after the delimiter

Question

I have a file containing multiple lines with alphanumeric strings like ZINC123345667_123, followed by other lines. I need to remove the "_" separator and the digits after it only on lines containing "ZINC", leaving all other lines unchanged. I tried the following awk command, but the output contained only the "ZINC" lines and none of the others.

My original data:

 Name:      ZINC00000036_1
 Grid Score:          -23.170839
 Grid_vdw:          -22.304409
 Grid_es:           -0.866430
 Int_energy:            4.932559

@<TRIPOS>MOLECULE
ZINC00000036_1
 18 18 1 0 0

Name:       ZINC00000053_3
 Grid Score:          -23.739523
 Grid_vdw:          -22.876204
 Grid_es:           -0.863320
 Int_energy:            9.981080

@<TRIPOS>MOLECULE
ZINC00000053_3
 20 20 1 0 0

 Name:      ZINC00000351_12
 Grid Score:          -30.763229
 Grid_vdw:          -27.735493
 Grid_es:           -3.027738
 Int_energy:            4.097543

@<TRIPOS>MOLECULE
ZINC00000351_12
 31 31 1 0 0

I ran the following awk script:

awk -F'_' '/ZINC/ {print $1}' data.file > out.file

The result is:

Name:       ZINC00000036
ZINC00000036
Name:       ZINC00000053
ZINC00000053
Name:       ZINC00000351
ZINC00000351

But I need the other lines in the output file as well, like this:

 Name:      ZINC00000036
 Grid Score:          -23.170839
 Grid_vdw:          -22.304409
 Grid_es:           -0.866430
 Int_energy:            4.932559

@<TRIPOS>MOLECULE
ZINC00000036
 18 18 1 0 0

 Name:      ZINC00000053
 Grid Score:          -23.739523
 Grid_vdw:          -22.876204
 Grid_es:           -0.863320
 Int_energy:            9.981080

@<TRIPOS>MOLECULE
ZINC00000053
 20 20 1 0 0

 Name:      ZINC00000351
 Grid Score:          -30.763229
 Grid_vdw:          -27.735493
 Grid_es:           -3.027738
 Int_energy:            4.097543

@<TRIPOS>MOLECULE
ZINC00000351
 31 31 1 0 0

Since my data file is huge and converting it by hand is not feasible, I would greatly appreciate any help with awk.

+3

linux awk sed

Asha 09 Aug 14 at 20:14

5 answers

sed '/ZINC/s/_.*//' file
awk '/ZINC/{sub(/_.*/,"")}1' file

+2

Ed Morton 10 Aug '14 at 15:30

I would tackle this with sed:

sed -E '/ZINC[0-9]+_/s/_.*//' yourfile

That says: on any line in yourfile containing "ZINC" followed by some digits followed by an underscore, substitute (i.e. replace) the underscore and everything after it on that line with nothing.

If you add -i after the sed command, it edits the file in place without having to create a second file.
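A minimal sketch of the in-place variant, assuming a sed that accepts -E and an attached backup suffix after -i (both GNU and BSD sed should). The file sample.txt and its contents are made up for illustration, modelled on the data in the question. Note that the Grid_vdw line also contains an underscore but is left alone, because the address only selects ZINC lines.

```shell
# Hypothetical sample file in a scratch directory (names are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' \
    ' Name:      ZINC00000036_1' \
    ' Grid_vdw:          -22.304409' \
    'ZINC00000036_1' > "$tmpdir/sample.txt"

# -i.bak edits sample.txt in place and keeps the original as sample.txt.bak.
sed -i.bak -E '/ZINC[0-9]+_/s/_.*//' "$tmpdir/sample.txt"

# Show the edited file; the untouched copy is in sample.txt.bak.
cat "$tmpdir/sample.txt"
```

Keeping the backup suffix is a cautious choice for a huge file: if the result looks wrong, the original is still there.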

+1

Mark Setchell 09 Aug 14 at 22:10

I don't think awk is the right tool for the job. A simple sed command will do it:

sed 's/\(ZINC[0-9]\{1,\}\)_[0-9]\{1,\}/\1/' file  # most portable
sed 's/\(ZINC[0-9]\+\)_[0-9]\+/\1/' file          # GNU sed
sed -E 's/(ZINC[0-9]+)_[0-9]+/\1/' file           # extended regex mode

Grab the part before the underscore (ZINC, then some numbers) and discard the rest.
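To make the difference from the '/ZINC/s/_.*//' form used in other answers concrete, here is a small sketch on a made-up line with extra text after the ID (the question's data has no such lines, so this is purely an illustration). The capture-group form removes only the underscore and the digits that follow it, while the address form removes everything from the first underscore onward.

```shell
# Hypothetical line with trailing content after the ID.
line='ZINC00000036_1 trailing text'

# Capture-group form: drops only "_1" and keeps the rest of the line.
kept=$(printf '%s\n' "$line" | sed 's/\(ZINC[0-9]\{1,\}\)_[0-9]\{1,\}/\1/')

# Address form: drops everything from the first underscore onward.
cut=$(printf '%s\n' "$line" | sed '/ZINC/s/_.*//')

printf '%s\n' "$kept"   # ZINC00000036 trailing text
printf '%s\n' "$cut"    # ZINC00000036
```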

The same in Perl, which is marginally shorter thanks to the character class \d:

perl -pe 's/(ZINC\d+)_\d+/$1/' file

Come to think of it, if you do decide to use awk, this will work:

awk -F_ '/ZINC/{$0=$1}1' file

When ZINC is matched, the line is overwritten with the contents of the first field. The 1 at the end ensures that every line is printed.
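For readers unfamiliar with the bare 1 idiom, this is a sketch of the same one-liner written out long-hand, run on a few lines modelled on the question's data. The variable names input and result are just for illustration.

```shell
# Hypothetical sample input, modelled on the data in the question.
input=' Name:      ZINC00000351_12
 Grid_es:           -3.027738
@<TRIPOS>MOLECULE
ZINC00000351_12
 31 31 1 0 0'

# Long-hand form of: awk -F_ '/ZINC/{$0=$1}1'
result=$(printf '%s\n' "$input" | awk -F_ '
    /ZINC/ { $0 = $1 }   # on ZINC lines, keep only the text before the first "_"
    { print }            # same effect as the bare 1: print every line
')

printf '%s\n' "$result"
```

The Grid_es line passes through untouched even though it contains an underscore, since the assignment only runs on lines matching ZINC.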

+1

Tom Fenech 09 Aug 14 at 23:09

Another approach using sed:

sed 's/\(ZINC[0-9]*\)\(_.*\)/\1/g' inputfile

This replaces the matched text with the first captured group, i.e. the part before the underscore. All other lines are printed unchanged.

0

Sadhun 02 Mar 15 at 13:40

user000001 · Accepted Answer · 2014-08-09T20:19:37+0000

To keep only the portion before the first underscore _ on lines containing ZINC and leave the rest of the lines intact, you can do:

awk -F'_' '/ZINC/{print $1;next}1' file
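As a rough illustration of why the command in the question dropped lines while this one does not, the sketch below runs both on a few made-up lines modelled on the question's data. The variable names are hypothetical.

```shell
# Hypothetical sample input, modelled on the data in the question.
input=' Name:      ZINC00000036_1
 Grid Score:          -23.170839
 Grid_vdw:          -22.304409
ZINC00000036_1'

# Original attempt: the action runs only for lines matching /ZINC/,
# and nothing prints the other lines.
only_zinc=$(printf '%s\n' "$input" | awk -F'_' '/ZINC/ {print $1}')

# Form from this answer: on ZINC lines print $1 and skip to the next record;
# the trailing 1 prints every line that was not skipped.
all_lines=$(printf '%s\n' "$input" | awk -F'_' '/ZINC/{print $1;next}1')

printf '%s\n' "$only_zinc" | wc -l   # counts 2 lines
printf '%s\n' "$all_lines" | wc -l   # counts 4 lines
```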
