How to extract characters before a pattern

Question

I need help extracting a specific substring from a string.

I have a file with thousands of lines, for example:

Eukaryota; Alveolata; Ciliophora; Intramacronucleata; Paramecium#
Eukaryota; Viridiplantae; Streptophyta; Embryophyta#
Bacteria; Cyanobacteria; Synechococcales; Acaryochloridaceae; Acaryochloris#
Eukaryota; Viridiplantae# 
Bacteria; Proteobacteria; Alphaproteobacteria#

And I would like to get the first and last element of each row. So the output will be:

Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae# 
Bacteria; Alphaproteobacteria#

I know how to get the 1st column with

awk '{print$1}' fileIn > fileOut

but I don't know how to get the last item, as it is in a different column on each line.

I tried adding a # to each line and then keeping only the XX characters in front of the # with

grep -E -o '.{X,X}PATTERN' fileIn > fileOut

where the output looks like this:

le; Sulfolobaceae; Sulfolobus #
; Thermoproteaceae; Caldivirga #
le; Haloferacaceae; Haloferax #
Haloferacaceae; Haloquadratum #
Ale; Natrialbaceae; Natrialba #

But then I have to repeat the procedure, deleting up to each ; until only the last item is left.

I have searched for a grep or awk option to extract the first and last columns, or to extract only the characters attached to the #, but I couldn't find anything that works for me.

I would be grateful for any suggestions on how to proceed.

Thanks.

regex grep awk

vimac Jul 26 17 at 8:18

4 answers

James brown · Answer 1 · 2017-07-26T08:31:25+0000

$ awk 'BEGIN{FS=OFS=";"}{print $1,$NF}' file
Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae# 
Bacteria; Alphaproteobacteria#

CWLiu · Answer 2 · 2017-07-26T08:36:25+0000

Since the delimiter in your file is ;, you can also use gsub(/;.*;/,";",$0) to replace everything between the first and last ; with a single ;, which leaves only the first and last fields.

$ awk '{gsub(/;.*;/,";")}1' fileIn > fileOut
$ cat fileOut
Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae# 
Bacteria; Alphaproteobacteria#
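
The greedy .* is what makes this work: /;.*;/ matches from the first ; through the last ; on the line, so one substitution drops every middle field. A minimal sketch to check that on sample lines, assuming a hypothetical helper named first_last (not part of the answer above) that reads from standard input:

```shell
# Hypothetical helper: keep only the first and last ;-separated fields.
# The greedy .* spans from the first ; to the last ;, so a single
# substitution removes every field in between.
first_last() {
  awk '{ gsub(/;.*;/, ";") } 1'
}

# A line with several fields, then a line with only two fields.
printf '%s\n' 'Bacteria; Cyanobacteria; Synechococcales; Acaryochloris#' | first_last
printf '%s\n' 'Eukaryota; Viridiplantae#' | first_last
```

A line with only one ; contains no match for the pattern and passes through unchanged, which is the wanted result for two-field rows.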

Claes wikner · Answer 3 · 2017-07-26T10:48:21+0000

awk '{print $1,$NF}' file

Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae#
Bacteria; Alphaproteobacteria#

mkHun · Answer 4 · 2017-07-26T10:14:17+0000

You can try the following Perl one-liner:

perl -aF';' -ne 'print "$F[0];$F[-1]"' test.txt

-a      turns on autosplit mode
-F';'   sets the field separator to ;

The split fields are stored in the array @F:

$F[0]   holds the first column (first index)
$F[-1]  holds the last column (last index)
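
For comparison, the same split-into-an-array idea can be sketched in awk with split(), where parts[1] and parts[n] play the roles of $F[0] and $F[-1]. This is only an illustrative equivalent, not part of the answer above; the function name first_last_split and the array name parts are made up for the example.

```shell
# Hypothetical helper: split each line on ; into an array and print
# the first and last pieces, mirroring Perl's @F with -a and -F.
first_last_split() {
  awk '{
    n = split($0, parts, ";")          # n is the number of pieces
    if (n > 1)
      print parts[1] ";" parts[n]      # first and last piece
    else
      print $0                         # no separator: print the line as-is
  }'
}

printf '%s\n' 'Eukaryota; Alveolata; Ciliophora; Paramecium#' | first_last_split
```

The last piece keeps its leading space, so joining with a bare ; reproduces the "; " spacing of the input.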
