How to extract characters before a pattern

I need help extracting a specific part of a string.

I have a file with thousands of lines, for example:

Eukaryota; Alveolata; Ciliophora; Intramacronucleata; Paramecium#
Eukaryota; Viridiplantae; Streptophyta; Embryophyta#
Bacteria; Cyanobacteria; Synechococcales; Acaryochloridaceae; Acaryochloris#
Eukaryota; Viridiplantae# 
Bacteria; Proteobacteria; Alphaproteobacteria#

      

I would like to get the first and last element of each row, so the output would be:

Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae# 
Bacteria; Alphaproteobacteria# 

      

I know how to get the 1st column with

awk '{print$1}' fileIn > fileOut

      

but I don't know how to get the last item, as it is always in a different column.

I tried appending # to each line and then keeping only the X characters in front of the # with

grep -E -o '.{X,X}PATTERN' fileIn > fileOut

      

where the output looks like this:

le; Sulfolobaceae; Sulfolobus #
; Thermoproteaceae; Caldivirga #
le; Haloferacaceae; Haloferax #
Haloferacaceae; Haloquadratum #
Ale; Natrialbaceae; Natrialba #

But then I have to repeat the procedure and delete everything up to each ; until only the last item remains.

I have searched for a grep or awk option that extracts the first and last columns, or that extracts only the characters attached to the #, but I couldn't find anything that works for me.

I would be grateful for any suggestions on how to proceed.

Thanks.



4 answers


$ awk 'BEGIN{FS=OFS=";"}{print $1,$NF}' file
Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae# 
Bacteria; Alphaproteobacteria#
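
Some of the sample lines end with a space after the #. As a minimal sketch (the sample lines are copied from the question, and the variable names are just for illustration), the same FS/OFS approach can strip that trailing whitespace before printing:

```shell
# Three sample lines from the question; the second one has a trailing space.
input=$(printf '%s\n' \
  'Eukaryota; Alveolata; Ciliophora; Intramacronucleata; Paramecium#' \
  'Eukaryota; Viridiplantae# ' \
  'Bacteria; Proteobacteria; Alphaproteobacteria#')

# Split on ";" and rejoin the first and last fields with ";".
# sub() edits $0, so awk re-splits the fields after the trailing blanks are gone.
# The last field keeps its leading space, so the output reads "First; Last#".
result=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = OFS = ";" } { sub(/[ \t]+$/, ""); print $1, $NF }')

printf '%s\n' "$result"
```

Redirect the awk output to a file instead of capturing it in a variable if you want to process the whole file in one pass.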

      





Since the delimiter in your file is ;, you can also use gsub(/;.*;/,";",$0) to replace everything from the first ; to the last ; with a single ;, which leaves only the first and last fields.



$ awk '{gsub(/;.*;/,";")}1' fileIn > fileOut
$ cat fileOut
Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae# 
Bacteria; Alphaproteobacteria#
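
This works because .* is greedy, so the match runs from the first ; to the last ; on the line. A small sketch of both cases, using lines shaped like the ones in the question (the variable names are only illustrative):

```shell
# Several ";" on the line: the greedy match spans from the first to the last one.
long=$(printf '%s\n' 'Bacteria; Cyanobacteria; Synechococcales; Acaryochloris#' |
  awk '{ gsub(/;.*;/, ";") } 1')

# Only one ";" on the line: the regex needs two, so nothing is replaced.
short=$(printf '%s\n' 'Eukaryota; Viridiplantae#' |
  awk '{ gsub(/;.*;/, ";") } 1')

printf '%s\n%s\n' "$long" "$short"
```

Lines that already have just two fields therefore pass through unchanged, which is what the expected output requires.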

      



awk '{print $1,$NF}' file

Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae#
Bacteria; Alphaproteobacteria#
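
Note that this relies on awk's default whitespace splitting, so it only works while no name contains a space. A rough sketch of the difference, using a made-up line (hypothetical, not from the question's file) where two names have spaces:

```shell
# Hypothetical line: two of the names contain a space.
line='Bacteria; Candidatus Pelagibacter; Pelagibacter ubique#'

# Default whitespace splitting: the last field is only the final word.
by_space=$(printf '%s\n' "$line" | awk '{ print $1, $NF }')

# Splitting on "; " (semicolon plus space) keeps multi-word names intact.
by_sep=$(printf '%s\n' "$line" | awk -F'; ' -v OFS='; ' '{ print $1, $NF }')

printf '%s\n%s\n' "$by_space" "$by_sep"
```

If your file may contain such names, setting the separator explicitly is the safer choice.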

      



You can try the following Perl one-liner:

perl -aF';' -ne 'print "$F[0];$F[-1]"' test.txt

      

-a
Turns on autosplit mode; the split fields are stored in the array @F.

-F';'
Sets the field separator to ;

$F[0]
Contains the first field (first index).

$F[-1]
Contains the last field (last index), including the line's trailing newline.
