How to extract characters before a pattern
I need help extracting a specific substring from a string.
I have a file with thousands of lines, for example:
Eukaryota; Alveolata; Ciliophora; Intramacronucleata; Paramecium#
Eukaryota; Viridiplantae; Streptophyta; Embryophyta#
Bacteria; Cyanobacteria; Synechococcales; Acaryochloridaceae; Acaryochloris#
Eukaryota; Viridiplantae#
Bacteria; Proteobacteria; Alphaproteobacteria#
I would like to get the first and last element of each row, so the output would be:
Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae#
Bacteria; Alphaproteobacteria#
I know how to get the 1st column with
awk '{print$1}' fileIn > fileOut
but I don't know how to get the last item, as it is always in a different column.
I tried adding # and then just keeping the XX characters in front of the # with
grep -E -o '.{X,X}PATTERN' fileIn > fileOut
where the output looks like this:
le; Sulfolobaceae; Sulfolobus #
; Thermoproteaceae; Caldivirga #
le; Haloferacaceae; Haloferax #
Haloferacaceae; Haloquadratum #
Ale; Natrialbaceae; Natrialba #
But then I have to repeat the procedure, deleting everything up to each ; until only the last element remains.
I have searched for a grep or awk option that extracts the first and last columns, or extracts only the characters attached to #, but I couldn't find anything that works for me.
I would be grateful for any suggestions on how to proceed.
Thanks.
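Since the delimiter in your file is ; you can use gsub(/;.*;/,";",$0) to collapse everything between the first and the last ; into a single ; which leaves only the first and last fields. Because .* is greedy, the match runs from the first ; to the last one on the line; lines with only one ; do not match and are printed unchanged.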
$ awk '{gsub(/;.*;/,";")}1' fileIn > fileOut
$ cat fileOut
Eukaryota; Paramecium#
Eukaryota; Embryophyta#
Bacteria; Acaryochloris#
Eukaryota; Viridiplantae#
Bacteria; Alphaproteobacteria#
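If you would rather work with fields than with a regex, here is a minimal sketch of an alternative. It assumes the fields are always separated by exactly "; " (semicolon plus one space), and the helper name first_last is only an illustrative name, not something from your setup. It sets the field separator with -F and prints $1 and $NF (the last field); lines without any separator are passed through unchanged.

```shell
# Hypothetical helper: print the first and last "; "-separated field of each line.
# Assumes the separator is exactly "; " (semicolon followed by one space).
first_last() {
  awk -F'; ' '
    NF > 1 { print $1 "; " $NF; next }  # two or more fields: keep first and last
           { print }                    # no separator on the line: print as-is
  ' "$@"
}

# Quick demonstration on two of the sample lines from the question.
printf '%s\n' \
  'Eukaryota; Alveolata; Ciliophora; Intramacronucleata; Paramecium#' \
  'Eukaryota; Viridiplantae#' | first_last
```

With the whole file you would call it as first_last fileIn > fileOut. Compared with the gsub approach, this version depends on the separator being consistent, but it makes it easy to pick other fields (for example $2 or $(NF-1)) later.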