How do I match multiple patterns in a specific column?

I was wondering if there would be a more efficient way to use awk / grep / sed to solve the following problem?

I would like to parse a specific column of my input file (column 1 in this example) and use awk / grep / any other tool to select only the rows whose value in that column matches a set of patterns. For example, given the file below:

chr1    3009844 3009908 DXX 42  -
chr2    3000386 3000450 DXX 15  -
chr3    3000386 3000450 DXX 15  -
chr4    3000386 3000450 DXX 15  -
chr5    3000386 3000450 DXX 15  -
chr6    3000386 3000450 DXX 15  -
chr7    3000386 3000450 DXX 15  -
chr8    3000386 3000450 DXX 15  -
chr9    3000386 3000450 DXX 15  -
chr10   3000386 3000450 DXX 15  -
chr11   3000386 3000450 DXX 15  -
chr12   3000386 3000450 DXX 15  -
chr13   3000386 3000450 DXX 15  -
chr14   3000386 3000450 DXX 15  -
chr15   3000386 3000450 DXX 15  -
chr16   3000386 3000450 DXX 15  -
chr17   3000386 3000450 DXX 15  -
chr18   3000386 3000450 DXX 15  -
chr19   3000386 3000450 DXX 15  -
chrX    3000386 3000450 DXX 15  -
chrY    3000386 3000450 DXX 15  -
chr1_GL456210_random    3000386 3000450 DXX 15  -
chr1_GL456211_random    3000386 3000450 DXX 15  -
chr1_GL456212_random    3000386 3000450 DXX 15  -
chr1_GL456221_random    3000386 3000450 DXX 15  -
chr4_GL456216_random    3000386 3000450 DXX 15  -
chr4_JH584292_random    3000386 3000450 DXX 15  -
chr4_JH584295_random    3000386 3000450 DXX 15  -
chr5_GL456354_random    3000386 3000450 DXX 15  -
chr5_JH584296_random    3000386 3000450 DXX 15  -
chr5_JH584297_random    3000386 3000450 DXX 15  -
chr5_JH584299_random    3000386 3000450 DXX 15  -
chrX_GL456233_random    3000386 3000450 DXX 15  -

      

I would just like to get an output that only contains chr1-chr22, chrX and chrY in the first column:

chr1    3009844 3009908 DXX 42  -
chr2    3000386 3000450 DXX 15  -
chr3    3000386 3000450 DXX 15  -
chr4    3000386 3000450 DXX 15  -
chr5    3000386 3000450 DXX 15  -
chr6    3000386 3000450 DXX 15  -
chr7    3000386 3000450 DXX 15  -
chr8    3000386 3000450 DXX 15  -
chr9    3000386 3000450 DXX 15  -
chr10   3000386 3000450 DXX 15  -
chr11   3000386 3000450 DXX 15  -
chr12   3000386 3000450 DXX 15  -
chr13   3000386 3000450 DXX 15  -
chr14   3000386 3000450 DXX 15  -
chr15   3000386 3000450 DXX 15  -
chr16   3000386 3000450 DXX 15  -
chr17   3000386 3000450 DXX 15  -
chr18   3000386 3000450 DXX 15  -
chr19   3000386 3000450 DXX 15  -
chrX    3000386 3000450 DXX 15  -
chrY    3000386 3000450 DXX 15  -

      

I was able to find a solution using the following command:

awk '$1 == "chr1" || $1 == "chr2" || $1 == "chr3" || $1 == "chr4" || $1 == "chr5" || $1 == "chr6" || $1 == "chr7" || $1 == "chr8" || $1 == "chr9" || $1 == "chr10" || $1 == "chr11" || $1 == "chr12" || $1 == "chr13" || $1 == "chr14" || $1 == "chr15" || $1 == "chr16" || $1 == "chr17" || $1 == "chr18" || $1 == "chr19" || $1 == "chr20" || $1 == "chr21" || $1 == "chr22" || $1 == "chrX" || $1 == "chrY"'  in_file > out_file

      

This works, but I was wondering whether the more experienced members know a more elegant way to solve the problem. If you could also point me to a resource for learning awk / grep on Linux, that would be very helpful!

+3




5 answers


Use a regular expression:

awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/' file

      

This uses $1 ~ /^pattern$/ to select the lines whose first field is exactly pattern (note ^ for the beginning and $ for the end).

The pattern has the form chr(..|..|..), which means: match chr, followed by any one of the |-separated alternatives inside ().



These alternatives are:

  • a single digit, optionally preceded by 1, covering 0 to 19 ( 1?[0-9] )
  • 2 followed by 0, 1, or 2, covering 20 to 22 ( 2[0-2] )
  • X
  • Y

An annotated demo of the regex is available here: https://regex101.com/r/gH1kS4/2
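To see the anchoring at work, here is a minimal sketch that runs the same regex over a handful of made-up first-column values. The sample names (including chr23 and chrUn) and the variable names are assumptions for illustration only, not taken from the question's file.

```shell
# Hypothetical first-column values: three should pass, three should not.
sample=$(printf '%s\t3000386\t3000450\n' chr1 chr19 chrX chr1_GL456210_random chr23 chrUn)

# Keep rows whose first field is exactly "chr" plus one of the alternatives,
# then print only the name and join the results with commas.
kept=$(printf '%s\n' "$sample" \
  | awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/ { print $1 }' \
  | paste -sd, -)

echo "$kept"
```

This should print chr1,chr19,chrX. Because of the $ anchor, chr23 and the _random names are rejected rather than matched on a prefix. Note that 1?[0-9] would also accept chr0, which only matters if such a name can occur in your data.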

+3




If you want something a little easier to maintain (for example when editing or adding new strings / patterns to match), and also something clearer, especially if you're just getting started with regular expressions, use the form grep -f match.list input.txt.

Create a file with the patterns you want to match (match.list):

^chr[1-9][[:space:]]\|      # this matches chr1-chr9
^chr1[0-9][[:space:]]\|     # this matches chr10-chr19
^chr2[0-2][[:space:]]\|     # this matches chr20-chr22
^chr[XY][[:space:]]\|       # this matches chrX and chrY
new_string_or_pattern\|     # ... your new pattern ...

      

Then just call grep like this:

grep -f match.list input.txt

      



As you can see above, you can even add comments to the pattern list using the \| trick (ending each pattern with \|), so that you can remember what you did yesterday or where you found the regexp. This works because GNU grep treats \| as alternation, so the comment text becomes a second alternative that will not occur in real data. You can add new fixed strings or patterns just by adding new lines. Also, if you find it difficult to write a complex regex, you can simply create a pattern file with the simple strings you want to match:

^chrX
^chrY
...

      

Another advantage of this approach is that you can maintain multiple pattern files representing different subqueries that you might need on a daily basis. For example:

grep -f chromosomes_n input.txt
grep -f chromosomes_xy input.txt
grep -f chromosomes_random input.txt

      

The only downside to this approach is that grep will run slower if you add more than a dozen patterns to each file. But this will only be a problem if your input file contains hundreds of thousands of lines.
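The pattern-file approach can be tried end to end with a small sketch like the one below. It uses plain one-pattern-per-line entries without the \| comment trick, so it does not depend on GNU-specific alternation. The file name chromosomes.patterns, the temporary directory, and the sample rows are assumptions made up for this example.

```shell
workdir=$(mktemp -d)

# Hypothetical pattern file: one regex per line, no comments.
cat > "$workdir/chromosomes.patterns" <<'EOF'
^chr[1-9][[:space:]]
^chr1[0-9][[:space:]]
^chr2[0-2][[:space:]]
^chr[XY][[:space:]]
EOF

# Hypothetical tab-separated input with one unplaced contig.
printf '%s\t3000386\t3000450\n' chr2 chr20 chrY chr4_JH584292_random > "$workdir/input.txt"

# A line is kept if it matches any pattern in the file.
matched=$(grep -f "$workdir/chromosomes.patterns" "$workdir/input.txt" | cut -f1 | paste -sd, -)
echo "$matched"

rm -r "$workdir"
```

This should print chr2,chr20,chrY. The trailing [[:space:]] in each pattern is what keeps chr4_JH584292_random out, since the character after chr4 is an underscore rather than whitespace.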

+2




You can use this simplified regex with grep:

grep "^chr\(1\?[0-9]\|2[012]\|[XY]\)[[:space:]]" filename

      

The logic is in the parentheses \(..\):

  • 1\?[0-9] - matches 0-9, optionally preceded by 1
  • 2[012] - matches 2 followed by 0, 1, or 2
  • [XY] - matches X or Y
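The backslashed \? and \| forms are GNU extensions to basic regular expressions. As a hedged alternative, the same pattern can be written in extended syntax with grep -E, where ?, | and () need no backslashes. The sample names below are made up for illustration.

```shell
# Same pattern in extended regex syntax (-E), fed with hypothetical names.
names=$(printf '%s\t3000386\t3000450\n' chr9 chr10 chr22 chr23 chrX_GL456233_random chrY \
  | grep -E '^chr(1?[0-9]|2[012]|[XY])[[:space:]]' \
  | cut -f1 \
  | paste -sd, -)

echo "$names"
```

This should print chr9,chr10,chr22,chrY. The [[:space:]] after the group plays the role that $ plays in the awk version: it forces the name to end right after the matched alternative.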
+1




Given your posted example, all you need to get the desired output is either one of these (or other simple REs):

awk '$1 !~ /_/' file
awk '$1 ~ /^[[:alnum:]]+$/' file

      

so you may not need to list specific patterns at all, depending on your real-world requirements.
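Whether the shorter filter is enough depends on what else can appear in column 1. The sketch below illustrates the trade-off with a made-up name, chrM, which does not occur in the posted example: the underscore test keeps it, whereas an explicit chr1-chr22/X/Y regex would drop it.

```shell
# Hypothetical names: chrM is an assumption, not part of the posted file.
passed=$(printf '%s\t3000386\t3000450\n' chr1 chrM chr1_GL456210_random \
  | awk '$1 !~ /_/ { print $1 }' \
  | paste -sd, -)

echo "$passed"
```

This should print chr1,chrM, so the simple filter is a good fit only if every name without an underscore is one you want to keep.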

0




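The following will do the job for the posted example. Note that -w must not be used, because the underscore counts as a word character and 'random' would then never match names like chr1_GL456210_random:

grep -v 'random' file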

      

-1

