How do I match multiple patterns in a specific column?
I was wondering if there would be a more efficient way to use awk / grep / sed to solve the following problem?
I would like to filter on a specific column of my input file (column 1 in this example), using awk / grep / any other tool, and keep only the lines whose value in that column matches one of several patterns. For example, given the file below:
chr1 3009844 3009908 DXX 42 -
chr2 3000386 3000450 DXX 15 -
chr3 3000386 3000450 DXX 15 -
chr4 3000386 3000450 DXX 15 -
chr5 3000386 3000450 DXX 15 -
chr6 3000386 3000450 DXX 15 -
chr7 3000386 3000450 DXX 15 -
chr8 3000386 3000450 DXX 15 -
chr9 3000386 3000450 DXX 15 -
chr10 3000386 3000450 DXX 15 -
chr11 3000386 3000450 DXX 15 -
chr12 3000386 3000450 DXX 15 -
chr13 3000386 3000450 DXX 15 -
chr14 3000386 3000450 DXX 15 -
chr15 3000386 3000450 DXX 15 -
chr16 3000386 3000450 DXX 15 -
chr17 3000386 3000450 DXX 15 -
chr18 3000386 3000450 DXX 15 -
chr19 3000386 3000450 DXX 15 -
chrX 3000386 3000450 DXX 15 -
chrY 3000386 3000450 DXX 15 -
chr1_GL456210_random 3000386 3000450 DXX 15 -
chr1_GL456211_random 3000386 3000450 DXX 15 -
chr1_GL456212_random 3000386 3000450 DXX 15 -
chr1_GL456221_random 3000386 3000450 DXX 15 -
chr4_GL456216_random 3000386 3000450 DXX 15 -
chr4_JH584292_random 3000386 3000450 DXX 15 -
chr4_JH584295_random 3000386 3000450 DXX 15 -
chr5_GL456354_random 3000386 3000450 DXX 15 -
chr5_JH584296_random 3000386 3000450 DXX 15 -
chr5_JH584297_random 3000386 3000450 DXX 15 -
chr5_JH584299_random 3000386 3000450 DXX 15 -
chrX_GL456233_random 3000386 3000450 DXX 15 -
I would just like to get an output that only contains chr1-chr22, chrX and chrY in the first column:
chr1 3009844 3009908 DXX 42 -
chr2 3000386 3000450 DXX 15 -
chr3 3000386 3000450 DXX 15 -
chr4 3000386 3000450 DXX 15 -
chr5 3000386 3000450 DXX 15 -
chr6 3000386 3000450 DXX 15 -
chr7 3000386 3000450 DXX 15 -
chr8 3000386 3000450 DXX 15 -
chr9 3000386 3000450 DXX 15 -
chr10 3000386 3000450 DXX 15 -
chr11 3000386 3000450 DXX 15 -
chr12 3000386 3000450 DXX 15 -
chr13 3000386 3000450 DXX 15 -
chr14 3000386 3000450 DXX 15 -
chr15 3000386 3000450 DXX 15 -
chr16 3000386 3000450 DXX 15 -
chr17 3000386 3000450 DXX 15 -
chr18 3000386 3000450 DXX 15 -
chr19 3000386 3000450 DXX 15 -
chrX 3000386 3000450 DXX 15 -
chrY 3000386 3000450 DXX 15 -
I was able to find a solution using the following command:
awk '$1 == "chr1" || $1 == "chr2" || $1 == "chr3" || $1 == "chr4" || $1 == "chr5" || $1 == "chr6" || $1 == "chr7" || $1 == "chr8" || $1 == "chr9" || $1 == "chr10" || $1 == "chr11" || $1 == "chr12" || $1 == "chr13" || $1 == "chr14" || $1 == "chr15" || $1 == "chr16" || $1 == "chr17" || $1 == "chr18" || $1 == "chr19" || $1 == "chr20" || $1 == "chrX" || $1 == "chrY"' in_file > out_file
This works great, but I was wondering if the more experienced members would have a more elegant way to solve the problem? Or if you could point me to a resource for learning awk / grep on Linux, that would be very helpful!
Use a regular expression:
awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/' file
This uses $1 ~ /^pattern$/ to select the lines whose first field is exactly pattern (note ^ for the beginning and $ for the end).
The pattern has the form chr(..|..|..), which means: match chr, followed by any one of the |-separated alternatives within ().
These alternatives can be any of:
- a digit, optionally preceded by 1 (1?[0-9])
- a 2 followed by any of 0, 1, 2 (2[0-2])
- X
- Y
A demo with an automatically generated explanation: https://regex101.com/r/gH1kS4/2
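As a quick check of the pattern above, here is a minimal sketch that runs it on a few sample lines. The sample is a hypothetical subset of the question's data plus an invented chr0 line; the variable names and the stricter variant are illustrative assumptions, not part of the original answer. It shows that 1?[0-9] also accepts a bare 0, so a name like chr0 would slip through if it ever appeared in the input.

```shell
# Hypothetical sample: a few lines from the question plus an invented chr0 line.
sample='chr1 3009844 3009908 DXX 42 -
chr10 3000386 3000450 DXX 15 -
chr22 3000386 3000450 DXX 15 -
chrX 3000386 3000450 DXX 15 -
chr1_GL456210_random 3000386 3000450 DXX 15 -
chr0 3000386 3000450 DXX 15 -'

# Pattern from the answer: keeps chr1, chr10, chr22 and chrX, but also chr0.
loose=$(printf '%s\n' "$sample" | awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/')

# Stricter variant (an assumption about what is wanted): single digits start at 1.
strict=$(printf '%s\n' "$sample" | awk '$1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/')

# Show only the first column of each result.
printf 'loose:  %s\n' "$(printf '%s\n' "$loose" | cut -d' ' -f1 | paste -sd' ' -)"
printf 'strict: %s\n' "$(printf '%s\n' "$strict" | cut -d' ' -f1 | paste -sd' ' -)"
```

If the input never contains chr0, both patterns give the same result, so the shorter one from the answer is fine.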
If you want something a little easier to maintain (for example when editing or adding new strings / patterns to match), and also something clearer, especially if you're just getting started with regular expressions, use the form grep -f match.list input.txt:
Create a file with the patterns you want to match (match.list):
^chr[1-9][[:space:]]\| # this matches chr1-chr9
^chr1[0-9][[:space:]]\| # this matches chr10-chr19
^chr2[0-2][[:space:]]\| # this matches chr20-chr22
^chr[XY][[:space:]]\| # this matches chrX and chrY
new_string_or_pattern\| # ... your new pattern ...
Then just call grep like this:
grep -f match.list input.txt
As you can see above, you can even add comments to the list of patterns with a trick: end each pattern with \| so that you can remember what you did yesterday or where you found the regexp. This works because GNU grep treats \| as alternation, so the comment text simply becomes a second alternative that is unlikely to match any input line. You can add new fixed strings or patterns just by adding new lines. Also, if you find it difficult to write a complex regex, you can simply create a pattern file with the fixed strings you want to match, one per line:
^chrX
^chrY
...
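To make the grep -f approach concrete, here is a minimal end-to-end sketch. It assumes a throwaway directory and a small hypothetical input.txt (a subset of the question's data plus an invented chr20 line), and it leaves out the trailing comments so that it does not depend on how grep handles \|. Each line of match.list is one basic regular expression, and a line of input is printed if it matches any of them.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input (subset of the question's file, plus chr20).
cat > input.txt <<'EOF'
chr1 3009844 3009908 DXX 42 -
chr12 3000386 3000450 DXX 15 -
chr20 3000386 3000450 DXX 15 -
chrX 3000386 3000450 DXX 15 -
chr1_GL456210_random 3000386 3000450 DXX 15 -
chrX_GL456233_random 3000386 3000450 DXX 15 -
EOF

# One pattern per line; [[:space:]] requires the name to end right there,
# which excludes names such as chr1_GL456210_random.
cat > match.list <<'EOF'
^chr[1-9][[:space:]]
^chr1[0-9][[:space:]]
^chr2[0-2][[:space:]]
^chr[XY][[:space:]]
EOF

result=$(grep -f match.list input.txt)
printf '%s\n' "$result"
```

The trailing [[:space:]] is what keeps the _random entries out: a bare ^chrX would also match chrX_GL456233_random.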
Another advantage of this approach is that you can maintain multiple pattern files representing different subqueries that you might need on a daily basis. For example:
grep -f chromosomes_n input.txt
grep -f chromosomes_xy input.txt
grep -f chromosomes_random input.txt
The only downside to this approach is that grep will run slower if you add more than a dozen patterns to each file. But this will only be a problem if your input file contains hundreds of thousands of lines.