Match multiple patterns in any order (Perl)

I have a file that looks like this (but a LOT bigger):

arbstring1014: 120|PROKKA_00511 630|PROKKA_01218 630|PROKKA_01999 630|PROKKA_00506
arbstring1015: 120|PROKKA_02025 630|PROKKA_03113 120|PROKKA_02363 196|PROKKA_02308
arbstring1016: 120|PROKKA_02059 196|PROKKA_03630 630|PROKKA_03589 630|PROKKA_00462
arbstring1017: 120|PROKKA_02961 196|PROKKA_03061 630|PROKKA_03283 120|PROKKA_03099
arbstring1025: 120|PROKKA_02979 196|PROKKA_02928 630|PROKKA_03158
arbstring1026: 120|PROKKA_00924 196|PROKKA_00857 630|PROKKA_00906
arbstring1027: 120|PROKKA_02739 196|PROKKA_02684 630|PROKKA_02848
arbstring1028: 120|PROKKA_01415 196|PROKKA_01350 630|PROKKA_01503
arbstring1029: 120|PROKKA_03195 196|PROKKA_03175 630|PROKKA_03374
arbstring1030: 120|PROKKA_03050 196|PROKKA_03001 630|PROKKA_03230

      

I want to find lines that contain all of the following prefixes before "PROKKA_XXXXX":

120|
196|
630|

      

The following script will find them, but apparently only in the order in which they are written in the script (for example, it returns a line with 196|, 120|, 630|, but I know for a fact that there are lines with all three in a different order):

#!/usr/bin/perl -w
use strict;
use warnings;

#get genes that are present in all groups  from a groups.txt

#scans through output of orthomcl to get genes that are only core
open (IN,"<$ARGV[0]")  or die $!;

while (my $line = <IN>) {
#change the VS1 to match your unique phage ID add "& ($line =~m/VS11\|/)" to add more rules to match . will need 15 for 15 phage if ($line =~ m/196\|/gi && $line =~ m/120\|/gi && $line =~
m/630\|/gi)#(=~m/120\|/gi))#($line =~m/196\|/gi)

#if (/(?=.*re1)(?=.*re2)(?=.*re3)/s)

#& ($line =~m/630\|/) & ($line =~m/120\|/)   #& ($line =~m/IME1\|/) #&
#($line =~m/KBNP\|/) & ($line =~m/LUZ7\|/) & ($line =~m/PA26\|/) & ($line =~m/RLP1\|/) & ($line =~m/VC01\|/) &
#($line =~m/DSS3\|/)  & ($line =~m/EcP1\|/)  & ($line =~m/G7C\|/) & ($line =~m/JA1\|/) & ($line =~m/LIT1\|/) &
#($line =~m/N4\|/) & ($line =~m/pS6\|/) & ($line =~m/RPP1\|/) & ($line =~m/VBP3\|/) & ($line =~m/VBP4\|/) &
#($line =~m/058\|/)  &  ($line =~m/076\|/)  &  ($line =~m/JWA\|/)  &  ($line =~m/JWD\|/) & ($line =~m/PRES\|/)    { print $line ; } }

      

Any help with this would be brilliant, as I've already looked around a fair bit ...

+3




2 answers


I would suggest using lookaheads:

^
(?=.*120\|PROKKA_\d+)
(?=.*196\|PROKKA_\d+)
(?=.*630\|PROKKA_\d+)
.*

      



regex101.com demo

(The pattern is split across multiple lines for readability only.) Starting at the beginning of each line, it looks ahead for all three of your criteria: 120, 196, and 630. If all three are found, .* will match that line.
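A minimal sketch of how the pattern above might be used from Perl, assuming the /x flag so the multi-line layout can be kept as written. The sub name has_all_three is hypothetical, and the two sample lines are copied from the question.

```perl
use strict;
use warnings;

# The lookahead pattern from above; /x makes Perl ignore the
# whitespace and line breaks inside the pattern.
my $all_three = qr/
    ^
    (?=.*120\|PROKKA_\d+)
    (?=.*196\|PROKKA_\d+)
    (?=.*630\|PROKKA_\d+)
    .*
/x;

# Hypothetical helper: returns 1 if the line contains all three
# prefixes, in any order, and 0 otherwise.
sub has_all_three {
    my ($line) = @_;
    return $line =~ $all_three ? 1 : 0;
}

# Two lines from the question: the first has no 196| entry,
# the second has all three prefixes.
my @sample = (
    'arbstring1014: 120|PROKKA_00511 630|PROKKA_01218 630|PROKKA_01999 630|PROKKA_00506',
    'arbstring1016: 120|PROKKA_02059 196|PROKKA_03630 630|PROKKA_03589 630|PROKKA_00462',
);

for my $line (@sample) {
    print "$line\n" if has_all_three($line);
}
```

Because each lookahead starts from the beginning of the line, the order in which the three prefixes appear does not matter.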

+2




The code you pasted already contains the answer, and even explains it in a comment, except that the formatting is mangled.

What you put in:

while (my $line = <IN>) {
#change the VS1 to match your unique phage ID add "& ($line =~m/VS11\|/)" to add more rules to match . will need 15 for 15 phage if ($line =~ m/196\|/gi && $line =~ m/120\|/gi && $line =~
m/630\|/gi)#(=~m/120\|/gi))#($line =~m/196\|/gi)

      

doesn't make any sense. What appears to be intended looks something like this:



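while (my $line = <IN>) {
# change the numbers, which are phage IDs;
# e.g., to match your unique phage ID, say 196, add:
#   && ($line =~ m/196\|/)
# (no /g: in scalar context /g resumes each match where the previous
#  one ended, which makes the test depend on the order of the IDs)
  if ($line =~ m/196\|/i && $line =~ m/120\|/i && $line =~ m/630\|/i) {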

      

Next comes the code to execute if $line matches all three. After that code, the if and while statements should be closed:

  }
}

      

This can be made more readable, but we'll need a complete script for that.
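One way this might look, offered as a sketch rather than as the asker's actual script: keep the IDs in a list and require every one of them to match. The names @ids and has_all_ids are hypothetical, the \b anchor is an addition that is not in the original code, and the "reordered" sample line is made up for illustration. Note the absence of /g: in scalar context, /g makes each match resume from where the previous successful match on the same string ended, which would explain the order-dependent results described in the question.

```perl
use strict;
use warnings;

# Hypothetical list of IDs; extend it instead of chaining more && clauses.
my @ids = qw(120 196 630);

# Returns 1 only if every ID, followed by a literal "|", occurs in the line.
# No /g here, so each match is an independent search over the whole line.
# \b (an addition, not in the original) keeps 120| from matching inside 1120|.
sub has_all_ids {
    my ($line, @wanted) = @_;
    for my $id (@wanted) {
        return 0 unless $line =~ /\b\Q$id\E\|/;
    }
    return 1;
}

# Demonstration of the /g pitfall on a made-up line where the IDs
# appear in the order 630, 196, 120.
my $reordered = 'reordered: 630|PROKKA_00001 196|PROKKA_00002 120|PROKKA_00003';
my $with_g    = ($reordered =~ /196\|/g && $reordered =~ /120\|/g && $reordered =~ /630\|/g) ? 1 : 0;
my $without_g = has_all_ids($reordered, @ids);
print "with /g: $with_g, without /g: $without_g\n";

# Filter a few lines; the first is from the question and lacks 196|.
my @sample = (
    'arbstring1014: 120|PROKKA_00511 630|PROKKA_01218 630|PROKKA_01999 630|PROKKA_00506',
    $reordered,
);
for my $line (@sample) {
    print "$line\n" if has_all_ids($line, @ids);
}
```

To filter a real file, the loop over @sample could be replaced by a while loop over a filehandle, printing each line for which has_all_ids returns true.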

0

