Regular Expression Puzzle

In Visual Basic .NET:

  Dim result As Match = Regex.Match(aStr, aMatchStr)
  If result.Success Then
      Dim result0 As String = result.Groups(0).Value
      Dim result1 As String = result.Groups(1).Value
  End If

      

With aStr equal to (each space is a normal space, and there are seven spaces between "n" and "("):

"AMEVDIEERPK + 7 Oxidation       (M)"

      

Why does result1 become an empty string for aMatchStr equal to

"\s*(\d*).*?Oxidation\s+\(M\)"

      

but become "7" for aMatchStr equal to

"\s*(\d*)\s*Oxidation\s+\(M\)"

      

?

(result0 becomes "AMEVDIEERPK + 7 Oxidation       (M)".)

(This is from MSQuant, MascotResultParser.vb, function modificationParseMatch().)



8 answers


\s*          Zero or more whitespace characters

(\d*)        Zero or more digits (captured)

.*?          Any characters (non-greedy, so only as far as the next match)

Oxidation    Matches the literal word "Oxidation"

\s+\(M\)     One or more whitespace characters followed by "(M)"



The problem is that .*? matches zero or more of any character before the word "Oxidation", including any digits that the preceding \d* could have captured.

\s*(\d*)\s*Oxidation\s+\(M\)

The difference here is that only whitespace is matched immediately before Oxidation, not digits.

Change \d* to \d+ to capture the number.
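The three variants discussed above can be compared side by side. This is a minimal sketch in Python rather than Visual Basic, on the assumption that Python's re module handles these particular constructs (\s, \d, *, *?, +) the same way .NET's Regex does; the helper name first_group is made up for illustration.

```python
import re

# Input from the question: seven spaces between "Oxidation" and "(M)".
a_str = "AMEVDIEERPK + 7 Oxidation" + " " * 7 + "(M)"

def first_group(pattern, text):
    """Return (whole match, group 1) for the leftmost match, or None."""
    m = re.search(pattern, text)
    return (m.group(0), m.group(1)) if m else None

lazy = first_group(r"\s*(\d*).*?Oxidation\s+\(M\)", a_str)
strict = first_group(r"\s*(\d*)\s*Oxidation\s+\(M\)", a_str)
plus = first_group(r"\s*(\d+).*?Oxidation\s+\(M\)", a_str)

print(lazy[1] == "")    # True: match starts at "A", \d* takes nothing, .*? absorbs "AMEVDIEERPK + 7 "
print(strict[1] == "7") # True: the match can only start at the space before "7"
print(plus[1] == "7")   # True: \d+ refuses to match without at least one digit
```

Note that \d+ fixes this input only; it would reject a line that carries no count at all.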

+4




I think because the match starts at the first character and goes from there ...

For your first regex:

Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*).*?Oxidation\s+\(M\)"?  Yes... stop matching.

      

For your second regex:



Does "AMEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"?  No...
Does "MEVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"?  No...
Does "EVDIEERPK + 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"?  No...
...
Does " 7 Oxidation (M)" match "\s*(\d*)\s*Oxidation\s+\(M\)"?  Yes

      

If you used \d+ instead of \d* in the first regex, you would get a better result.

This is not exactly how regex works, but you get the idea.
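The "try each starting position in turn" picture can be imitated directly. This is only a model of the idea, not a description of how a real engine is implemented, and it is written in Python on the assumption that these patterns behave the same as in .NET; first_start is a hypothetical helper.

```python
import re

# Input from the question: seven spaces between "Oxidation" and "(M)".
a_str = "AMEVDIEERPK + 7 Oxidation" + " " * 7 + "(M)"

def first_start(pattern, text):
    """Try a match anchored at each offset in turn; return the first offset that succeeds."""
    rx = re.compile(pattern)
    for pos in range(len(text) + 1):
        if rx.match(text, pos):  # match() only succeeds if the pattern fits starting at pos
            return pos
    return None

start_lazy = first_start(r"\s*(\d*).*?Oxidation\s+\(M\)", a_str)
start_strict = first_start(r"\s*(\d*)\s*Oxidation\s+\(M\)", a_str)

print(start_lazy)    # 0: succeeds immediately at "A", before any digit is in reach
print(start_strict)  # 13: the space just before "7"
```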

+3




". *?" in this example will always match zero characters at first, since "*?" asks for the shortest possible match. As a result, since the thing right before the "O" is a space, "\ d *" can match 0 digits.

(Sorry for the spaces in the quotes, the autoformatter is eating my syntax.)

Ref: Quantifiers in Regular Expressions (MSDN)

+1




Thanks for the quick answers!

The number is omitted from the input when there is only one (peptide) modification instead of 7 as in the previous example, e.g.:

"AMEVDIEERPK + Oxidation (M)"

and there would be no match if "\d+" was used. But maybe I should use two regular expressions, one for each of these two cases. This would increase the complexity of the program somewhat (as I want to avoid memory garbage from constructing the regex for each line matched), but that is acceptable.

I really wanted the user to be able to write a rule without requiring it to match from the start of the (peptide) modification (which is why I tried to introduce a non-greedy match).

Currently the user rule is prefixed with "\s*(\d*)\s*", so the user must specify "Oxidation\s+\(M\)" to match. Specifying, for example, "dation\s+\(M\)" will not work.
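The setup described in this post can be sketched as follows: the fixed prefix is prepended to each user rule and the result is compiled once, then reused for every line. This is an illustration in Python, assuming the patterns behave as they do in .NET; PREFIX, compile_rule and count_for are hypothetical names, not taken from MSQuant.

```python
import re

PREFIX = r"\s*(\d*)\s*"  # prefix from the post, prepended to every user rule

def compile_rule(user_rule):
    """Build the full pattern once so it can be reused for every input line."""
    return re.compile(PREFIX + user_rule)

def count_for(rule, line):
    """Return the captured count as text ('' when absent), or None if nothing matches."""
    m = rule.search(line)
    return m.group(1) if m else None

full_rule = compile_rule(r"Oxidation\s+\(M\)")
partial_rule = compile_rule(r"dation\s+\(M\)")

with_count = "AMEVDIEERPK + 7 Oxidation" + " " * 7 + "(M)"
without_count = "AMEVDIEERPK + Oxidation (M)"

print(repr(count_for(full_rule, with_count)))     # '7'
print(repr(count_for(full_rule, without_count)))  # '': matched, no count present
print(repr(count_for(partial_rule, with_count)))  # '': matched from "dation", so the 7 is lost
```

In this sketch the partial rule does find a match, but the match begins at "dation", so the prefix has nothing to capture and the count comes back empty.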

+1




To reply to your second post: you (or your user) can specify \w*dation\s+\(M\) to match any of Oxidation (M), gradation (M), or dation (M).
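A quick check of this suggestion, combined with the "\s*(\d*)\s*" prefix mentioned earlier in the thread. It is a sketch in Python under the assumption that \w, \d and \s behave as in .NET for ASCII input; the sample lines with the "PEPTIDE" sequence are invented for illustration.

```python
import re

# Prefix from the thread plus the suggested user rule.
rule = re.compile(r"\s*(\d*)\s*" + r"\w*dation\s+\(M\)")

def count_for(line):
    """Return the captured count as text ('' when absent), or None if nothing matches."""
    m = rule.search(line)
    return m.group(1) if m else None

print(repr(count_for("PEPTIDE + 7 Oxidation (M)")))  # '7'
print(repr(count_for("PEPTIDE + 2 gradation (M)")))  # '2'
print(repr(count_for("PEPTIDE + dation (M)")))       # ''
```

Because \w* cannot cross whitespace, the match is forced to begin near the count instead of somewhere inside the word, which is what keeps the digits in group 1.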

+1




With the update on the syntax, it seems that we don't need to worry about the difference between \d+ and \d*. The + sign is always present, even if there are no numbers. Matching this + anchors the regex so that it works as expected:

"\s*    // whitespace before +
 \+     // The + sign itself
 \s*    // whitespace after +
 (\d*)  // optional digits
 .*?    // any non-digit between the last digit and Oxidation (M)
 Oxidation\s+\(M\)"

      

Since the "+" character must be matched first and must be matched exactly once, the AMEVDIEERPK prefix cannot be consumed by .*?.
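The commented pattern above can be written almost verbatim using a verbose-mode regex. This is a sketch in Python (re.VERBOSE, with # comments in place of //), assuming the pattern behaves as it would in .NET; PLUS_RULE and count_for are illustrative names.

```python
import re

PLUS_RULE = re.compile(r"""
    \s*      # whitespace before +
    \+       # the + sign itself
    \s*      # whitespace after +
    (\d*)    # optional digits
    .*?      # anything between the digits and Oxidation
    Oxidation\s+\(M\)
""", re.VERBOSE)

def count_for(line):
    """Return the captured count as text ('' when absent), or None if nothing matches."""
    m = PLUS_RULE.search(line)
    return m.group(1) if m else None

with_count = "AMEVDIEERPK + 7 Oxidation" + " " * 7 + "(M)"
without_count = "AMEVDIEERPK + Oxidation (M)"

print(repr(count_for(with_count)))     # '7'
print(repr(count_for(without_count)))  # ''
```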

+1




I decided to use \w* for now. The user will be required to match any whitespace explicitly, but it covers most of the cases for this particular application and how it is commonly used.

So, for example, this is a regex:

\s*(\d*)\s*\w*Oxidation\s+\(M\)

      

+1




I'm sorry, there is more syntax ...

You cannot rely on the plus sign. It separates (peptide) sequences and (peptide) modifications. There can be more than one modification for each sequence. Sample with two modifications (there are 7 spaces between "2" and "L"):

"KLIDLTQFPAFVTPMGK + Oxidation (M); 2       Lysine-13C615N2 (K-full)"

The user can specify "\S+\s+\(K-full\)" for the second modification and get "2".
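To illustrate why the plus sign is not a reliable anchor, the sketch below runs both approaches against the two-modification sample. It is written in Python on the assumption that the patterns behave as in .NET; the plus-anchored pattern for the lysine label is my own adaptation of the earlier suggestion, and the variable names are illustrative.

```python
import re

# Sample from the post: seven spaces between "2" and "L".
line = "KLIDLTQFPAFVTPMGK + Oxidation (M); 2" + " " * 7 + "Lysine-13C615N2 (K-full)"

# Prefix used by the application, followed by the user rule from the post.
prefix_rule = re.compile(r"\s*(\d*)\s*" + r"\S+\s+\(K-full\)")

# Plus-anchored variant (adapted here for the second modification).
plus_rule = re.compile(r"\s*\+\s*(\d*).*?Lysine-13C615N2\s+\(K-full\)")

prefix_count = prefix_rule.search(line).group(1)
plus_count = plus_rule.search(line).group(1)

print(repr(prefix_count))  # '2'
print(repr(plus_count))    # '': the + belongs to the first modification, so the 2 is skipped by .*?
```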

Here are some more lines (after the plus sign):

"Phospho (ST), 2 Dimethyl (K), Dimethyl (N-member)"

"Phospho (ST), 2 Dimethyl: 2H (4) (K), dimethyl: 2H (4) (N-member)"

"N-Acetyl (protein)"

"2 Dimethyl: 2H (4) (K), dimethyl: 2H (4) (N-member)"

"N-acetyl (protein), 2 lysine-13C615N2 (K-complete)"

"Oxidation (M), N-acetyl (protein)"

"Oxidation (M), N-acetyl (protein), lysine-13C615N2 (K-complete)"

"N-acetyl (protein), lysine-13C615N2 (K-complete)"

"Oxidation (M), lysine-13C615N2 (K-complete)"

"Oxidation (M)"

"2 Oxidation (M); Lysine-13C615N2 (K-full)"

An example file with custom rules can be found at (packed in 7-zip format):

<http://www.pil.sdu.dk/1/MSQuant/CEBIquantModes,2008-11-10.7z>

+1








