Python, a comprehensive regular expression parser

So I'm considering parsing code using regular expressions, and I'm wondering if there is an easier way to do this than what I have so far. I'll start with an example of a line that I would parse through:

T16F161A286161990200040000\r

(Data is transferred via serial device)

Now, first I need to check the verification code, which is the first 9 characters of the line. They should be exactly T16F161A2. If these 9 characters match, then I need to check the next 3 characters, which should be either 861 or 37F.

If these 3 characters are 37F, I have something that I still need to code, so we won't bother with that result.

However, if these 3 characters are 861, I need to check the 2 characters after them and see what they are. They can be 11, 14, 60, 61, F0, F1, or F2. Each of them does different things with the data that follows it.

Finally, I need to loop through the remaining characters, grouping them in pairs.

As an example of how this works, here is the code I put together to parse the example line posted above:

import re

test_string = "T16F161A286161990200040000\r"

if re.match('^T16F161A2.*', test_string):
    print("Match: ", test_string)
    test_string = re.sub('^T16F161A2', '', test_string)
    if re.match('^861.*', test_string):
        print("Found '861': ", test_string)
        test_string = re.sub('^861', '', test_string)
        if re.match('^61.*', test_string):
            print("Found '61' : ", test_string)
            test_string = re.sub('^61', '', test_string)
            for i in range(6):
                if re.match('^[0-9A-F]{2}', test_string):
                    temp = re.match('^[0-9A-F]{2}', test_string).group()
                    print("Found Code: ", temp)
                test_string = re.sub('^[0-9A-F]{2}', '', test_string)

      

Now, as you can see in this code, after each step I use re.sub() to delete the part of the line that I was just looking for. With this in mind, my question is the following:

Is there a way to parse the string and find the data I want while keeping the string intact? Would it be more or less efficient than what I have now?

+3




7 replies


You don't need a regex for this task; you can use if / else blocks and a few substring operations:

test_string = "T16F161A286161990200040000\r"

def process(token):
  # does a few things with 11, 14, 60, 61, F0, F1, or F2
  return

def stringToArray(token):
  # split the token into two-character chunks
  return [token[i:i+2] for i in range(0, len(token), 2)]



# strip the trailing "\r" so it does not end up as an extra chunk
test_string = test_string.strip()

if not test_string.startswith('T16F161A2'):
  print("Does not match")
  quit()
else:
  print("Does match")

tempToken = test_string[9:]

if tempToken.startswith('861'):
  process(tempToken)  # does stuff with 11, 14, 60, 61, F0, F1, or F2
  tempToken = tempToken[5:]  # skip '861' and the two-character code

  print(stringToArray(tempToken))
else:
  pass

      




+2




I would recommend (because you know the size of the string) instead:

  • Test the first 9 characters by comparing test_string[:9] == "T16F161A2"

I would do the same for the second phase (test_string[9:12]). This comparison is much faster than a regular expression.



When the sizes are known, you can slice your string as I did above. This will not "destroy" your string the way your current code does. If you still want a regex, apply it to the slice, e.g. re.search(pattern, test_string[9:12]).

Hope this helps you at least a little. :)
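
As a rough sketch of the suggestion above (the helper name check_header is made up for illustration), the fixed-position fields can be compared with slices, and a regular expression can still be applied to a slice when a pattern is really needed. The original string is never modified:

```python
import re

test_string = "T16F161A286161990200040000\r"

def check_header(line):
    """Return the 3-character type field if the 9-character code matches, else None.

    Hypothetical helper; slicing creates new substrings and leaves `line` untouched.
    """
    if line[:9] != "T16F161A2":
        return None
    kind = line[9:12]
    # A regex can still be applied to just this slice if a pattern is needed.
    if re.fullmatch(r"861|37F", kind):
        return kind
    return None

print(check_header(test_string))
print(repr(test_string))  # still the full, unmodified line
```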

0




Assuming the string is the same length every time and the data is at the same index, you can simply use the string slice operator []. To get the first 9 characters you should use: test_string[:9]

You can set them as variables and make checking easier:

confirmation_code = test_string[:9]
nextThree = test_string[9:12]
#check values

      

Slicing is built into Python, so it is quite efficient.
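
If the field positions are reused in several places, one option is to give them names with slice objects. This is only a sketch; the names CODE, KIND and SUBTYPE are assumptions and not part of the question:

```python
# Field positions as named slice objects (names are illustrative assumptions).
CODE = slice(0, 9)       # verification code
KIND = slice(9, 12)      # "861" or "37F"
SUBTYPE = slice(12, 14)  # two-character subtype code

test_string = "T16F161A286161990200040000\r"

confirmation_code = test_string[CODE]
next_three = test_string[KIND]
subtype = test_string[SUBTYPE]

print(confirmation_code, next_three, subtype)
```

Keeping the offsets in one place means a change in the line layout only has to be made once.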

0




If you want to stick with a regular expression, then this will do:

import re
pattern = re.compile(r'^T16F161A2((861)|37F)(?(2)(11|14|60|61|F0|F1|F2)|[0-9A-F]{2})([0-9A-F]{12})$')
# strip the trailing "\r" first, otherwise "$" cannot match
match_result = pattern.match(test_string.strip())

      

In this case, you can check whether match_result is a valid match object (if it is None, the pattern did not match). The match object will contain 4 groups:

  • the first 3 characters (861 or 37F)
  • useless data (ignore this)
  • the 2-character code if the first group is 861 (None otherwise)
  • the last 12 digits

To split the last 12 digits into pairs with a one-liner:

last_12_digits = match_result.group(4)
last_digits = [last_12_digits[i:i+2] for i in range(0, len(last_12_digits), 2)]
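
Putting the pieces together, a minimal sketch might look like the following. The helper name parse_line and its return shape are assumptions made for illustration; the pattern is the one from this answer:

```python
import re

# Same pattern as in the answer; group 4 holds the 12 trailing hex digits.
PATTERN = re.compile(
    r'^T16F161A2((861)|37F)(?(2)(11|14|60|61|F0|F1|F2)|[0-9A-F]{2})([0-9A-F]{12})$'
)

def parse_line(line):
    """Hypothetical helper: return (kind, subtype, pairs), or None if no match."""
    m = PATTERN.match(line.strip())  # strip the trailing "\r" before matching
    if m is None:
        return None
    payload = m.group(4)
    pairs = [payload[i:i+2] for i in range(0, len(payload), 2)]
    # m.group(3) is None when the line is of the 37F type.
    return m.group(1), m.group(3), pairs

print(parse_line("T16F161A286161990200040000\r"))
```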

      

0




You don't really need regular expressions. Since you know exactly what you are looking for and where to find it in the string, you can just use slicing and a couple of if / elif / else statements. Something like this:

s = test_string.strip()
code, x, y, rest = s[:9], s[9:12], s[12:14], [s[i:i+2] for i in range(14, len(s), 2)]
# T16F161A2, 861, 61, ['99', '02', '00', '04', '00', '00']

if code == "T16F161A2":
    if x == "37F":
        ...  # not handled yet
    elif x == "861":
        if y == "11":
            ...
        elif y == "61":
            ...  # do stuff with rest
    else:
        ...  # invalid
else:
    ...  # invalid
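
With seven possible two-character codes, the if / elif chain can get long. One possible alternative, sketched here with placeholder handlers (handle_11, handle_61 and dispatch are hypothetical names, and the real per-code behaviour is not given in the question), is a dictionary that maps each code to a function:

```python
def handle_11(rest):
    # Placeholder handler (assumption): just tag the payload pairs.
    return ("11", rest)

def handle_61(rest):
    # Placeholder handler (assumption): just tag the payload pairs.
    return ("61", rest)

# Map each subtype code to its handler; only two are filled in here.
HANDLERS = {"11": handle_11, "61": handle_61}

def dispatch(line):
    s = line.strip()
    code, x, y = s[:9], s[9:12], s[12:14]
    rest = [s[i:i+2] for i in range(14, len(s), 2)]
    if code != "T16F161A2" or x != "861":
        return None  # invalid, or the 37F case that is not handled here
    handler = HANDLERS.get(y)
    return handler(rest) if handler else None

print(dispatch("T16F161A286161990200040000\r"))
```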

      

0




Perhaps something like:

import re

regex = r'^T16F161A2(861|37F)(11|14|60|61|F0|F1|F2)(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})$'
string = 'T16F161A286161990200040000'

print(re.match(regex, string).groups())

      

This uses capture groups and avoids the need to create a bunch of new strings.
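
A possible variant, sketched with named groups so the fields can be read by name instead of by position (the group names kind, subtype and payload are assumptions), and with re.findall to split the payload into pairs:

```python
import re

# Variant of the pattern above with named groups (names are assumptions).
regex = re.compile(
    r'^T16F161A2(?P<kind>861|37F)(?P<subtype>11|14|60|61|F0|F1|F2)'
    r'(?P<payload>(?:[0-9A-F]{2}){6})$'
)

m = regex.match('T16F161A286161990200040000')
if m:
    fields = m.groupdict()
    # Split the payload into two-character codes.
    fields['pairs'] = re.findall(r'[0-9A-F]{2}', fields['payload'])
    print(fields)
```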

0




The re module won't be as efficient as direct substring access, but it can save you from writing (and maintaining) some lines of code. If you want to use it, though, you must match the line as a whole:

import re

test_string = "T16F161A286161990200040000\r"

rx = re.compile(r'T16F161A2(?:(?:(37F)(.*))|(?:(861)(11|14|60|61|F0|F1|F2)(.*)))\r')
m = rx.match(test_string)      # => 5 groups, first 2 if 37F, last 3 if 861

if m is None:                  # string does not match:
    ...
elif m.group(1) is None:       # 861 type
    subtype = m.group(4)       # extract subtype
    # and group remaining characters by pairs
    elts = [ m.group(5)[i:i+2] for i in range(0, len(m.group(5)), 2) ]
    ...                        # process that
else:                          # 37F type
    ...
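
To get a feel for the efficiency difference on your own machine, a small timeit comparison along these lines could be used. It is only a sketch: the two functions with_re and with_slices are illustrative, they check just the T16F161A2 and 861 prefix, and the timings will vary with the machine and Python version:

```python
import re
import timeit

test_string = "T16F161A286161990200040000\r"
rx = re.compile(r'T16F161A2861')

def with_re():
    # Prefix check using a precompiled regular expression.
    return rx.match(test_string) is not None

def with_slices():
    # The same prefix check using plain slicing and comparison.
    return test_string[:9] == "T16F161A2" and test_string[9:12] == "861"

# Both checks should agree before their timings are compared.
assert with_re() == with_slices()
print("re:    ", timeit.timeit(with_re, number=100_000))
print("slices:", timeit.timeit(with_slices, number=100_000))
```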

      

0








