Strange behavior with regular expressions

Question

I am writing a program that tokenizes assembly source code, but I have run into a strange problem.

Sometimes the code works as expected and sometimes it doesn't!

Here is the code (the variable names are in Portuguese; the comments give the translations):

import re

def tokenize(code):
    tokens = []

    tokens_re = {
    'comentarios'  : '(//.*)',                         # comments
    'linhas'       : '(\n)',                           # lines
    'instrucoes'   : '(add)',                          # instructions
    'numeros_hex'  : '([-+]?0x[0-9a-fA-F]+)',          # hex numbers
    'numeros_bin'  : '([-+]?0b[0-1]+)',                # binary numbers
    'numeros_dec'  : '([-+]?[0-9]+)'}                  # decimal numbers

    #'reg32'        : 'eax|ebx|ecx|edx|esp|ebp|eip|esi',
    #'reg16'        : 'ax|bx|cx|dx|sp|bp|ip|si',
    #'reg8'         : 'ah|al|bh|bl|ch|cl|dh|dl'}

    pattern = re.compile('|'.join(list(tokens_re.values())))
    scan = pattern.scanner(code)

    while 1:
        m = scan.search()
        if not m:
            break

        tipo = list(tokens_re.keys())[m.lastindex-1]     # type
        valor = repr(m.group(m.lastindex))               # value

        if tipo == 'linhas':
            print('')

        else:
            print(tipo, valor)

    return tokens



code = '''
add eax, 5 //haha
add ebx, -5
add eax, 1234
add ebx, 1234
add ax, 0b101
add bx, -0b101
add al, -0x5
add ah, 0x5
'''

print(tokenize(code))

And here's the expected output:

instrucoes 'add'
numeros_dec '5'
comentarios '//haha'

instrucoes 'add'
numeros_dec '-5'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_bin '0b101'

instrucoes 'add'
numeros_bin '-0b101'

instrucoes 'add'
numeros_hex '-0x5'

instrucoes 'add'
numeros_hex '0x5'

The problem is that, without any change to the code, some runs give the expected result and others produce this:

instrucoes 'add'
numeros_dec '5'
comentarios '//haha'

instrucoes 'add'
numeros_dec '-5'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '1234'

instrucoes 'add'
numeros_dec '0'
numeros_dec '101'

instrucoes 'add'
numeros_dec '-0'
numeros_dec '101'

instrucoes 'add'
numeros_dec '-0'
numeros_dec '5'

instrucoes 'add'
numeros_dec '0'
numeros_dec '5'

Where is the problem?

python python-3.x regex tokenize

TiberSeptim May 25 '15 at 12:27

1 answer

Stefan Pochmann · Accepted Answer · 2015-05-25T12:35:11+0000

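You are building your regex from a dictionary. Dictionaries are not ordered (only Python 3.7 and later guarantee insertion order), and because string hashes are randomized per process, the order of tokens_re.values() can change from one run to the next. The regex engine tries the alternatives from left to right, so whenever numeros_dec ends up before numeros_hex or numeros_bin, a literal such as 0x5 is matched as '0' followed by '5'.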
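
To see why the order of the alternatives matters, here is a minimal sketch that reuses the hex and decimal patterns from the question. The helper name matches and the constants HEX and DEC are made up for this illustration.

```python
import re

# The two number patterns from the question (hypothetical constant names).
HEX = r'([-+]?0x[0-9a-fA-F]+)'
DEC = r'([-+]?[0-9]+)'

def matches(patterns, text):
    """Join the patterns with '|' in the given order and return all matches."""
    pattern = re.compile('|'.join(patterns))
    return [m.group(0) for m in pattern.finditer(text)]

hex_first = matches([HEX, DEC], 'add al, -0x5')
dec_first = matches([DEC, HEX], 'add al, -0x5')

print(hex_first)  # ['-0x5']
print(dec_first)  # ['-0', '5']
```

With the decimal pattern first, '-0' already satisfies it, so the engine never tries the hex alternative at that position. That is the same split the unexpected output shows.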

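If you want "stable" results, I suggest you use sorted(tokens_re.values()) or keep the patterns in a list / tuple rather than a dictionary. Note that sorting only makes the order repeatable: the lookup list(tokens_re.keys())[m.lastindex-1] would also have to follow the sorted order, so an explicit list is the safer choice.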

For example, you can specify them as a list of pairs, then use that list to build the pattern and, if you still need it, a dictionary:

tokens_re = [
    ('comentarios', '(//.*)'),                         # comments
    ('linhas',      '(\n)'),                           # lines
    ('instrucoes',  '(add)'),                          # instructions
    ('numeros_hex', '([-+]?0x[0-9a-fA-F]+)'),          # hex numbers
    ('numeros_bin', '([-+]?0b[0-1]+)'),                # binary numbers
    ('numeros_dec', '([-+]?[0-9]+)'),                  # decimal numbers
]
pattern = re.compile('|'.join(p for _, p in tokens_re))
tokens_re = dict(tokens_re)
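
Putting the list-of-pairs idea together, a complete tokenizer could look roughly like the sketch below. It is one possible arrangement, not the only one: the upper-case names TOKENS_RE and PATTERN, the use of finditer instead of the scanner, and returning (type, value) pairs instead of printing are choices made for this example. The type name is looked up in the list itself, so the result does not depend on dictionary ordering at all.

```python
import re

# Order matters: hex and binary must come before plain decimal.
TOKENS_RE = [
    ('comentarios', r'(//.*)'),                  # comments
    ('linhas',      r'(\n)'),                    # lines
    ('instrucoes',  r'(add)'),                   # instructions
    ('numeros_hex', r'([-+]?0x[0-9a-fA-F]+)'),   # hex numbers
    ('numeros_bin', r'([-+]?0b[0-1]+)'),         # binary numbers
    ('numeros_dec', r'([-+]?[0-9]+)'),           # decimal numbers
]
PATTERN = re.compile('|'.join(p for _, p in TOKENS_RE))

def tokenize(code):
    """Return a list of (type, value) pairs; text that matches nothing is skipped."""
    tokens = []
    for m in PATTERN.finditer(code):
        # Each alternative has exactly one group, so lastindex - 1
        # is the position of the matching entry in TOKENS_RE.
        tipo = TOKENS_RE[m.lastindex - 1][0]     # type
        valor = m.group(m.lastindex)             # value
        tokens.append((tipo, valor))
    return tokens

tokens = tokenize('add ax, 0b101\nadd al, -0x5 //haha\n')
for tipo, valor in tokens:
    print(tipo, repr(valor))
```

The re module also supports named groups such as (?P<numeros_hex>...), in which case m.lastgroup gives the type name directly and no index lookup is needed.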
