Strange behavior with regular expressions
I am writing a program that tokenizes assembly source code, but I have run into a strange problem.
Sometimes the code works as expected and sometimes it doesn't!
Here is the code (the variable names are in Portuguese, so I added translations in the comments):
import re
def tokenize(code):
    tokens = []
    tokens_re = {
        'comentarios' : '(//.*)', # comments
        'linhas' : '(\n)', # lines
        'instrucoes' : '(add)', # instructions
        'numeros_hex' : '([-+]?0x[0-9a-fA-F]+)', # hex numbers
        'numeros_bin' : '([-+]?0b[0-1]+)', # binary numbers
        'numeros_dec' : '([-+]?[0-9]+)'} # decimal numbers
        #'reg32' : 'eax|ebx|ecx|edx|esp|ebp|eip|esi',
        #'reg16' : 'ax|bx|cx|dx|sp|bp|ip|si',
        #'reg8' : 'ah|al|bh|bl|ch|cl|dh|dl'}
    pattern = re.compile('|'.join(list(tokens_re.values())))
    scan = pattern.scanner(code)
    while 1:
        m = scan.search()
        if not m:
            break
        tipo = list(tokens_re.keys())[m.lastindex-1] # type
        valor = repr(m.group(m.lastindex)) # value
        if tipo == 'linhas':
            print('')
        else:
            print(tipo, valor)
    return tokens
code = '''
add eax, 5 //haha
add ebx, -5
add eax, 1234
add ebx, 1234
add ax, 0b101
add bx, -0b101
add al, -0x5
add ah, 0x5
'''
print(tokenize(code))
And here's the expected output:
instrucoes 'add'
numeros_dec '5'
comentarios '//haha'
instrucoes 'add'
numeros_dec '-5'
instrucoes 'add'
numeros_dec '1234'
instrucoes 'add'
numeros_dec '1234'
instrucoes 'add'
numeros_bin '0b101'
instrucoes 'add'
numeros_bin '-0b101'
instrucoes 'add'
numeros_hex '-0x5'
instrucoes 'add'
numeros_hex '0x5'
The problem is that, without any change to the code, it sometimes gives the expected result and sometimes gives this:
instrucoes 'add'
numeros_dec '5'
comentarios '//haha'
instrucoes 'add'
numeros_dec '-5'
instrucoes 'add'
numeros_dec '1234'
instrucoes 'add'
numeros_dec '1234'
instrucoes 'add'
numeros_dec '0'
numeros_dec '101'
instrucoes 'add'
numeros_dec '-0'
numeros_dec '101'
instrucoes 'add'
numeros_dec '-0'
numeros_dec '5'
instrucoes 'add'
numeros_dec '0'
numeros_dec '5'
Where is the problem?
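You are building your regex from a dictionary. Dictionaries are not ordered (before Python 3.7 the iteration order is arbitrary, and with string hash randomization it can change from one run to the next), so the alternatives in the pattern can end up in a different order each time. Whenever numeros_dec lands before numeros_hex or numeros_bin, the decimal pattern matches the leading 0 first, which is exactly what your second output shows.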
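To see why the order of the alternatives matters, here is a minimal sketch that joins the question's hex and decimal patterns in both orders. The helper name first_match is made up for this illustration and is not part of the question's code.

```python
import re

HEX = r'([-+]?0x[0-9a-fA-F]+)'
DEC = r'([-+]?[0-9]+)'

def first_match(patterns, text):
    """Join the patterns with '|' and return the text of the first match."""
    return re.compile('|'.join(patterns)).search(text).group(0)

# Hex is tried before decimal: the whole literal is matched.
print(first_match([HEX, DEC], 'add al, -0x5'))  # -0x5

# Decimal is tried before hex: it succeeds on '-0' and stops at the 'x'.
print(first_match([DEC, HEX], 'add al, -0x5'))  # -0
```

The regex engine tries the alternatives from left to right at each position and takes the first one that succeeds, not the longest one.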
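If you want "stable" results, you could use sorted(tokens_re.values()), but then the lookup list(tokens_re.keys())[m.lastindex-1] no longer lines up with the groups, and sorting does not let you choose which pattern is tried first.
It is simpler to keep the patterns in a list / tuple rather than a dictionary, in the order you want them tried.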
For example, you can specify them as a list of pairs, use that list to build the pattern, and keep the names in a second list in the same order (nomes is just a name picked for this example):
tokens_re = [
    ('comentarios', '(//.*)'), # comments
    ('linhas', '(\n)'), # lines
    ('instrucoes', '(add)'), # instructions
    ('numeros_hex', '([-+]?0x[0-9a-fA-F]+)'), # hex numbers
    ('numeros_bin', '([-+]?0b[0-1]+)'), # binary numbers
    ('numeros_dec', '([-+]?[0-9]+)'), # decimal numbers
]
pattern = re.compile('|'.join(p for _, p in tokens_re))
# Names in the same order as the groups; in the loop, use
# nomes[m.lastindex-1] instead of list(tokens_re.keys())[m.lastindex-1].
nomes = [n for n, _ in tokens_re]
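Putting it together, one possible way to write the whole tokenizer is sketched below. It is a variation on the code above rather than the only way to do it: it uses named groups and m.lastgroup in place of the index lookup, and it returns the tokens instead of printing them. The names TOKEN_SPEC and PATTERN are chosen for this example only.

```python
import re

# Order matters: the hex and binary patterns must come before decimal,
# otherwise the decimal pattern would match the leading '0' of '0x5'.
TOKEN_SPEC = [
    ('comentarios', r'//.*'),                 # comments
    ('linhas',      r'\n'),                   # lines
    ('instrucoes',  r'add'),                  # instructions
    ('numeros_hex', r'[-+]?0x[0-9a-fA-F]+'),  # hex numbers
    ('numeros_bin', r'[-+]?0b[01]+'),         # binary numbers
    ('numeros_dec', r'[-+]?[0-9]+'),          # decimal numbers
]

# One named group per token type, joined in list order.
PATTERN = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(code):
    """Return a list of (type, value) pairs, skipping line breaks."""
    tokens = []
    for m in PATTERN.finditer(code):
        # lastgroup is the name of the alternative that matched.
        if m.lastgroup != 'linhas':
            tokens.append((m.lastgroup, m.group()))
    return tokens

for tipo, valor in tokenize('add ax, 0b101\nadd al, -0x5 //haha\n'):
    print(tipo, repr(valor))
```

Because the pattern is built from a list, the alternatives are always tried in the same order, so the result no longer changes between runs.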