PyParsing simple language expressions

Question

I am trying to write something that will parse some code. I can successfully parse foo(spam) and spam+eggs, but foo(spam+eggs) (recursive descent? My terminology from compilers is a bit rusty) fails.

I have the following code:

from pyparsing_py3 import *

myVal = Word(alphas+nums+'_')    
myFunction = myVal + '(' + delimitedList( myVal ) + ')'

myExpr = Forward()
mySubExpr = ( \
    myVal \
    | (Suppress('(') + Group(myExpr) + Suppress(')')) \
    | myFunction \
    )
myExpr << Group( mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr ) )


# SHOULD return: [blah, [foo, +, bar]]
# but actually returns: [blah]
print(myExpr.parseString('blah(foo+bar)'))

+2

python parsing pyparsing

ash 04 Sep '09 at 12:46

2 answers

I have found that a good habit to get into when using the '<<' operator with Forward is to always enclose the RHS in parentheses. That is, this:

myExpr << mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr )

is better written as:

myExpr << ( mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr ) )

This is a result of my unfortunate choice of '<<' as the "insertion" operator for inserting an expression into a Forward. The parentheses are unnecessary in this particular case, but consider this one:

integer = Word(nums)
myExpr << mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr ) | integer

Here we see why I say "unfortunate". If I simplify this to "A << B | C", we can easily see that operator precedence causes it to be evaluated as "(A << B) | C", since '<<' has higher precedence than '|'. The result is that the Forward A only gets the expression B inserted into it. The "| C" part does get executed, but what you get is "A | C", which creates a MatchFirst object that is immediately discarded because it is not assigned to any variable name. The solution is to group the statement in parentheses as "A << (B | C)". In expressions composed only of '+' operations there is no actual need for the parentheses, since '+' has higher precedence than '<<'. But that is relying on luck, and it causes problems when someone later adds an alternative expression using '|' and does not realize the precedence implications. So I suggest simply adopting the "A << (expression)" style to avoid this confusion.
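The grouping described above can be checked without pyparsing at all. As a minimal sketch using only the standard-library ast module (the helper name outermost_operator and the placeholder names A, B, C are made up for illustration), the following reports which operator Python treats as outermost in an expression:

```python
import ast

def outermost_operator(source):
    """Return the class name of the outermost binary operator in an expression."""
    node = ast.parse(source, mode="eval").body
    return type(node.op).__name__

# '|' is outermost, so this is (A << B) | C and only B reaches A
print(outermost_operator("A << B | C"))    # BitOr

# With explicit parentheses, '<<' is outermost and receives all of (B | C)
print(outermost_operator("A << (B | C)"))  # LShift

# '+' binds tighter than '<<', so this case works without parentheses
print(outermost_operator("A << B + C"))    # LShift
```

This only inspects how Python groups the operators; it does not build any parser objects.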

(Someday I'll write pyparsing 2.0, which will allow me to break compatibility with existing code, and change this to use the '<<=' operator, which fixes all of these precedence issues, since '<<=' has lower precedence than any of the other operators used by pyparsing.)
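Why '<<=' would sidestep the problem can be sketched the same way. An augmented assignment is a statement, so everything to its right is parsed as one expression before the assignment happens. The snippet below uses only the standard-library ast module; the helper name describe_augmented is hypothetical, and the snippet shows only how Python parses such a statement, not which pyparsing releases accept it:

```python
import ast

def describe_augmented(source):
    """Return (statement type, assignment operator, outermost RHS operator)."""
    stmt = ast.parse(source).body[0]
    return (
        type(stmt).__name__,           # kind of statement
        type(stmt.op).__name__,        # the operator in front of '='
        type(stmt.value.op).__name__,  # outermost operator of the right-hand side
    )

# The whole of "B | C" is the right-hand side; no parentheses are needed
print(describe_augmented("A <<= B | C"))  # ('AugAssign', 'LShift', 'BitOr')
```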

+4

PaulMcG 04 Sep '09 at 7:29

Alex Martelli · Accepted Answer · 2009-09-04T01:47:33+0000

Several problems: delimitedList looks for a comma-delimited list of myVal, i.e. identifiers, as the only acceptable form of argument list, so of course it cannot match 'foo+bar' (not a comma-delimited list of myVal!). Fixing that reveals another problem: myVal and myFunction start the same way, so their order in mySubExpr matters. Fixing that reveals yet another: two levels of nesting instead of one. This version seems OK:

from pyparsing_py3 import *  # same import as in the question

myVal = Word(alphas+nums+'_')

myExpr = Forward()
mySubExpr = (
    (Suppress('(') + Group(myExpr) + Suppress(')'))
    | myVal + Suppress('(') + Group(delimitedList(myExpr)) + Suppress(')')
    | myVal
    )
myExpr << mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr ) 

print(myExpr.parseString('blah(foo+bar)'))

emits ['blah', ['foo', '+', 'bar']], as desired. I've also removed the redundant backslashes, since logical line continuation happens automatically within parentheses; they were harmless, but hampered readability.
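The point about ordering can be illustrated without pyparsing. The sketch below uses standard-library regular expressions as a stand-in, not pyparsing itself: regex alternation, like a chain of '|' alternatives, tries the choices left to right and keeps the first one that matches. The pattern names ident_first and call_first are made up for this example.

```python
import re

# Stand-in for "myVal | myFunction": the bare identifier is tried first
ident_first = re.compile(r"\w+|\w+\(\w+\)")

# Stand-in for "myFunction | myVal": the longer function-call form is tried first
call_first = re.compile(r"\w+\(\w+\)|\w+")

# The identifier alternative wins and the argument list is never consumed
print(ident_first.match("blah(foo)").group())  # blah

# With the call form first, the whole call is matched
print(call_first.match("blah(foo)").group())   # blah(foo)
```

This is the same reason the bare myVal alternative is listed last in mySubExpr above.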
