PyParsing simple language expressions
I am trying to write something that will parse some code. I can successfully parse foo(spam) and spam+eggs, but foo(spam+eggs) (recursive descent? My terminology from compilers is a bit rusty) fails.
I have the following code:
from pyparsing_py3 import *
myVal = Word(alphas+nums+'_')
myFunction = myVal + '(' + delimitedList( myVal ) + ')'
myExpr = Forward()
mySubExpr = ( \
myVal \
| (Suppress('(') + Group(myExpr) + Suppress(')')) \
| myFunction \
)
myExpr << Group( mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr ) )
# SHOULD return: [blah, [foo, +, bar]]
# but actually returns: [blah]
print(myExpr.parseString('blah(foo+bar)'))
A couple of issues: delimitedList is looking for a comma-delimited list of myVal, i.e. identifiers, as the only acceptable form of argument list, so of course it cannot match "foo+bar" (that is not a comma-delimited list of myVal!). Fixing that reveals another issue: myVal and myFunction start the same way, so their order in mySubExpr matters. Fixing that reveals yet another: two levels of nesting instead of one. This version looks OK:
myVal = Word(alphas+nums+'_')
myExpr = Forward()
mySubExpr = (
(Suppress('(') + Group(myExpr) + Suppress(')'))
| myVal + Suppress('(') + Group(delimitedList(myExpr)) + Suppress(')')
| myVal
)
myExpr << mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr )
print(myExpr.parseString('blah(foo+bar)'))
emits ['blah', ['foo', '+', 'bar']], as desired. I've also removed the redundant backslashes, since logical line continuation happens anyway within parentheses; they were harmless, but hampered readability.
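To make the ordering point concrete, here is a minimal sketch, not taken from the original code: the names ident, call, ident_first and call_first are made up for illustration, and it assumes a pyparsing install where the camelCase names used in this thread (delimitedList, parseString) are still available. The '|' operator builds a MatchFirst, which takes the first alternative that matches, so the bare identifier wins if it is listed before the function call.

```python
from pyparsing import Word, alphas, nums, Suppress, Group, delimitedList

# Hypothetical names, for illustration only.
ident = Word(alphas + nums + '_')
call = ident + Suppress('(') + Group(delimitedList(ident)) + Suppress(')')

# '|' tries alternatives left to right and stops at the first match.
ident_first = ident | call   # the bare identifier matches and parsing stops
call_first = call | ident    # the longer alternative gets tried first

short_result = ident_first.parseString('foo(a, b)').asList()
full_result = call_first.parseString('foo(a, b)').asList()

print(short_result)  # ['foo']
print(full_result)   # ['foo', ['a', 'b']]
```

The first result mirrors the symptom in the question: parseString does not require the whole input to be consumed by default, so the truncated match is returned without any error.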
I have found that a good habit to get into when using the '<<' operator with Forward is to always enclose the RHS in parentheses. That is:
myExpr << mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr )
is better written as:
myExpr << ( mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr ) )
This is a result of my unfortunate choice of '<<' as the "insertion" operator for inserting an expression into a Forward. The parentheses are unnecessary in this particular case, but in this one:
integer = Word(nums)
myExpr << mySubExpr + ZeroOrMore( oneOf('+ - / * =') + mySubExpr ) | integer
we see why I say "unfortunate". If I simplify this to "A << B | C", we can easily see that operator precedence causes it to be evaluated as "(A << B) | C", since '<<' has higher precedence than '|'. The result is that the Forward A only gets the expression B inserted into it. The "| C" part does get executed, but what you get is "A | C", which creates a MatchFirst object that is immediately discarded because it is not assigned to any variable name. The solution is to group the right-hand side in parentheses, as "A << (B | C)". In expressions composed only of '+' operations there is no actual need for the parentheses, since '+' has higher precedence than '<<'. But that is just luck, and it causes problems when someone later adds an alternative using '|' without realizing the precedence implications, so I suggest simply adopting the "A << (expression)" style to avoid this confusion.
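The precedence behaviour described above is plain Python and can be checked without pyparsing. The following is only a toy sketch: Expr and Fwd are made-up stand-ins (not pyparsing classes) that record how the operators end up grouping their operands.

```python
class Expr:
    """Toy stand-in for a parser element; records how operators combine."""
    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Expr("(%s + %s)" % (self.text, other.text))

    def __or__(self, other):
        return Expr("(%s | %s)" % (self.text, other.text))


class Fwd(Expr):
    """Toy stand-in for Forward: '<<' stores whatever is on its right."""
    def __init__(self, text):
        Expr.__init__(self, text)
        self.contents = None

    def __lshift__(self, other):
        self.contents = other.text
        return self


B, C = Expr("B"), Expr("C")

A1 = Fwd("A")
leftover = A1 << B | C      # evaluated as (A1 << B) | C
print(A1.contents)          # B
print(leftover.text)        # (A | C)

A2 = Fwd("A")
A2 << (B | C)               # parentheses force the intended grouping
print(A2.contents)          # (B | C)

A3 = Fwd("A")
A3 << B + C                 # '+' binds tighter than '<<', so this works
print(A3.contents)          # (B + C)
```

In the first case the "(A | C)" object only survives because the sketch assigns it to leftover; written as a bare statement, it would be thrown away as described above.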
(Someday I'll write pyparsing 2.0, which will allow me to break compatibility with existing code, and change this to use the '<<=' operator, which fixes all of these precedence issues, since '<<=' has lower precedence than any of the other operators used by pyparsing.)
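For readers on a later release: pyparsing 2.0 and newer do accept '<<=' on a Forward. The sketch below contrasts the two operators, assuming such a version is installed; the names name, with_lshift and with_ilshift are made up for illustration.

```python
from pyparsing import Forward, Word, alphas, nums, ParseException

integer = Word(nums)
name = Word(alphas)

with_lshift = Forward()
with_lshift << name | integer     # grouped as (with_lshift << name) | integer

with_ilshift = Forward()
with_ilshift <<= name | integer   # the whole right-hand side is inserted

try:
    with_lshift.parseString('42')
    lshift_accepts_integer = True
except ParseException:
    # Only `name` was inserted, so a number is rejected.
    lshift_accepts_integer = False

ilshift_result = with_ilshift.parseString('42').asList()

print(lshift_accepts_integer)  # False
print(ilshift_result)          # ['42']
```

Because '<<=' is an augmented assignment, everything to its right is evaluated before the insertion happens, so no extra parentheses are needed.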