Non-greedy list parsing with pyparsing

I have a string consisting of a list of words that I am trying to parse with pyparsing.

There are always at least three items in the list. From this I want pyparsing to generate three groups: the first containing all words up to the last two, and the other two containing the penultimate and the last word respectively. For example:

"one two three four"

      

should be parsed into something like:

["one two"], "three", "four"

      

I can do it with Regex:

import pyparsing as pp
data = "one two three four"
grammar = pp.Regex(r"(?P<first>(\w+\W?)+)\s(?P<penultimate>\w+) (?P<ultimate>\w+)")
print(grammar.parseString(data).dump())

      

which gives:

['one two three four']
- first: one two
- penultimate: three
- ultimate: four

      

My problem is that I am not getting the same result with a non-Regex ParserElement because of pyparsing's greedy matching. For example, the following:

import pyparsing as pp
data = "one two three four"
word = pp.Word(pp.alphas)
grammar = pp.Group(pp.OneOrMore(word))("first") + word("penultimate") + word("ultimate")
grammar.parseString(data)

      

fails with this traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pyparsing.py", line 1125, in parseString
    raise exc
pyparsing.ParseException: Expected W:(abcd...) (at char 18), (line:1, col:19)

      

because OneOrMore consumes all the words in the list. My attempts so far to prevent this greedy behavior with FollowedBy or NotAny have failed. Any suggestions on how I can get the behavior I want?

1 answer


Well, your OneOrMore expression needs a little tightening up, and you are on the right track with FollowedBy. You don't really want just OneOrMore(word); you want "OneOrMore(a word followed by at least 2 more words)". To add this kind of lookahead in pyparsing, you can even use the new multiplication operator '*' to indicate the count:

grammar = pp.Group(pp.OneOrMore(word + pp.FollowedBy(word*2)))("first") + word("penultimate") + word("ultimate")

      



Now dumping the parsed result gives the desired output:

[['one', 'two'], 'three', 'four']
- first: ['one', 'two']
- penultimate: three
- ultimate: four
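For completeness, here is a minimal self-contained sketch that puts the grammar above into a runnable script. The helper name split_last_two is hypothetical and exists only for this illustration; everything else uses the same pyparsing calls as the snippets above. It assumes pyparsing is installed.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)

# A word is only accepted into "first" if at least two more words follow it.
# FollowedBy is a lookahead: it checks for the two words without consuming them,
# so the repetition stops just before the last two words.
leading = pp.OneOrMore(word + pp.FollowedBy(word * 2))
grammar = pp.Group(leading)("first") + word("penultimate") + word("ultimate")


def split_last_two(text):
    """Hypothetical helper: return (first, penultimate, ultimate) for a word list."""
    result = grammar.parseString(text)
    return result["first"].asList(), result["penultimate"], result["ultimate"]


print(split_last_two("one two three four"))
```

With fewer than three words the lookahead can never succeed, so parseString raises pp.ParseException rather than returning a partial result.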

      
