Greedy list parsing with pyparsing
I have a string consisting of a list of words that I am trying to parse with pyparsing.
There are always at least three items in the list. From this I want pyparsing to generate three groups: the first containing all words up to the last two, and the second and third containing the second-to-last and last word respectively. For example:
"one two three four"
should be parsed into something like:
["one two"], "three", "four"
I can do it with Regex:
import pyparsing as pp
data = "one two three four"
grammar = pp.Regex(r"(?P<first>(\w+\W?)+)\s(?P<penultimate>\w+) (?P<ultimate>\w+)")
print(grammar.parseString(data).dump())
which gives:
['one two three four']
- first: one two
- penultimate: three
- ultimate: four
My problem is that I cannot get the same result with non-Regex ParserElements because of pyparsing's greedy matching. For example, the following:
import pyparsing as pp
data = "one two three four"
word = pp.Word(pp.alphas)
grammar = pp.Group(pp.OneOrMore(word))("first") + word("penultimate") + word("ultimate")
grammar.parseString(data)
fails with the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/site-packages/pyparsing.py", line 1125, in parseString
raise exc
pyparsing.ParseException: Expected W:(abcd...) (at char 18), (line:1, col:19)
because OneOrMore consumes all the words in the list, leaving nothing for the two trailing word expressions. My attempts so far to prevent this greedy behavior with FollowedBy or NotAny have failed - any suggestions on how I can get the behavior I want?
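To make the greedy behavior concrete, here is a minimal sketch that runs OneOrMore on its own and then inside the failing grammar. The variable names `greedy`, `consumed` and `failed` are only illustrative; the calls use the same camelCase pyparsing API as the snippets above (pyparsing 3 also offers snake_case spellings such as `parse_string`).

```python
import pyparsing as pp

data = "one two three four"
word = pp.Word(pp.alphas)

# OneOrMore does not backtrack: it keeps matching words until none are left.
greedy = pp.OneOrMore(word)
consumed = greedy.parseString(data).asList()
print(consumed)

# With every word already consumed, the trailing word expressions cannot match.
try:
    (pp.Group(greedy)("first") + word("penultimate") + word("ultimate")).parseString(data)
    failed = False
except pp.ParseException as exc:
    failed = True
    print("parse failed:", exc)
```

The first print shows all four words, which is why the parser reports that it expected another word at the end of the string.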
Well, your OneOrMore expression just needs a little tightening up - you are on the right track with FollowedBy. You don't really want just OneOrMore(word), you want "OneOrMore(word that is followed by at least 2 more words)". To add this kind of lookahead to pyparsing, you can even use the new '*' multiplication operator to specify the lookahead count:
grammar = pp.Group(pp.OneOrMore(word + pp.FollowedBy(word*2)))("first") + word("penultimate") + word("ultimate")
Now dumping the parse results gives the desired output:
[['one', 'two'], 'three', 'four']
- first: ['one', 'two']
- penultimate: three
- ultimate: four
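For reference, the pieces above can be assembled into one runnable script. This is only a sketch: the helper `split_words` is a hypothetical name introduced here for illustration, not part of pyparsing or of the original answer.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)

# A word only counts toward "first" if two more words follow it,
# so the last two words are left for the trailing expressions.
grammar = (
    pp.Group(pp.OneOrMore(word + pp.FollowedBy(word * 2)))("first")
    + word("penultimate")
    + word("ultimate")
)


def split_words(text):
    """Hypothetical helper: return (first, penultimate, ultimate) for text."""
    result = grammar.parseString(text)
    return result["first"].asList(), result["penultimate"], result["ultimate"]


print(split_words("one two three four"))
print(split_words("one two three"))
```

With only two words in the input, no word is followed by two more, so OneOrMore cannot match even once and the grammar raises a ParseException, consistent with the "at least three items" requirement in the question.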