Pyparsing: nested Markdown emphasis
I'm toying with some simple Markdown text to play with and learn Pyparsing and grammars in general. I immediately ran into a problem I'm having trouble solving. I am trying to parse a simplified version of CommonMark emphasis. The spec allows nested emphasis, so that
*foo *bar* baz*
becomes
<em>foo <em>bar</em> baz</em>
I tried using a recursive definition to match this, but it doesn't work. Here's some sample code:
    from pyparsing import *

    text = Word(printables, excludeChars="*")
    enclosed = Forward()
    emphasis = QuotedString("*").setParseAction(lambda x: "<em>%s</em>" % x[0], contents=enclosed)
    enclosed << emphasis | text

    test = """ *foo *bar* bar* """
    print(emphasis.transformString(test))

But this is what I get back:
<em>foo </em>bar<em> bar</em>
Forgive the newbie question; can anyone point me in the right direction?
In response to a good question, let me clarify. I'm just playing around, so I can work with an arbitrarily restricted form of the notation. I am assuming that only single *s occur and that they are never adjacent to each other. That leaves whitespace to resolve the ambiguity: a * not followed by a space opens emphasis, and a * not preceded by a space closes it.
Even so, I'm not sure how to proceed with Pyparsing. Some kind of stack-based approach, pushing each opening * and popping it when a closing * is found? How can this be done with Pyparsing? Or is there a better approach?
With these additional rules, I don't think you need to worry about recursion at all. Just handle the opening and closing emphasis markers as you find them, whether they match up or not:
    from pyparsing import *

    # an opening marker is a '*' at the start of a line or right after whitespace
    openEmphasis = (LineStart() | White()) + Suppress('*')
    openEmphasis.setParseAction(lambda x: ''.join(x.asList() + ['<em>']))
    # a closing marker is a '*' right before whitespace or the end of a line
    closeEmphasis = '*' + FollowedBy(White() | LineEnd())
    closeEmphasis.setParseAction(lambda x: '</em>')

    emphasis = (openEmphasis | closeEmphasis).leaveWhitespace()

    test = """ *foo *bar* bar* """
    print(test)
    print(emphasis.transformString(test))
This prints:
    *foo *bar* bar*
    <em>foo <em>bar</em> bar</em>
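The transform above emits a tag for every marker, whether or not it has a partner. If unmatched markers should stay literal, the stack idea from the question could be sketched roughly as follows. This is plain Python rather than Pyparsing, `render_emphasis` is a hypothetical name, and it assumes slightly stricter rules than the grammar above: an opener must also not be followed by whitespace, and a closer must not be preceded by it.

```python
def render_emphasis(text):
    """Replace matched * pairs with <em>...</em>; leave stray * literal."""
    out = []         # output fragments, one per input character or tag
    open_stack = []  # positions in `out` of openers still waiting for a closer
    for i, ch in enumerate(text):
        if ch != "*":
            out.append(ch)
            continue
        # Treat the edges of the text as whitespace (assumption).
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if before.isspace() and not after.isspace():
            # Opener: remember where it sits; keep it literal for now.
            open_stack.append(len(out))
            out.append("*")
        elif after.isspace() and not before.isspace() and open_stack:
            # Closer with a pending opener: upgrade both to tags.
            out[open_stack.pop()] = "<em>"
            out.append("</em>")
        else:
            # Stray marker: nothing to pair it with.
            out.append("*")
    return "".join(out)


print(render_emphasis("*foo *bar* bar*"))
```

With this variant an input such as `*foo bar`, which has no closer, comes back unchanged instead of gaining a dangling tag.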
You are not the first to trip over this kind of application. When I presented at PyCon'06, an attendee got carried away trying to parse some markdown, with an input string something like
"****a** b**** c**"
or something similar. We worked on it together for a bit, but the disambiguation rules were too context-sensitive for a basic pyparsing parser.
Think about what you are asking for. When does the second * close the emphasis, and when does it open a nested emphasis? You have not provided any rule to distinguish the two. Since this is always 100% ambiguous, you can only get one of the following results:
- No emphasis can ever be closed, or
- No emphasis can ever be nested.
I doubt you are asking how to switch from the second to the first.
So what are you asking for?
You need to implement some rule to disambiguate between these two possibilities.
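To make the second outcome concrete, here is a minimal sketch of the rule "each * pairs with the very next *", which is roughly what the code in the question ends up doing. It uses the standard `re` module rather than Pyparsing, and `pair_adjacent` is a hypothetical name.

```python
import re


def pair_adjacent(text):
    """Pair each * with the very next *, so emphasis can never nest."""
    return re.sub(r"\*([^*]*)\*", r"<em>\1</em>", text)


print(pair_adjacent("*foo *bar* baz*"))
# <em>foo </em>bar<em> baz</em>
```

The second * is consumed as a closer, so `bar` ends up outside any emphasis instead of inside a nested one.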
In fact, if you read the documents you linked to, they have a complex set of rules that define exactly when a * can open emphasis and when it cannot, and likewise for closing; given these rules, if a * is still ambiguous, it closes the emphasis. You must implement this.
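As a rough illustration of what such rules look like, here is a whitespace-only sketch in plain Python. It is not the full CommonMark rule set, which also takes punctuation into account; it only encodes the simplified rules from the question, plus the tie-break that an ambiguous * closes if emphasis is currently open. `classify` and `resolve` are hypothetical names.

```python
def classify(text, i):
    """Return (can_open, can_close) for the '*' at text[i]."""
    if text[i] != "*":
        raise ValueError("text[i] is not '*'")
    # Treat the edges of the text as whitespace (assumption).
    before = text[i - 1] if i > 0 else " "
    after = text[i + 1] if i + 1 < len(text) else " "
    can_open = not after.isspace()    # not followed by a space
    can_close = not before.isspace()  # not preceded by a space
    return can_open, can_close


def resolve(text, i, emphasis_is_open):
    """Pick a single role for the '*' at text[i]."""
    can_open, can_close = classify(text, i)
    if can_close and emphasis_is_open:
        return "close"    # closing wins when both roles are possible
    if can_open:
        return "open"
    return "literal"      # e.g. a '*' with spaces on both sides
```

For `*foo *bar* baz*` every marker gets exactly one role, so no tie-break is needed. In something like `a*b` the marker could play either role, and the tie-break decides based on whether emphasis is already open.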