Pyparsing for paragraphs
I'm having a little pyparsing issue that I can't seem to solve. I would like to write a rule that will parse a multi-line paragraph for me. The end goal is to complete a recursive grammar that will parse something like:
Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

    but this is also a different line
    with more lines attached

    Other: cool
        This is another indented block
        possibly with more paragraphs

        This is another way to keep this up
        and write more things

    But then we can keep writing at the old level
    and get this
Into something like HTML, perhaps (with a parse tree, of course, I can convert this to whatever format I like):
<Heading class="awesome">
    <p> This is a paragraph and then a line break is inserted and then we have more text </p>
    <p> but this is also a different line with more lines attached</p>
    <Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>
    <p> But then we can keep writing at the old level and get this</p>
</Heading>
Progress
I have managed to get to the stage where I can parse the heading line and the indentation using pyparsing. But I cannot:
- Define a paragraph as multiple lines to be joined together
- Allow indented paragraphs
Example
By following here, I can get the paragraphs to be output on a single line, but there doesn't seem to be a way to turn this into a parse tree without removing the line break characters.
I believe the paragraph should be:
words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd
But that doesn't seem to work for me. Any ideas would be great :)
So I managed to solve this. For anyone else who stumbles upon this in the future, you can define a paragraph as follows, although this is definitely not perfect and does not exactly match the grammar I described. Relevant code:
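from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd

# one line of text: everything up to the newline, with the newline dropped
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
# "empty line": any position where a text line does not match
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)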
where join_lines is defined as:
def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined
This should point you in the right direction if it suits your needs :) I hope this helps!
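To make the pieces above easier to try out, here is a minimal end-to-end sketch. The sample text and the use of searchString to walk over several paragraphs are my own additions for illustration, not part of the original grammar, and it ignores indentation entirely.

```python
from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd


def join_lines(tokens):
    # Collapse the lines of one paragraph into a single string.
    return " ".join(t.strip() for t in tokens)


# One line of text: everything up to the newline, with the newline dropped.
line = OneOrMore(CharsNotIn("\n")) + Suppress(lineEnd)
# Succeeds wherever a text line does not start (for example at a blank line).
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)

# Hypothetical sample input: two paragraphs separated by one empty line.
sample = (
    "This is a paragraph and then\n"
    "a line break is inserted\n"
    "\n"
    "but this is also a different line\n"
    "with more lines attached\n"
)

# searchString scans the input and collects every match of `paragraph`.
paragraphs = [match[0] for match in paragraph.searchString(sample)]
for p in paragraphs:
    print(p)
```

Each match comes back as a single joined string, so the paragraphs can be wrapped in whatever output format is needed afterwards.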
A better blank line
The definition of an empty line above is definitely not ideal and could be improved a lot. The best way I have found is this:
empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")
This allows you to have blank lines filled with spaces without breaking the match.
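For reference, the rule being aimed at (a line that is empty or contains only spaces ends the current paragraph) can be written out in plain Python. This is only a sketch of the intended behaviour using the standard library, not pyparsing, and split_paragraphs is a hypothetical helper name, but it is handy for checking what a grammar should produce.

```python
def split_paragraphs(text):
    """Group lines into paragraphs; an empty or whitespace-only line ends one."""
    paragraphs = []
    current = []
    for raw in text.splitlines():
        if raw.strip() == "":
            # Blank line (possibly filled with spaces): close the paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(raw.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical sample: the separator line contains three spaces.
sample = "first line\nsecond line\n   \nnext paragraph\n"
result = split_paragraphs(sample)
print(result)
```

Here the line holding only spaces still separates the two paragraphs, which is the behaviour the empty_line rule is meant to give.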