Pyparsing for paragraphs

I'm having a small pyparsing problem that I can't seem to solve. I would like to write a rule that parses a multi-line paragraph. The end goal is a recursive grammar that will parse something like:

Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

    but this is also a different line
    with more lines attached

    Other: cool
        This is another indented block
        possibly with more paragraphs

        This is another way to keep this up
        and write more things

    But then we can keep writing at the old level
    and get this

      

into something like HTML (with a parse tree, of course, I can convert this to whatever format I like):

<Heading class="awesome">

    <p> This is a paragraph and then a line break is inserted and then we have more text </p>

    <p> but this is also a different line with more lines attached</p>

    <Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>

    <p> But then we can keep writing at the old level and get this</p>
</Heading>

      

Progress

I have managed to get to the stage where I can parse the heading and the indented block using pyparsing. But I cannot:

  • Define a paragraph as multiple lines that are joined together
  • Allow indented blocks inside a paragraph's level

Example

By following the example here, I can get the paragraphs output on a single line, but there doesn't seem to be a way to turn this into a parse tree without removing the line break characters.

I believe the paragraph should be:

words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd

      

But that doesn't seem to work for me. Any ideas would be great :)



1 answer


I managed to solve this myself, so here it is for anyone else who stumbles upon this in the future. You can define a paragraph as follows, although this is definitely not perfect and does not exactly match the grammar I described. Relevant code:

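from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd

line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)  # one non-empty line of text
emptyline = ~line  # negative lookahead: what follows is not a text line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)  # join_lines is defined below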

      

where join_lines is defined as:

def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined

      

This should point you in the right direction if it suits your needs :) I hope this helps!
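As a quick check of the rule above, here is a minimal sketch that runs it over a shortened version of the sample text from the question. The variable names `text` and `paragraphs` are only illustrative, and `searchString` is used here simply to collect every match; it is not part of the original answer.

```python
from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd


def join_lines(tokens):
    # Strip the indentation from each line and join the lines with spaces.
    stripped = [t.strip() for t in tokens]
    return " ".join(stripped)


# Same grammar as in the answer above.
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)

# Illustrative input: two paragraphs separated by a truly empty line.
text = (
    "This is a paragraph and then\n"
    "    a line break is inserted\n"
    "\n"
    "    but this is also a different line\n"
    "    with more lines attached\n"
)

# parseString stops after the first paragraph.
first = paragraph.parseString(text).asList()

# searchString scans the whole text and returns one result per paragraph.
paragraphs = [tokens[0] for tokens in paragraph.searchString(text)]
print(paragraphs)
```

Note that the indentation is thrown away by `join_lines`, so this sketch flattens the nesting; recovering the `Heading:` / `Other:` structure would need something extra on top, which is outside what the answer covers.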



A better blank line

The definition of an empty line above is definitely not ideal and could be improved a lot. The best approach I have found is this:

empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")

      

This allows you to have blank lines filled with spaces without breaking the match.
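To illustrate why the first definition (`emptyline = ~line`) struggles here, the sketch below feeds it a "blank" line that actually contains three spaces. Because `CharsNotIn('\n')` happily matches a run of spaces, that line is treated as text and the two paragraphs are merged. The input string is made up for the demonstration.

```python
from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd


def join_lines(tokens):
    # Strip each line and join with single spaces.
    return " ".join(t.strip() for t in tokens)


# The first (naive) grammar from the answer.
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)

# The middle line holds three spaces, so it is not really empty.
text = "first line\n   \nsecond line\n"

merged = paragraph.parseString(text).asList()
print(merged)  # one paragraph; the space-only line became an empty token
```

The doubled space in the joined result comes from the space-only line being stripped to an empty string, which is the symptom a dedicated blank-line rule is meant to avoid.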
