Parse the given lines up to a keyword with pyparsing

I am trying to parse the given lines and then group them in a list.

Here's my script:

from pyparsing import *

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

EOL = LineEnd().suppress()
start = Keyword('START').suppress() + EOL
end = Keyword('END').suppress() + EOL

line = SkipTo(LineEnd()) + EOL
lines = start + OneOrMore(start | end | Group(line))

start.setDebug()
end.setDebug()
line.setDebug()

result = lines.parseString(data)
results_list = result.asList()

print(results_list)

      

This code was inspired by another Stack Overflow question: Matching nonempty lines with pyparsing

I need to parse everything from START to END line by line and store it in a list for each group (everything from START to END is one group). However, this script puts each line in a new group.

This is the result:

[['line 2'], ['line 3'], ['line 4'], ['line a'], ['line b'], ['line c'], ['']]

      

And I want it to be:

[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]

      

It also parses the empty line at the end.

I am a pyparsing beginner, so I would appreciate your help.

Thanks.

1 answer


You can use nestedExpr to find text delimited by START and END.

If you call it with its default arguments,

In [322]: pp.nestedExpr('START', 'END').searchString(data).asList()
Out[322]: 
[[['line', '2', 'line', '3', 'line', '4']],
 [['line', 'a', 'line', 'b', 'line', 'c']]]

      

then the text is split on every whitespace character. (Note the 'line', '2' above, where we want 'line 2' instead.) We would prefer to split only on '\n'. To fix this, we can use the content parameter of pp.nestedExpr, which controls what counts as a single item inside the nested list. The source code for nestedExpr defines

content = (Combine(OneOrMore(~ignoreExpr + 
                ~Literal(opener) + ~Literal(closer) +
                CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS,exact=1))
            ).setParseAction(lambda t:t[0].strip()))

      

by default, where pp.ParserElement.DEFAULT_WHITE_CHARS is



In [324]: pp.ParserElement.DEFAULT_WHITE_CHARS
Out[324]: ' \n\t\r'

      

This is what causes nestedExpr to split on all whitespace. So if we reduce this to just '\n', then nestedExpr splits the content into lines rather than on every whitespace character.
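
To see the effect in isolation, here is a small sketch that compares the two character sets on a single string. The names word, whole_line and sample are made up for this illustration and are not part of pyparsing or of nestedExpr.

```python
import pyparsing as pp

sample = "line 2\nline 3"

# Stops at any default whitespace character, so it only collects "line".
word = pp.Combine(pp.OneOrMore(pp.CharsNotIn(' \n\t\r', exact=1)))

# Stops only at a newline, so it collects the whole first line.
whole_line = pp.Combine(pp.OneOrMore(pp.CharsNotIn('\n', exact=1)))

print(word.parseString(sample).asList())        # ['line']
print(whole_line.parseString(sample).asList())  # ['line 2']
```

The only difference between the two expressions is the set of excluded characters, which is the same change applied to content in the full example.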


import pyparsing as pp

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

opener = 'START'
closer = 'END'
content = pp.Combine(pp.OneOrMore(~pp.Literal(opener) 
                                  + ~pp.Literal(closer) 
                                  + pp.CharsNotIn('\n',exact=1)))
expr = pp.nestedExpr(opener, closer, content=content)

result = [item[0] for item in expr.searchString(data).asList()]
print(result)

      

gives

[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
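
As an alternative, the grammar from the question can probably be repaired directly rather than replaced with nestedExpr. This is only a sketch, assuming every block is exactly START, one or more lines, then END. The original grouped each line separately because Group wrapped only line; here Group wraps the whole block, and a negative lookahead (~end) stops the line expression from consuming the END line. The names block and blocks are introduced for this illustration.

```python
import pyparsing as pp

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

EOL = pp.LineEnd().suppress()
start = pp.Keyword('START').suppress() + EOL
end = pp.Keyword('END').suppress() + EOL

# A content line is anything up to the end of the line,
# provided the line is not the END marker.
line = ~end + pp.SkipTo(pp.LineEnd()) + EOL

# Group the whole START ... END block, not each individual line.
block = pp.Group(start + pp.OneOrMore(line) + end)
blocks = pp.OneOrMore(block)

result = blocks.parseString(data).asList()
print(result)
```

Because the parser stops after the last END, the trailing empty line is never matched, so the stray [''] entry from the question should not appear.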

      
