Parse the given lines up to a keyword with pyparsing
I am trying to parse the given lines and then group them in a list.
Here's my script:
from pyparsing import *
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
EOL = LineEnd().suppress()
start = Keyword('START').suppress() + EOL
end = Keyword('END').suppress() + EOL
line = SkipTo(LineEnd()) + EOL
lines = start + OneOrMore(start | end | Group(line))
start.setDebug()
end.setDebug()
line.setDebug()
result = lines.parseString(data)
results_list = result.asList()
print(results_list)
This code was inspired by another Stack Overflow question: Matching non-empty strings with pyparsing
I need to parse everything from START to END line by line and store it in a list for each group (everything from START to END is one group). However, this script puts each line in a new group.
This is the result:
[['line 2'], ['line 3'], ['line 4'], ['line a'], ['line b'], ['line c'], ['']]
And I want it to be:
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
It also parses the empty line at the end.
I am a pyparsing beginner, so I would appreciate your help.
Thanks.
You can use nestedExpr to find text delimited by START and END.
If you use
In [322]: pp.nestedExpr('START', 'END').searchString(data).asList()
Out[322]:
[[['line', '2', 'line', '3', 'line', '4']],
[['line', 'a', 'line', 'b', 'line', 'c']]]
then the text is split on whitespace. (Note that we get 'line', '2' above where we want 'line 2' instead.) We would prefer it to split only on '\n'.
Therefore, to fix this, we can use the content parameter of pp.nestedExpr, which allows us to control what counts as an element within the nested list. The source code for nestedExpr defines
content = (Combine(OneOrMore(~ignoreExpr + ~Literal(opener) + ~Literal(closer) + CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS,exact=1)) ).setParseAction(lambda t:t[0].strip()))
by default, where pp.ParserElement.DEFAULT_WHITE_CHARS is
In [324]: pp.ParserElement.DEFAULT_WHITE_CHARS
Out[324]: ' \n\t\r'
This is what causes nestedExpr to split on all whitespace. So if we reduce this to just '\n', then nestedExpr breaks the content into lines rather than splitting on every whitespace character.
import pyparsing as pp
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
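opener = 'START'
closer = 'END'
# An element is a run of characters other than '\n' that stops
# before any occurrence of the opener or closer.
content = pp.Combine(pp.OneOrMore(~pp.Literal(opener)
                                  + ~pp.Literal(closer)
                                  + pp.CharsNotIn('\n', exact=1)))
expr = pp.nestedExpr(opener, closer, content=content)
# searchString wraps each match in an extra list level; item[0] unwraps it.
result = [item[0] for item in expr.searchString(data).asList()]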
print(result)
gives
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
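As an alternative, the grammar from the question can also be repaired directly. In the original, every line ends up in its own list because Group wraps each individual line; moving Group around the whole START ... END block groups the lines per block instead. The following is only a sketch, assuming the blocks are never nested and that no content line begins with the keyword END; the names block and groups are mine and do not appear in the question.

```python
import pyparsing as pp

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

EOL = pp.LineEnd().suppress()
start = pp.Keyword('START').suppress() + EOL
end = pp.Keyword('END').suppress() + EOL

# A content line is anything up to the end of the line, as long as
# the line does not begin with the END keyword (negative lookahead).
line = ~pp.Keyword('END') + pp.SkipTo(pp.LineEnd()) + EOL

# Group the whole block rather than each line, so that every
# START ... END section becomes one sub-list.
block = pp.Group(start + pp.OneOrMore(line) + end)

groups = pp.OneOrMore(block).parseString(data).asList()
print(groups)
```

With the sample data above this should yield the same two groups as the nestedExpr version. Because the line expression refuses to start at END, the repetition stops cleanly at the closing keyword, so no empty trailing entry is produced.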