Parsing
I need to parse a file with information separated by curly braces, for example:
Continent
{
Name Europe
Country
{
Name UK
Dog
{
Name Fiffi
Colour Gray
}
Dog
{
Name Smut
Colour Black
}
}
}
Here is what I tried in Python:
from io import open
from pyparsing import *
import pprint

def parse(s):
    return nestedExpr('{','}').parseString(s).asList()

def test(strng):
    print strng
    try:
        cfgFile = file(strng)
        cfgData = "".join( cfgFile.readlines() )
        list = parse( cfgData )
        pp = pprint.PrettyPrinter(2)
        pp.pprint(list)
    except ParseException, err:
        print err.line
        print " "*(err.column-1) + "^"
        print err
    cfgFile.close()
    print
    return list

if __name__ == '__main__':
    test('testfile')
But this fails with an error:
testfile
Continent
^
Expected "{" (at char 0), (line:1, col:1)
Traceback (most recent call last):
  File "xxx.py", line 55, in <module>
    test('testfile')
  File "xxx.py", line 40, in test
    return list
UnboundLocalError: local variable 'list' referenced before assignment
What do I need to do to make this work? Is another parser better suited than pyparsing?
Recursion is the key here. Try something along these lines:
def parse(it):
    result = []
    while True:
        try:
            tk = next(it)
        except StopIteration:
            break

        if tk == '}':
            break
        val = next(it)
        if val == '{':
            result.append((tk,parse(it)))
        else:
            result.append((tk, val))

    return result
Usage:
import pprint
data = """
Continent
{
Name Europe
Country
{
Name UK
Dog
{
Name Fiffi
Colour Gray
}
Dog
{
Name Smut
Colour Black
}
}
}
"""
r = parse(iter(data.split()))
pprint.pprint(r)
... which produces (Python 2.6):
[('Continent', [('Name', 'Europe'), ('Country', [('Name', 'UK'), ('Dog', [('Name', 'Fiffi'), ('Colour', 'Gray')]), ('Dog', [('Name', 'Smut'), ('Colour', 'Black')])])])]
Please consider this only as a starting point and feel free to improve the code as needed (depending on your data, a dictionary might be a better choice). Also, the example code does not handle malformed data (in particular, an extra or missing }); I urge you to aim for full test coverage ;)
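The two gaps mentioned above (unbalanced braces, and a dictionary as the result type) could be handled along these lines. This is only a sketch: parse_strict and to_dict are hypothetical names that are not part of the code above, and collecting repeated keys such as Dog into a list is an assumption about what the data needs.

```python
def parse_strict(tokens):
    """Parse a flat token list into nested (key, value) pairs.

    Raises ValueError on an extra or missing '}' or on a key without a value.
    """
    it = iter(tokens)

    def block(depth):
        result = []
        for tk in it:
            if tk == '}':
                if depth == 0:
                    raise ValueError("unexpected '}'")
                return result
            if tk == '{':
                raise ValueError("'{' without a preceding key")
            val = next(it, None)
            if val is None or val == '}':
                raise ValueError("key %r has no value" % tk)
            if val == '{':
                # Nested block: recurse one level deeper on the same iterator.
                result.append((tk, block(depth + 1)))
            else:
                result.append((tk, val))
        # Tokens ran out: only acceptable at the top level.
        if depth > 0:
            raise ValueError("missing '}'")
        return result

    return block(0)


def to_dict(pairs):
    """Convert (key, value) pairs to a dict; repeated keys collect into a list."""
    out = {}
    for key, val in pairs:
        if isinstance(val, list):
            val = to_dict(val)
        if key in out:
            if not isinstance(out[key], list):
                out[key] = [out[key]]
            out[key].append(val)
        else:
            out[key] = val
    return out


sample = ("Continent { Name Europe Country { Name UK "
          "Dog { Name Fiffi Colour Gray } Dog { Name Smut Colour Black } } }")
tree = to_dict(parse_strict(sample.split()))
print(tree['Continent']['Country']['Dog'][1]['Name'])  # Smut
```

Note that with this to_dict a key that occurs once maps to a single value while a key that occurs several times maps to a list, so callers have to check which shape they got; always using lists would be more uniform at the cost of extra indexing.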
EDIT: After discovering pyparsing, I tried the following, which seems to work (much) better and could be (more) easily adapted to special needs:
import pprint
from pyparsing import Word, Literal, Forward, Group, ZeroOrMore, alphas

def syntax():
    lbr = Literal( '{' ).suppress()
    rbr = Literal( '}' ).suppress()
    key = Word( alphas )
    atom = Word ( alphas )
    expr = Forward()
    pair = atom | (lbr + ZeroOrMore( expr ) + rbr)
    expr << Group ( key + pair )
    return expr

expr = syntax()
result = expr.parseString(data).asList()
pprint.pprint(result)
Output:
[['Continent', ['Name', 'Europe'], ['Country', ['Name', 'UK'], ['Dog', ['Name', 'Fiffi'], ['Colour', 'Gray']], ['Dog', ['Name', 'Smut'], ['Colour', 'Black']]]]]
Nested expressions are very common, and they usually require defining a recursive parser, or writing recursive code if you are not using a parsing library. This code can be tricky for beginners and error-prone even for experts, which is why I added the nestedExpr helper to pyparsing.
The problem you are having is that your input string contains more than just an expression of nested curly braces. When I first try out a parser, I keep the test as simple as possible, for example by inlining a sample instead of reading it from a file:
test = """\
Continent
{
Name Europe
Country
{
Name UK
Dog
{
Name Fiffi
Colour "light Gray"
}
Dog
{
Name Smut
Colour Black
}}}"""
from pyparsing import *
expr = nestedExpr('{','}')
print expr.parseString(test).asList()
And I get the same parsing error as you:
Traceback (most recent call last):
  File "nb.py", line 25, in <module>
    print expr.parseString(test).asList()
  File "c:\python26\lib\site-packages\pyparsing-1.5.7-py2.6.egg\pyparsing.py", line 1006, in parseString
    raise exc
pyparsing.ParseException: Expected "{" (at char 1), (line:1, col:1)
So, looking at the error message (and even at the output of your own debugging code), pyparsing stumbles on the leading word "Continent", because that word is not the start of a nested expression in curly braces; pyparsing (as we see in the exception message) was looking for an opening '{'.
The solution is to modify your parser slightly so that it handles the leading "Continent" label, by changing the expression to:
expr = Word(alphas) + nestedExpr('{','}')
Now, printing the results as a list (using pprint as the OP did, good job) looks like this:
['Continent',
['Name',
'Europe',
'Country',
['Name',
'UK',
'Dog',
['Name', 'Fiffi', 'Colour', '"light Gray"'],
'Dog',
['Name', 'Smut', 'Colour', 'Black']]]]
which should match the nesting of the braces in your input.
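Within each brace level the list above is flat: keys and values simply alternate. If you would rather have explicit key/value pairs, a small post-processing step could group them. The following is a minimal sketch in plain Python that works on the list exactly as printed above; pair_up is a hypothetical helper of my own naming here, not something pyparsing provides.

```python
def pair_up(flat):
    """Group an alternating [key, value, key, value, ...] list into
    (key, value) tuples, recursing into nested lists."""
    if len(flat) % 2:
        raise ValueError("expected alternating key/value items, got %r" % (flat,))
    pairs = []
    for key, val in zip(flat[0::2], flat[1::2]):
        if isinstance(val, list):
            val = pair_up(val)
        pairs.append((key, val))
    return pairs


# The parse result shown above, copied as a list literal.
flat = ['Continent',
        ['Name', 'Europe',
         'Country',
         ['Name', 'UK',
          'Dog', ['Name', 'Fiffi', 'Colour', '"light Gray"'],
          'Dog', ['Name', 'Smut', 'Colour', 'Black']]]]

pairs = pair_up(flat)
print(pairs[0][0])     # Continent
print(pairs[0][1][0])  # ('Name', 'Europe')
```

Note that the quoted value keeps its quotation marks ('"light Gray"'), just as in the parse result, so strip them separately if you do not want them.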