Processing a structured text file with Python

I have a large text file structured in blocks like:

    Student = {
        PInfo = {
            ID   = 0001;
            Name.First = "Joe";
            Name.Last = "Burger";
            DOB  = "01/01/2000";
        };
        School = "West High";
        Address = {
            Str1 = "001 Main St.";
            Zip = 12345;
        };
    };
    Student = {
        PInfo = {
            ID   = 0002;
            Name.First = "John";
            Name.Last = "Smith";
            DOB  = "02/02/2002";
        };
        School = "East High";
        Address = {
            Str1 = "001 40nd St.";
            Zip = 12346;
        };
        Club = "Football";
    };
    ....

      

Every Student block has the same entries, such as "PInfo", "School" and "Address", but some may have additional entries, such as the "Club" entry for "John Smith" that is missing for "Joe Burger". What I want to do is get the name, school name and zip code of each student and store them in a dictionary like

    {'Joe Burger': {'School': 'West High', 'Zip': 12345}, 'John Smith': {'School': 'East High', 'Zip': 12346}, ...}

      

Being new to Python programming, I tried to open the file and parse it line by line, but that seems very cumbersome, and the real file is much larger and more complex than the example above. Is there an easier way to do this? Thanks in advance.



3 answers


To parse a file, you can define a grammar that describes your input format and use that to generate a parser.

There are many language parsers available in Python. For example, you can use Grako, which takes grammars in a variation of EBNF as input and outputs memoizing PEG parsers in Python.

To install Grako, run:

$ pip install grako

Here's a grammar for your format using Grako's flavor of EBNF syntax:

(* a file is zero or more records *)
file = { record }* $;
record = name '=' value ';' ;
name = /[A-Z][a-zA-Z0-9.]*/ ;
value = object | integer | string ;
(* an object contains one or more records *)
object = '{' { record }+ '}' ;
integer = /[0-9]+/ ;
string = '"' /[^"]*/ '"';

      

To generate a parser, save the grammar to a file, for example Structured.ebnf, and run:



$ grako -o structured_parser.py Structured.ebnf

      

This creates a module, structured_parser, that can be used to extract the student information from the input:

#!/usr/bin/env python
from pprint import pprint

from structured_parser import StructuredParser


class Semantics(object):
    def record(self, ast):
        # record = name '=' value ';' ;
        # value = object | integer | string ;
        return ast[0], ast[2]  # name, value

    def object(self, ast):
        # object = '{' { record }+ '}' ;
        return dict(ast[1])

    def integer(self, ast):
        # integer = /[0-9]+/ ;
        return int(ast)

    def string(self, ast):
        # string = '"' /[^"]*/ '"';
        return ast[1]


with open('input.txt') as f:
    text = f.read()

parser = StructuredParser()
ast = parser.parse(text, rule_name='file', semantics=Semantics())

# Keep only the top-level Student records, then build the name -> info mapping.
students = [value for name, value in ast if name == 'Student']
d = {'{0[Name.First]} {0[Name.Last]}'.format(s['PInfo']):
     dict(School=s['School'], Zip=s['Address']['Zip'])
     for s in students}
pprint(d)

      

Output

{'Joe Burger': {'School': u'West High', 'Zip': 12345},
 'John Smith': {'School': u'East High', 'Zip': 12346}}
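
If installing a parser generator is not an option, the same grammar is small enough to hand-code. The following is only a sketch and not part of the Grako approach above: a regex tokenizer plus a recursive-descent parser using nothing but the standard library. The names TOKEN_RE, tokenize, parse, parse_records, parse_value and expect are made up for this illustration. The sketch assumes that string values never contain a double quote and that the input has no comments. Unlike the grammar, it also accepts empty objects.

```python
import re

# Token patterns mirror the grammar's terminals: string, integer, name, punctuation.
TOKEN_RE = re.compile(r'''
    \s*(?:
        "(?P<string>[^"]*)"
      | (?P<integer>[0-9]+)
      | (?P<name>[A-Z][a-zA-Z0-9.]*)
      | (?P<punct>[={};])
    )''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs; raise ValueError on unexpected input."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError('unexpected input at offset %d' % pos)
        pos = m.end()
        yield m.lastgroup, m.group(m.lastgroup)


def expect(tokens, pos, char):
    """Check that tokens[pos] is the punctuation `char`; return the next position."""
    if pos >= len(tokens) or tokens[pos] != ('punct', char):
        raise ValueError('expected %r at token %d' % (char, pos))
    return pos + 1


def parse_value(tokens, pos):
    """value = object | integer | string ; returns (value, next position)."""
    if pos >= len(tokens):
        raise ValueError('unexpected end of input')
    kind, value = tokens[pos]
    if kind == 'string':
        return value, pos + 1
    if kind == 'integer':
        return int(value), pos + 1
    if (kind, value) == ('punct', '{'):
        records, pos = parse_records(tokens, pos + 1)
        pos = expect(tokens, pos, '}')
        return dict(records), pos
    raise ValueError('unexpected token %r' % (tokens[pos],))


def parse_records(tokens, pos):
    """Parse consecutive `name '=' value ';'` records starting at pos."""
    records = []
    while pos < len(tokens) and tokens[pos][0] == 'name':
        name = tokens[pos][1]
        pos = expect(tokens, pos + 1, '=')
        value, pos = parse_value(tokens, pos)
        pos = expect(tokens, pos, ';')
        records.append((name, value))
    return records, pos


def parse(text):
    """Parse the whole text into a list of (name, value) records."""
    tokens = list(tokenize(text))
    records, pos = parse_records(tokens, 0)
    if pos != len(tokens):
        raise ValueError('unexpected token %r' % (tokens[pos],))
    return records


sample = '''
Student = {
    PInfo = {
        ID = 0001;
        Name.First = "Joe";
        Name.Last = "Burger";
        DOB = "01/01/2000";
    };
    School = "West High";
    Address = {
        Str1 = "001 Main St.";
        Zip = 12345;
    };
};
'''
students = [value for name, value in parse(sample) if name == 'Student']
result = {
    '{} {}'.format(s['PInfo']['Name.First'], s['PInfo']['Name.Last']):
        {'School': s['School'], 'Zip': s['Address']['Zip']}
    for s in students
}
print(result)
```

The top level stays a list of (name, value) pairs rather than a dictionary, because the key Student repeats there; only nested blocks are turned into dictionaries.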

      



It's not JSON, but it is structured similarly, so you should be able to reformat it into JSON:

  • "=" → ":"
  • quote all keys with ""
  • ";" → ","
  • remove all "," followed by "}"
  • put it in curly braces
  • parse it with json.loads
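
The steps above can be sketched with a few regular expressions. This is a rough illustration under stated assumptions, not a robust converter: the helper name to_json_text is made up here, and the substitutions assume that string values never contain "=", ";" or "}". Two details go beyond the list. JSON does not allow integers with leading zeros, so 0001 has to become 1. And json.loads keeps only the last value of a repeated key by default, so the repeated Student key is preserved here by passing object_pairs_hook=list.

```python
import json
import re


def to_json_text(text):
    """Rewrite the block format as JSON text, following the steps listed above.

    Assumption: string values never contain '=', ';' or '}'.
    """
    # Quote each key and turn its '=' into ':' (Name.First = ... -> "Name.First": ...).
    text = re.sub(r'([A-Za-z][\w.]*)\s*=', r'"\1":', text)
    # JSON rejects integers with leading zeros, so 0001 becomes 1.
    text = re.sub(r'":\s*0+(?=\d)', '": ', text)
    # ';' ends a record; JSON separates members with ','.
    text = text.replace(';', ',')
    # Wrap everything in one object, then drop the commas left before each '}'.
    return re.sub(r',\s*}', '}', '{' + text + '}')


sample = '''
Student = {
    PInfo = {
        ID   = 0001;
        Name.First = "Joe";
        Name.Last = "Burger";
        DOB  = "01/01/2000";
    };
    School = "West High";
    Address = {
        Str1 = "001 Main St.";
        Zip = 12345;
    };
};
Student = {
    PInfo = {
        ID   = 0002;
        Name.First = "John";
        Name.Last = "Smith";
        DOB  = "02/02/2002";
    };
    School = "East High";
    Address = {
        Str1 = "001 40nd St.";
        Zip = 12346;
    };
    Club = "Football";
};
'''

# object_pairs_hook=list returns every object as a list of (key, value) pairs,
# so the repeated "Student" key is not collapsed into a single entry.
records = json.loads(to_json_text(sample), object_pairs_hook=list)

result = {}
for name, pairs in records:
    if name != 'Student':
        continue
    student = dict(pairs)
    pinfo = dict(student['PInfo'])
    address = dict(student['Address'])
    full_name = '{} {}'.format(pinfo['Name.First'], pinfo['Name.Last'])
    result[full_name] = {'School': student['School'], 'Zip': address['Zip']}

print(result)
```

Note that stripping leading zeros would also change a zip code such as 01234 into 1234. If that matters, such values are better kept as strings.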


For this kind of thing I use Marpa::R2, a Perl interface to Marpa, a general BNF parser. It lets you describe the text as grammar rules and parse it into a tree of arrays (a parse tree). You can then traverse the tree to save the results as a hash of hashes (a Perl hash is the equivalent of a Python dictionary) or use it as is.

I prepared a working example using your input: parser, result tree.

Hope it helps.

P.S. An ast_traverse() example: Parsing values from a block of text based on certain keys
