I have to read invoice data from a messy ASCII file; how would you protect against future changes?

Question

I need to read ASCII invoice files that are structured in a really convoluted way, like this:

55651108 3090617.10.0806:46:32101639Example Company               Construction Company          Example Road. 9            9524 Example City

There is actually more in there, but I don't want to confuse you any further.

I know that I am doomed if the client cannot provide a better structure. For example, 30906 is a sequence number that keeps increasing, and 101639 is the customer ID. The run of spaces between Example Company and Construction Company varies in length. The Example Company field itself can also contain variable-length runs of spaces, as in Microsoft Corporation Redmond. The same goes for the other fields. So there is no clear way to extract the data from the last part of the line.

But that is not the question; I got carried away. My question is the following:

If the input were somewhat structured and well defined, how would you protect against future changes to its structure? How would you design and implement a reader?

I was thinking about using a simple EAV model in my database, together with text or XML templates that describe the input data, the entity names and their value types. I would then parse the invoice files according to those templates.

parsing ascii

kitsune Dec 16 '08 at 13:36

4 answers

I think the template describing entity names and value types is a good idea. It is something like a "schema" for a text file.
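A minimal sketch of such a schema, assuming the layout is kept as plain data (JSON here) outside the parsing code. The field names are hypothetical and the widths are read off the sample line in the question, so treat them as guesses rather than a confirmed specification.

```python
import json

# Hypothetical layout description. In practice this would live in its own
# file, so that a format change means editing data rather than code.
SCHEMA_JSON = """
[
  {"name": "record_id",   "width": 8},
  {"name": "filler",      "width": 1},
  {"name": "sequence_no", "width": 5},
  {"name": "date",        "width": 8},
  {"name": "time",        "width": 8},
  {"name": "customer_id", "width": 6}
]
"""

def parse_line(line, schema):
    """Cut one fixed-width line into a dict, walking the schema left to right."""
    values = {}
    offset = 0
    for field in schema:
        end = offset + field["width"]
        values[field["name"]] = line[offset:end].strip()
        offset = end
    values.pop("filler", None)  # filler columns carry no data
    return values

schema = json.loads(SCHEMA_JSON)
sample = "55651108 3090617.10.0806:46:32101639Example Company"
record = parse_line(sample, schema)
print(record)
```

If the supplier later widens a column or inserts a new one, only the schema entries change; `parse_line` stays as it is.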

What I would try to do is separate the reader from the rest of the application as much as possible. So the real question is how to define an interface that can accommodate changes in the parameter list. This may not always be possible, but if the rest of the system relies only on an interface to read the data, you can change the reader implementation without affecting it.
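One way to picture that separation, as a rough sketch: the application talks to an abstract reader, and the fixed-width details sit in one implementation behind it. The class names, the function name and the column positions below are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod

class InvoiceReader(ABC):
    """Hypothetical interface: the rest of the application sees only this."""

    @abstractmethod
    def read(self, lines):
        """Yield one dict per invoice line."""

class FixedWidthInvoiceReader(InvoiceReader):
    """One possible implementation for the current fixed-width layout."""

    # (name, start, end) column slices, assumed from the sample line.
    LAYOUT = [("sequence_no", 9, 14), ("customer_id", 30, 36)]

    def read(self, lines):
        for line in lines:
            yield {name: line[start:end].strip()
                   for name, start, end in self.LAYOUT}

def import_invoices(reader, lines):
    """Application code depends on the interface, not on the file format."""
    return [row["customer_id"] for row in reader.read(lines)]

sample = ["55651108 3090617.10.0806:46:32101639Example Company"]
customer_ids = import_invoices(FixedWidthInvoiceReader(), sample)
print(customer_ids)
```

If the client ever delivers CSV or XML instead, a second `InvoiceReader` subclass could be swapped in without touching `import_invoices`.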

kgiannakakis Dec 16 '08 at 13:44

Well, your file format is similar to the French Etebac protocol used between banks and their clients.

It is a fixed-width text format.

The best you can do is use some kind of unpack function:

$ perl -MData::Dumper -e 'print Dumper(unpack("A8 x A5 A8 A8 A6 A30 A30", "55651108 3090617.10.0806:46:32101639Example Company               Construction Company          Example Road. 9            9524 Example City"))'
$VAR1 = '55651108';
$VAR2 = '30906';
$VAR3 = '17.10.08';
$VAR4 = '06:46:32';
$VAR5 = '101639';
$VAR6 = 'Example Company';
$VAR7 = 'Construction Company';

What you then have to do for each field is check that it looks the way it should, e.g. XX.XX.XX or YY:YY:YY, or that it doesn't start with a space, and abort if it doesn't.
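The same unpack-then-check idea could look roughly like this in Python, using the standard `struct` module in place of Perl's `unpack` and one regular expression per field. The patterns are illustrative assumptions about what each field should contain, not rules confirmed by the file's owner.

```python
import re
import struct

# Same layout as the Perl template: 8 chars, skip 1, then 5, 8, 8, 6, 30, 30.
LAYOUT = struct.Struct("8s x 5s 8s 8s 6s 30s 30s")

# One pattern per field, in order (assumed rules, for illustration only).
PATTERNS = [
    r"\d{8}",             # leading number
    r"\d{5}",             # sequence number
    r"\d\d\.\d\d\.\d\d",  # date, XX.XX.XX
    r"\d\d:\d\d:\d\d",    # time, YY:YY:YY
    r"\d{6}",             # customer id
    r"\S.*",              # text field must not start with a space
    r"\S.*",              # text field must not start with a space
]

def unpack_and_check(line):
    """Split a line into its fields and raise ValueError on the first bad one."""
    raw = line.encode("ascii")
    if len(raw) < LAYOUT.size:
        raise ValueError("line too short: %d bytes" % len(raw))
    fields = [part.decode("ascii").rstrip()
              for part in LAYOUT.unpack(raw[:LAYOUT.size])]
    for position, (value, pattern) in enumerate(zip(fields, PATTERNS), start=1):
        if not re.fullmatch(pattern, value):
            raise ValueError("field %d looks wrong: %r" % (position, value))
    return fields

# Sample line rebuilt with explicit padding so the column widths are visible.
sample = ("55651108 3090617.10.0806:46:32101639"
          + "Example Company".ljust(30)
          + "Construction Company".ljust(30)
          + "Example Road. 9")
print(unpack_and_check(sample))
```

Failing loudly on the first field that does not match means a silent layout change shows up as an error instead of as shifted data in the database.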

mat Dec 16 '08 at 13:46

I would have a database of invoices, with tables such as Company, Invoices and Invoice_Items. Depending on the complexity, do you want to record orders as well and then link invoices to orders, and so on? But I digress ...

I would have an in-memory model of the database model, but that is a given. If XML output and input were needed, I would add an XML serialization of the model for supplying invoices as data elsewhere, and a SAX parser to read it back in. Some APIs make this trivial, or maybe you just want to expose a web service on your repository if you have clients reading from you.

As for reading in the text files, there is little information about them. Why would they change? Where do they come from? Are you replacing that system, or will it keep running while you are just a new backend that it feeds? You say the number of spaces is variable: is that just because the format uses fixed-width columns? I would create a reader that reads the files into your model, and from there into your database schema.

JeeBee Dec 16 '08 at 13:49

S.Lott · Accepted Answer · 2008-12-16T14:05:09+0000

"If the input were somewhat structured and well defined, how would you protect against future changes to its structure? How would you design and implement the reader?"

You must define the layout in a way that lets you flexibly pick it apart.

Here's a Python version:

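class Field( object ):
    """One fixed-width column: a name and a width in characters."""
    def __init__( self, name, size ):
        self.name= name
        self.size= size
        self.offset= None  # filled in by Record

class Record( object ):
    """An ordered set of fields that together describe one line."""
    def __init__( self, *fields ):
        # Accept the fields as positional arguments: Record(Field(...), Field(...))
        self.fields= fields
        self.fieldMap= {}
        offset= 0
        for f in self.fields:
            f.offset= offset  # each field starts where the previous one ended
            offset += f.size
            self.fieldMap[f.name]= f
    def parse( self, aLine ):
        self.buffer= aLine
    def get( self, aField ):
        fld= self.fieldMap[aField]
        # The end of a slice is exclusive, so no +1 is needed here.
        return self.buffer[ fld.offset:fld.offset+fld.size ]
    def __getattr__( self, aField ):
        # Only called when normal lookup fails, so myRecord.aField works too.
        try:
            return self.get(aField)
        except KeyError:
            raise AttributeError(aField)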

Now you can define records:

myRecord= Record( 
    Field('aField',8), 
    Field('filler',1), 
    Field('another',5),
    Field('somethingElse',8),
)

This gives you a fighting chance of picking apart the input in a flexible way.

myRecord.parse(input)
myRecord.get('aField')

Once you can parse it, adding conversions is a matter of subclassing Field to define the various types (dates, amounts, etc.).
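A rough sketch of what such subclasses might look like. It is self-contained, so it redefines a small Field with a `convert` hook, which is an addition for illustration and not part of the answer's code. The DD.MM.YY date format is an assumption based on the sample value 17.10.08, and the amount column is made up, since the question's sample line does not show one.

```python
from datetime import date, datetime
from decimal import Decimal

class Field(object):
    """Fixed-width field that returns its text with the padding stripped."""
    def __init__(self, name, size):
        self.name = name
        self.size = size

    def convert(self, text):
        return text.strip()

class DateField(Field):
    """Assumes DD.MM.YY, as the sample value '17.10.08' suggests."""
    def convert(self, text):
        return datetime.strptime(text.strip(), "%d.%m.%y").date()

class AmountField(Field):
    """Assumes a plain decimal amount padded with spaces."""
    def convert(self, text):
        return Decimal(text.strip())

def parse(line, fields):
    """Walk the fields left to right and convert each slice."""
    values, offset = {}, 0
    for f in fields:
        values[f.name] = f.convert(line[offset:offset + f.size])
        offset += f.size
    return values

# Hypothetical layout and line, built with explicit padding.
layout = [Field("id", 8), Field("filler", 1),
          DateField("issued", 8), AmountField("total", 10)]
line = "55651108" + " " + "17.10.08" + "123.45".rjust(10)
row = parse(line, layout)
print(row)
```

Keeping the conversion inside the field class means a new type (a time, a currency code) is one more subclass, and the record layout stays a flat list of fields.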
