Pyparsing a field that may or may not contain a value

I have a dataset that contains lines like the following:

Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:

The problem I am facing is that I cannot figure out how to properly capture the value of the "Capture MICR - Serial" field. This field can be empty or contain an alphanumeric value of varying length (I have the same problem with other fields that can be filled or empty).

I've tried some variations of the following, but I still can't get it to work.

pp.Literal("Capture MICR - Serial:") + pp.White(" ", min=1, max=0) + (pp.Word(pp.printables) ^ pp.White(" ", min=1, max=0))("crd_micr_serial") + pp.FollowedBy(pp.Literal("Pos44:"))

I think part of the problem is that Or (the ^ operator) picks the longest match, which in this case might be a long run of spaces rather than a single alphanumeric value, but I still want to capture one value.

Thanks for the help.

2 answers


The easiest way to parse text like "A: valueA B: valueB C: valueC" is to use the pyparsing SkipTo class:

from pyparsing import SkipTo, LineEnd

a_expr = "A:" + SkipTo("B:")
b_expr = "B:" + SkipTo("C:")
c_expr = "C:" + SkipTo(LineEnd())
line_parser = a_expr + b_expr + c_expr

      

I'd like to extend this a bit further:

  • add a parse action to strip off leading and trailing spaces

  • add a results name so it is easy to get at the values after parsing the string

This is what this simple parser looks like:

from pyparsing import SkipTo, LineEnd

NL = LineEnd()
a_expr = "A:" + SkipTo("B:").addParseAction(lambda t: [t[0].strip()])('A')
b_expr = "B:" + SkipTo("C:").addParseAction(lambda t: [t[0].strip()])('B')
c_expr = "C:" + SkipTo(NL).addParseAction(lambda t: [t[0].strip()])('C')
line_parser = a_expr + b_expr + c_expr

line_parser.runTests("""
    A: 100 B: Fred C:
    A:  B: a value with spaces C: 42
""")

      



gives:

 A: 100 B: Fred C:
['A:', '100', 'B:', 'Fred', 'C:', '']
- A: '100'
- B: 'Fred'
- C: ''


A:  B: a value with spaces C: 42
['A:', '', 'B:', 'a value with spaces', 'C:', '42']
- A: ''
- B: 'a value with spaces'
- C: '42'
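
Outside of runTests, the results names are what make the parsed values easy to pick up in your own code. The following is a minimal sketch of that, rebuilding the same three-field parser; the strip_value name is only an illustrative shorthand for the lambda used above.

```python
from pyparsing import SkipTo, LineEnd

# illustrative shorthand for the strip() parse action shown above
strip_value = lambda t: [t[0].strip()]

a_expr = "A:" + SkipTo("B:").addParseAction(strip_value)("A")
b_expr = "B:" + SkipTo("C:").addParseAction(strip_value)("B")
c_expr = "C:" + SkipTo(LineEnd()).addParseAction(strip_value)("C")
line_parser = a_expr + b_expr + c_expr

result = line_parser.parseString("A:  B: a value with spaces C: 42")

# each field is available by its results name, empty or not
a_value = result["A"]
b_value = result["B"]
c_value = result["C"]
print(repr(a_value), repr(b_value), repr(c_value))
```

An empty field simply comes back as an empty string, so the calling code does not need a separate "field missing" branch.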

      

I try to avoid copy/paste code when I can, and would rather automate the "A is followed by B" and "C is followed by end of line" relationships with a list describing the different prompt strings, then walk that list to build each sub-expression:

import pyparsing as pp

def make_prompt_expr(s):
    '''Define the expression for prompts as 'ABC:' '''
    return pp.Combine(pp.Literal(s) + ':')

def make_field_value_expr(next_expr):
    '''Define the expression for the field value as SkipTo(what comes next)'''
    return pp.SkipTo(next_expr).addParseAction(lambda t: [t[0].strip()])

def make_name(s):
    '''Convert prompt string to identifier form for results names'''
    return ''.join(s.split()).replace('-','_')

# use split to easily define list of prompts in order - makes it easy to update later if new prompts are added
prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split('/')

# keep a list of all the prompt-value expressions
exprs = []

# get a list of this-prompt, next-prompt pairs
for this_, next_ in zip(prompts, prompts[1:]  + [None]):
    field_name = make_name(this_)
    if next_ is not None:
        next_expr = make_prompt_expr(next_)
    else:
        next_expr = pp.LineEnd()

    # define the prompt-value expression for the current prompt string and add to exprs
    this_expr = make_prompt_expr(this_) + make_field_value_expr(next_expr)(field_name)
    exprs.append(this_expr)

# define a line parser as the And of all of the generated exprs
line_parser = pp.And(exprs)

line_parser.runTests("""\
Capture MICR - Serial:                  Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4:          Split:
Capture MICR - Serial:  1729XYZ                Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4: XXL         Split: 50
""")

      

gives:

Capture MICR - Serial:                  Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4:          Split:
['Capture MICR - Serial:', '', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', '', 'Split:', '']
- Acct: ''
- CaptureMICR_Serial: ''
- Opt4: ''
- Pos44: ''
- Split: ''
- Tc: '2064'
- Trrt: '32904'


Capture MICR - Serial:  1729XYZ                Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4: XXL         Split: 50
['Capture MICR - Serial:', '1729XYZ', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', 'XXL', 'Split:', '50']
- Acct: ''
- CaptureMICR_Serial: '1729XYZ'
- Opt4: 'XXL'
- Pos44: ''
- Split: '50'
- Tc: '2064'
- Trrt: '32904'
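
If the goal is to feed the parsed fields into other code, it may help to wrap the loop above in a function and turn each parsed line into a plain dict. This is only a sketch of one way to do that: build_line_parser is a hypothetical helper name, and it uses a single Literal for each prompt where the code above uses Combine.

```python
import pyparsing as pp

def build_line_parser(prompts):
    """Hypothetical helper: chain 'prompt:' + SkipTo(next prompt) for each prompt."""
    exprs = []
    for this_, next_ in zip(prompts, prompts[1:] + [None]):
        # a value runs up to the next prompt, or to end of line for the last prompt
        stop = pp.Literal(next_ + ":") if next_ is not None else pp.LineEnd()
        value = pp.SkipTo(stop).addParseAction(lambda t: [t[0].strip()])
        # same naming rule as make_name above: drop spaces, '-' becomes '_'
        name = "".join(this_.split()).replace("-", "_")
        exprs.append(pp.Literal(this_ + ":") + value(name))
    return pp.And(exprs)

prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split("/")
line_parser = build_line_parser(prompts)

record = ("Capture MICR - Serial:  1729XYZ                Pos44:  Trrt: 32904  "
          "Acct:        Tc:   2064        Opt4: XXL         Split: 50")
result = line_parser.parseString(record)

# collect the named fields into an ordinary dict
fields = {name: result[name] for name in result.keys()}
print(fields)
```

Keeping the prompt list as the single source of truth means a new field only needs a new entry in the prompts string.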

      


Does this do what you want?

I only used Combine so that both sides of the Or give similar results, i.e. with "Pos44:" at the end of the result string, where it can be split off. I'm not happy about having to resort to a regex.



>>> import pyparsing as pp
>>> record_A = 'Capture MICR - Serial:                  Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4:          Split:'
>>> record_B = 'Capture MICR - Serial: 76ZXP67            Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4:          Split:'
>>> parser_fragment = pp.Combine(pp.White()+pp.Literal('Pos44:'))
>>> parser = pp.Literal('Capture MICR - Serial:')+pp.Or([parser_fragment,pp.Regex(r'.*?(?:Pos44:)')])
>>> parser.parseString(record_A)
(['Capture MICR - Serial:', '                  Pos44:'], {})
>>> parser.parseString(record_B)
(['Capture MICR - Serial:', '76ZXP67            Pos44:'], {})
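
To get from that token to the field value itself, the trailing "Pos44:" still has to be split off. A minimal sketch of that last step follows; serial_value is a hypothetical helper name, not part of pyparsing.

```python
import pyparsing as pp

parser_fragment = pp.Combine(pp.White() + pp.Literal("Pos44:"))
parser = pp.Literal("Capture MICR - Serial:") + pp.Or(
    [parser_fragment, pp.Regex(r".*?(?:Pos44:)")]
)

def serial_value(record):
    """Hypothetical helper: return the serial field with the 'Pos44:' suffix removed."""
    tokens = parser.parseString(record)
    raw = tokens[1]  # e.g. '76ZXP67            Pos44:'
    return raw[: -len("Pos44:")].strip()

record_A = "Capture MICR - Serial:                  Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4:          Split:"
record_B = "Capture MICR - Serial: 76ZXP67            Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4:          Split:"

value_A = serial_value(record_A)
value_B = serial_value(record_B)
print(repr(value_A), repr(value_B))
```

Because the helper strips whitespace after removing the suffix, it returns an empty string for an empty field regardless of which branch of the Or matched.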

      
