Discard and Omit unstructured text with Perl Marpa?

Question

I am using Marpa::R2::Scanless::G to parse a legacy text file format. The file has a well-structured section at the top, followed by a poorly structured mess of text and uuencoded data. That trailing material can be ignored completely, but I cannot figure out how to tell the Marpa SLIF interface: you're done now; don't worry about the leftover text.

In very simplified terms, the file might look like this:

("field_a_val"  1,
 "field_b_vals" (1,2,3),
 "field_c_pairs" ((a 1)(b 2)(c 3))
)now_stuff_i_dont_care_about a;oiwermnv;alwfja;sldfa
asdf343avadfg;okm;om;oia3
e{<|1ydblV, HYED c"L. 78b."8
U=nK Wpw: Qh(e x!,~dU...

I have all the data I need parsing correctly from the top of the file, but when the parser hits the junk at the bottom and I don't try to match it, I get: Error in SLIF parsing: Parse exhausted but lexemes remain.

I can't figure out how to write a rule that says: skip over potentially megabytes of junk and just keep going to the end of the file, regardless of the text found. I have had no luck with my attempts to use :discard or "pause => after", although I am probably using them incorrectly.

For context, I don't have a clear understanding of parsing and lexing. I just banged on the grammar until it worked.

+3

perl marpa

rjt_jr 12 Sep 14 at 4:33

2 answers

There was once a discussion of a similar topic on the marpa-parser mailing list, but the code examples seem to be gone from there, so I would suggest the working example from my answer to another SO question.

I'm not sure this is the right way to do this kind of thing in Marpa, and I haven't tested it on input that runs to a few megabytes.

Hope it helps.

+1

rns 12 Sep 14 at 6:43

amon · Accepted Answer · 2014-09-12T09:18:06+0000

The simplest thing would be to introduce a token that matches everything else that you are not interested in:

lexeme default = latm => 1  # this prevents the rest from matching the whole document

Header
  ::= ActualHeader (AllTheRest) action => ::first
ActualHeader
  ::= ... # your code here
...

AllTheRest
  ::=           action => ::undef  # rest is optional
AllTheRest
  ::= THE_REST  action => ::undef  # matches anything
THE_REST ~ [\s\S]+

We cannot use a :discard rule for THE_REST, because that would allow the rest to occur anywhere, but we only want to allow it at the end. The character class [\s\S] matches all characters.
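For illustration, here is a minimal, self-contained sketch of the same idea applied to a cut-down version of the sample input. The header grammar (Pairs, Pair, name, number) and the input string are assumptions made up for this example, not the asker's real format; only the AllTheRest / THE_REST part follows the grammar above.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical header grammar: a parenthesised list of (name number) pairs.
# Only AllTheRest / THE_REST is the technique described above.
my $dsl = <<'END_OF_DSL';
:default ::= action => ::first
lexeme default = latm => 1

Document   ::= Header (AllTheRest)
Header     ::= ('(') Pairs (')')
Pairs      ::= Pair+                        action => ::array
Pair       ::= ('(') name number (')')      action => ::array

AllTheRest ::=           action => ::undef  # the rest is optional
AllTheRest ::= THE_REST  action => ::undef  # swallow everything up to EOF

name       ~ [a-zA-Z_]+
number     ~ [\d]+
THE_REST   ~ [\s\S]+

:discard   ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

# Made-up input: a structured header followed by junk we want to ignore.
my $input = <<'END_OF_INPUT';
((a 1)(b 2)(c 3)
)now_stuff_i_dont_care_about a;oiwermnv;alwfja;sldfa
asdf343avadfg;okm;om;oia3
END_OF_INPUT

my $grammar = Marpa::R2::Scanless::G->new( { source  => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

# read() throws an exception if the input cannot be parsed.
$recce->read( \$input );

# value() returns a reference to the parse result, or undef if there is none.
my $value_ref = $recce->value;
die "No parse\n" unless defined $value_ref;

for my $pair ( @{ ${$value_ref} } ) {
    print join( ' => ', @{$pair} ), "\n";
}
```

Because of latm => 1, THE_REST is only tried where the grammar expects it, which is after the closing parenthesis of the header, so it cannot swallow the structured part. I have only tried to reason this through on small input, so treat the behaviour on multi-megabyte files as untested.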
