Continuing lex after an error occurs

Question


I am taking a course on compilers at my university, and I chose to do the project in Haskell + Parsec. The lexer and parser must be separate. I am using Parsec to convert a string to a token list, which is then passed to another Parsec parser that converts the token list to an AST.

The problem is that the lexer has to keep lexing even after an error. To try to do this, I added a constructor representing an "unexpected token" to my Token datatype, and I tried sprinkling my code with <|> unexpected to generate this token on error. It's a lot of boilerplate, and it can also be hard to know where to place it.

My preferred solution would be for Parsec to do this automatically somehow: whenever there is a ParseError, produce an unexpected token at that position and continue parsing one position later. How should I do it?

Here is the code I have now: http://lpaste.net/8144414997276000256 For some reason, I still get a parse error, even though the unexpected token should catch the unhandled cases.


haskell parsec

Diony rosa 12 Sep 14 at 17:26


1 answer

Tikhon jelvis · Accepted Answer · 2014-09-12T17:55:37+0000

It seems like you should be able to get away with one additional unexpected' term. I am assuming you have a token' parser that looks something like this:

token' =  number
      <|> identifier
      <|> ...

I would probably have each token parser (number, identifier, etc.) manage its own trailing whitespace:

number :: Parser Token
number = Number . read <$> many1 digit <* spaces

Why don't you add an extra unexpected' term as a catch-all at the end of this?

token' =  number
      <|> identifier
      <|> ...
      <|> unexpected'

The unexpected' parser just consumes a single character. You could even include that character in the value to improve error messages. Then, when you use token' to build a list, you get an Unexpected value for each character that your lexer doesn't know what to do with.

unexpected' :: Parser Token
unexpected' = Unexpected <$ anyChar
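As a rough sketch of the "include the character in the value" idea, the catch-all could also record where the character was found, which is what the question asks for. The Token type below is hypothetical (the answer never shows its definition), and carrying a SourcePos next to the Char is an assumption, not something the answer prescribes.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical token type: Unexpected records where the offending
-- character was found and which character it was.
data Token
  = Number Integer
  | Identifier String
  | Unexpected SourcePos Char
  deriving (Show, Eq)

-- Read the current position first, then consume exactly one character.
unexpected' :: Parser Token
unexpected' = Unexpected <$> getPosition <*> anyChar
```

A later error-reporting pass can then use sourceLine and sourceColumn on the stored position instead of re-deriving it from the input.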

Finally, the whole lexer is simply many token'. In my tests, this works fine with invalid characters in the middle.

*Main> parse (many token') "<foo>" "1 2 abc ~ ~def"
Right [Number 1,Number 2,Identifier "abc",Unexpected,Unexpected,Unexpected,Identifier "def"]
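To reproduce a session like this, the pieces can be assembled into one module and loaded in GHCi. This is only a sketch: the Token type and the identifier definition (one or more letters) are assumptions, since the answer does not show them at this point, and the name lexer is made up for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed token type; the answer never shows its definition.
data Token
  = Number Integer
  | Identifier String
  | Unexpected
  deriving (Show, Eq)

-- Each token parser consumes its own trailing whitespace.
number :: Parser Token
number = Number . read <$> many1 digit <* spaces

-- Assumed definition: one or more letters.
identifier :: Parser Token
identifier = Identifier <$> many1 letter <* spaces

-- Catch-all: consume exactly one character of anything else.
unexpected' :: Parser Token
unexpected' = Unexpected <$ anyChar

token' :: Parser Token
token' = number <|> identifier <|> unexpected'

-- The whole lexer (hypothetical name). It stops at end of input,
-- because anyChar fails there without consuming anything.
lexer :: Parser [Token]
lexer = many token'
```

Note that the space between the two ~ characters also becomes an Unexpected token, because unexpected' does not skip trailing whitespace. That is why the output contains three Unexpected values rather than two.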

Note that Parsec does not backtrack by default. This means that if a parser fails partway through a token, it will not backtrack and try unexpected': instead, you will get an error. To enable backtracking, you must use try on the parser that might fail. For example, if identifier requires at least two characters:

identifier :: Parser Token
identifier = Identifier <$> liftA2 (:) letter (many1 alphaNum) <* spaces

Then it can fail partway through without backtracking. But if you wrap it in try, it should work:

token' =  number
      <|> try identifier
      <|> ...
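The difference can be seen by putting both variants side by side. This is a sketch under assumptions: the Token type is hypothetical, and the names tokenNoTry and tokenWithTry are made up for the comparison.

```haskell
import Control.Applicative (liftA2)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed token type; the answer never shows its definition.
data Token = Number Integer | Identifier String | Unexpected
  deriving (Show, Eq)

number :: Parser Token
number = Number . read <$> many1 digit <* spaces

-- Requires at least two characters, so it can fail after consuming one.
identifier :: Parser Token
identifier = Identifier <$> liftA2 (:) letter (many1 alphaNum) <* spaces

unexpected' :: Parser Token
unexpected' = Unexpected <$ anyChar

-- Without try: a half-consumed identifier makes the whole choice fail.
tokenNoTry :: Parser Token
tokenNoTry = number <|> identifier <|> unexpected'

-- With try: the failed identifier is rewound, so unexpected' gets its turn.
tokenWithTry :: Parser Token
tokenWithTry = number <|> try identifier <|> unexpected'
```

In GHCi, parse (many tokenNoTry) "" "a 1" gives a Left, because identifier consumes the a before failing on the space. parse (many tokenWithTry) "" "a 1" gives Right [Unexpected,Unexpected,Number 1].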

The problem with try is that it can slow down your code if you're not careful. However, if you don't mind the slowdown, you can get away with simply adding try everywhere and backtracking a lot!
