Parsing multiple lines into a list of lists in Haskell
I am trying to parse a file that looks like this:
a b c
f e d
I want to match each of the characters in a string and parse everything into a list of lists, for example:
[[A, B, C], [D, E, F]]
To do this, I tried the following:
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as P
parserP :: Parser [[MyType]]
parserP = do
  x <- rowP
  xs <- many (newline >> rowP)
  return (x : xs)
rowP :: Parser [MyType]
rowP = manyTill cellP $ void newline <|> eof
cellP :: Parser MyType
cellP = aP <|> bP <|> ... -- rest of the parsers, they all look very similar
aP :: Parser MyType
aP = symbol "a" >> return A
bP :: Parser MyType
bP = symbol "b" >> return B
lexer = P.makeTokenParser emptyDef
symbol = P.symbol lexer
But this never returns multiple inner lists. Instead, I get:
[[A, B, C, D, E, F]]
What am I doing wrong? I expected manyTill to parse cellP up to the newline, but that is not what happens.
You are correct that manyTill keeps parsing until it reaches a newline. But manyTill never gets to see the newline, because cellP is too eager. cellP ends with a call to P.symbol, whose documentation says:
symbol :: String -> ParsecT s u m String
Lexeme parser symbol s parses string s and skips trailing white space.
The key phrase there is "white space". It turns out that Parsec defines whitespace as any character that satisfies isSpace, which includes newlines. So P.symbol happily consumes the c, followed by the space and the newline, and then manyTill looks for a newline and doesn't see one because it has already been consumed.
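The effect can be checked directly with a small sketch. The helper names restAfterSymbol and restAfterString are made up for this illustration; they run one parser and then return whatever input is left over, using getInput.

```haskell
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language (emptyDef)
import qualified Text.ParserCombinators.Parsec.Token as P

lexer :: P.TokenParser ()
lexer = P.makeTokenParser emptyDef

-- Hypothetical helper: parse "c" with P.symbol, then return the remaining input.
restAfterSymbol :: String -> Either ParseError String
restAfterSymbol = parse (P.symbol lexer "c" >> getInput) "demo"

-- Hypothetical helper: the same with plain `string`, which skips nothing afterwards.
restAfterString :: String -> Either ParseError String
restAfterString = parse (string "c" >> getInput) "demo"
```

With the input "c\nf e d", restAfterSymbol leaves "f e d" (the newline is gone), while restAfterString leaves "\nf e d", so a following newline parser would still succeed.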
If you are willing to drop Parsec altogether, go with Benjamin's solution. But if you are set on sticking with it, the basic idea is that you want to change the whiteSpace field of the lexer so that whitespace no longer includes newlines. Something like
lexer = let lexer0 = P.makeTokenParser emptyDef
        in lexer0 { P.whiteSpace = void $ many (oneOf " \t") }
This is pseudocode and probably won't work for your specific case, but the idea is there. You want to change the definition of whiteSpace to whatever you want to count as whitespace, not what the system defines by default. Note that changing this will also break your comment syntax, if you have one defined, since whiteSpace was previously equipped to handle comments.
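One reason the record update may not be enough: the symbol returned by makeTokenParser is built from the library's own whiteSpace, so replacing the field afterwards does not change what symbol skips. A minimal sketch of an alternative, assuming it is acceptable to bypass the token module, is to write a small symbol-like helper by hand. The names inlineSpace, symbol' and parseRows are hypothetical, MyType is a stand-in with six constructors, and the row structure uses many1 and sepEndBy rather than the question's manyTill.

```haskell
import Text.ParserCombinators.Parsec

-- Stand-in for the question's type (assumed constructors).
data MyType = A | B | C | D | E | F
  deriving (Show, Eq)

-- Skip spaces and tabs only, so newlines stay in the input.
inlineSpace :: Parser ()
inlineSpace = skipMany (oneOf " \t")

-- Hypothetical replacement for P.symbol that never consumes a newline.
symbol' :: String -> Parser String
symbol' s = try (string s) <* inlineSpace

cellP :: Parser MyType
cellP =  (A <$ symbol' "a") <|> (B <$ symbol' "b") <|> (C <$ symbol' "c")
     <|> (D <$ symbol' "d") <|> (E <$ symbol' "e") <|> (F <$ symbol' "f")

-- A row is one or more cells; it stops at the first newline or at end of input.
rowP :: Parser [MyType]
rowP = many1 cellP

-- Rows are separated by newlines; a trailing newline is allowed.
parserP :: Parser [[MyType]]
parserP = rowP `sepEndBy` newline <* eof

parseRows :: String -> Either ParseError [[MyType]]
parseRows = parse parserP "input"
```

For the input "a b c\nf e d\n", parseRows returns Right [[A,B,C],[F,E,D]]. As noted above, this helper knows nothing about comments.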
In short, Benjamin's answer is probably the best way to go. There is no real reason to use Parsec here. But it is still useful to know why this particular solution didn't work: Parsec's default language definition was not designed to treat newlines as significant.
You can use
lines :: String -> [String]
and
words :: String -> [String]
to split the input and then map the individual tokens to MyType values.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType _ = Nothing
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . fmap words . lines
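A self-contained version of this approach might look like the sketch below. It assumes MyType has six constructors A through F and extends toMyType to cover all of them; the question does not show the type, so these are stand-ins.

```haskell
-- Stand-in for the question's type (assumed constructors).
data MyType = A | B | C | D | E | F
  deriving (Show, Eq)

-- Map a single token to a value; unknown tokens give Nothing.
toMyType :: String -> Maybe MyType
toMyType "a" = Just A
toMyType "b" = Just B
toMyType "c" = Just C
toMyType "d" = Just D
toMyType "e" = Just E
toMyType "f" = Just F
toMyType _   = Nothing

-- Split into lines, split each line into words, and convert every word.
-- The nested traverse makes the whole result Nothing if any token is unknown.
parseMyType :: String -> Maybe [[MyType]]
parseMyType = traverse (traverse toMyType) . map words . lines
```

With this, parseMyType "a b c\nf e d\n" gives Just [[A,B,C],[F,E,D]], and any unrecognised token, as in parseMyType "a x c", gives Nothing. This is also why the three-case toMyType above would reject the sample file until the remaining letters are added.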