Analyzing chemical compounds in Haskell
I tried to do the parser chemistry as an exercise for myself, but I am stuck.
Here is the datatype I'm trying to use:
data Compound = Monoatomic String Int | Poliatomic [Compound] Int
For a line like "Ca (OH) 2" I want to get something like:
Poliatomic [Monoatomic "Ca" 1, Poliatomic [Monoatomic "O" 1, Monoatomic "H" 1] 2 ] 1
Monatomic type constructor for single atoms and Poliatomic type constructor for several atoms. In this example, (OH) 2 represents both an internal polyatomic structure and is represented as
Poliatomic [(Monoatomic O 1), (Monoatomic H 1 )] 2
. The number
means we have two of these polyatomic structures.
I did it a lot,
import Data.Char (isUpper) data Compound = Monoatomic String Int | Poliatomic [Compound] Int instance Functor Compound where fmap f (Monoatomic s i) = Monoatomic (f s) i fmap f (Poliatomic xs i) = Poliatomic (fmap f xs) i Change number of a compound changeNumber :: Compound -> Int -> Compound changeNumber (Monoatomic xs _) n = Monoatomic xs n changeNumber (Poliatomic xs _) n = Poliatomic xs n Take a partial compound and next chracter return partial compound parseCompound :: Compound -> Char -> Compound parseCompound (Poliatomic x:xs n) c isUpper c = Poliatomic ((Monoatomic [c] 1):x:xs) n -- add new atom to compound isLower c = Poliatomic I want to do foldl parseCompound (Poliatomic  1) inputstring
but then it became too difficult for me to continue.
It looks like it should be a pretty simple problem, but I'm very new to Haskell and can't figure out how to accomplish this function.
I have the following questions:
- Is my approach fixed so far?
- How can I make this work?
source to share
I created the parser you are looking for with Parsec to give you an idea of what parser parsers look like since you stated you have little experience with it.
Even with a little Haskell experience, it should be fairly readable. I have provided some comments on those parts where there is something special to look for.
import Text.Read (readMaybe) import Data.Maybe (fromMaybe) import Text.Parsec (parse, many, many1, digit, char, string, (<|>), choice, try) import Text.Parsec.String (Parser) data Compound = Monoatomic String Int | Poliatomic [Compound] Int deriving Show -- Run the substance parser on "Ca(OH)2" and print the result which is -- Right (Poliatomic [Monoatomic "Ca" 1,Poliatomic [Monoatomic "O" 1,Monoatomic "H" 1] 2] 1) main = print (parse substance "" "Ca(OH)2") -- parse the many parts which make out the top-level polyatomic compound -- -- "many1" means "at least one" substance :: Parser Compound substance = do topLevel <- many1 part return (Poliatomic topLevel 1) -- a single part in a substance is either a poliatomic compound or a monoatomic compound part :: Parser Compound part = poliatomic <|> monoatomic -- a poliatomic compound starts with a '(', then has many parts inside, then -- ends with ')' and has a number after it which indicates how many of it there -- are. poliatomic :: Parser Compound poliatomic = do char '(' inner <- many1 part char ')' amount <- many1 digit return (Poliatomic inner (read amount)) -- a monoatomic compound is one of the many element names, followed by an -- optional digit. if omitted, the amount defaults to 1. -- -- "try" is a little special, and required in this case. it means "if a parser -- fails, try the next one from where you started, not from where the last one -- failed." -- -- "choice" means "try all parsers in this list, stop when one matches" -- -- "many" means "zero or more" monoatomic :: Parser Compound monoatomic = do name <- choice [try nameParser | nameParser <- atomstrings] amount <- many digit return (Monoatomic name (fromMaybe 1 (readMaybe amount))) -- a list of parser for atom names. it is IMPORTANT that the longest names -- come first. the reason for that is that it makes the parser much simpler to -- write, and it can execute much faster. it common when designing parsers to -- consider things like that when creating them. atomstrings :: [Parser String] atomstrings = map string (words "He Li Be Ne Na Mg Al Ca H B C N O F")
I've tried to write this code in a way that should be at least reasonably accessible to newbies, but it's probably not crystal clear, so I'm happy to answer any questions about it.
The parser above is the one you wanted. However, this is not the one I would write if I had free reins. If I did but wanted to, I would use the fact that
can be represented as
which is a much more homogeneous representation and in turn results in a simpler data structure and parser with fewer templates. The code I would rather write would look like
import Text.Read (readMaybe) import Data.Maybe (fromMaybe) import Control.Applicative ((<$>), (<*>), pure) import Text.Parsec (parse, many, many1, digit, char, string, (<|>), choice, try, between) import Text.Parsec.String (Parser) data Substance = Part [Substance] Int | Atom String deriving Show main = print (parse substance "" "Ca(OH)2") -- Right (Part [Part [Atom "Ca"] 1,Part [Part [Atom "O"] 1,Part [Atom "H"] 1] 2] 1) substance :: Parser Substance substance = Part <$> many1 part <*> pure 1 part :: Parser Substance part = do inner <- polyatomic <|> monoatomic amount <- fromMaybe 1 . readMaybe <$> many digit return (Part inner amount) polyatomic :: Parser [Substance] polyatomic = between (char '(') (char ')') (many1 part) monoatomic :: Parser [Substance] monoatomic = (:) . Atom <$> choice (map (try . string) atomstrings) atomstrings :: [String] atomstrings = words "He Li Be Ne Na Mg Al Ca H B C N O F"
This uses a few "advanced" tricks in Haskell (such as the
), so you might not be interested, OP, but I'm putting it in other people who might be more advanced Haskell users and Parsec learners.
This parser is about half a page as you can see, and that libraries like Parsec make them easy and fun to write parsers!
source to share