Analyzing chemical compounds in Haskell

Question

Analyzing chemical compounds in Haskell

I tried to do the parser chemistry as an exercise for myself, but I am stuck.

Here is the datatype I'm trying to use:

data Compound = Monoatomic String Int | Poliatomic [Compound] Int

For a line like "Ca (OH) 2" I want to get something like:

Poliatomic [Monoatomic "Ca" 1, Poliatomic [Monoatomic "O" 1, Monoatomic "H" 1] 2 ] 1

Monatomic type constructor for single atoms and Poliatomic type constructor for several atoms. In this example, (OH) 2 represents both an internal polyatomic structure and is represented as Poliatomic [(Monoatomic O 1), (Monoatomic H 1 )] 2

. The number 2

means we have two of these polyatomic structures.

I did it a lot,

import Data.Char (isUpper)
data Compound = Monoatomic String Int | Poliatomic [Compound] Int

instance Functor Compound where
        fmap f (Monoatomic s i) = Monoatomic (f s) i
        fmap f (Poliatomic xs i) = Poliatomic (fmap f xs) i

-- Change number of a compound
changeNumber :: Compound -> Int -> Compound
changeNumber (Monoatomic xs _) n = Monoatomic xs n
changeNumber (Poliatomic xs _) n = Poliatomic xs n

-- Take a partial compound and next chracter return partial compound
parseCompound :: Compound -> Char -> Compound
parseCompound (Poliatomic x:xs n) c
        | isUpper c = Poliatomic ((Monoatomic [c] 1):x:xs) n -- add new atom to compound
        | isLower c = Poliatomic 

-- I want to do foldl parseCompound (Poliatomic [] 1) inputstring

but then it became too difficult for me to continue.

It looks like it should be a pretty simple problem, but I'm very new to Haskell and can't figure out how to accomplish this function.

I have the following questions:

Is my approach fixed so far?
How can I make this work?

+3

parsing haskell

yasar Apr 21 15 at 16:50

source to share

1 answer

kqr · Accepted Answer · 2015-04-22T10:50:47+0000

I created the parser you are looking for with Parsec to give you an idea of what parser parsers look like since you stated you have little experience with it.

Even with a little Haskell experience, it should be fairly readable. I have provided some comments on those parts where there is something special to look for.

import Text.Read (readMaybe)
import Data.Maybe (fromMaybe)
import Text.Parsec (parse, many, many1, digit, char, string, (<|>), choice, try)
import Text.Parsec.String (Parser)


data Compound
  = Monoatomic String Int
  | Poliatomic [Compound] Int
  deriving Show


-- Run the substance parser on "Ca(OH)2" and print the result which is
-- Right (Poliatomic [Monoatomic "Ca" 1,Poliatomic [Monoatomic "O" 1,Monoatomic "H" 1] 2] 1)
main = print (parse substance "" "Ca(OH)2")


-- parse the many parts which make out the top-level polyatomic compound
--
-- "many1" means "at least one"
substance :: Parser Compound
substance = do
  topLevel <- many1 part
  return (Poliatomic topLevel 1)


-- a single part in a substance is either a poliatomic compound or a monoatomic compound
part :: Parser Compound
part = poliatomic <|> monoatomic


-- a poliatomic compound starts with a '(', then has many parts inside, then
-- ends with ')' and has a number after it which indicates how many of it there
-- are.
poliatomic :: Parser Compound
poliatomic = do
  char '('
  inner <- many1 part
  char ')'
  amount <- many1 digit
  return (Poliatomic inner (read amount))


-- a monoatomic compound is one of the many element names, followed by an
-- optional digit. if omitted, the amount defaults to 1.
--
-- "try" is a little special, and required in this case. it means "if a parser
-- fails, try the next one from where you started, not from where the last one
-- failed."
--
-- "choice" means "try all parsers in this list, stop when one matches"
--
-- "many" means "zero or more"
monoatomic :: Parser Compound
monoatomic = do
  name <- choice [try nameParser | nameParser <- atomstrings]
  amount <- many digit
  return (Monoatomic name (fromMaybe 1 (readMaybe amount)))


-- a list of parser for atom names. it is IMPORTANT that the longest names
-- come first. the reason for that is that it makes the parser much simpler to
-- write, and it can execute much faster. it common when designing parsers to
-- consider things like that when creating them.
atomstrings :: [Parser String]
atomstrings = map string (words "He Li Be Ne Na Mg Al Ca H B C N O F")

I've tried to write this code in a way that should be at least reasonably accessible to newbies, but it's probably not crystal clear, so I'm happy to answer any questions about it.

The parser above is the one you wanted. However, this is not the one I would write if I had free reins. If I did but wanted to, I would use the fact that

Ca(OH)2

can be represented as

(Ca)1((O)1(H)1)2

which is a much more homogeneous representation and in turn results in a simpler data structure and parser with fewer templates. The code I would rather write would look like

import Text.Read (readMaybe)
import Data.Maybe (fromMaybe)
import Control.Applicative ((<$>), (<*>), pure)
import Text.Parsec (parse, many, many1, digit, char, string, (<|>), choice, try, between)
import Text.Parsec.String (Parser)


data Substance
  = Part [Substance] Int
  | Atom String
  deriving Show


main = print (parse substance "" "Ca(OH)2")
-- Right (Part [Part [Atom "Ca"] 1,Part [Part [Atom "O"] 1,Part [Atom "H"] 1] 2] 1)

substance :: Parser Substance
substance = Part <$> many1 part <*> pure 1

part :: Parser Substance
part = do
  inner <- polyatomic <|> monoatomic
  amount <- fromMaybe 1 . readMaybe <$> many digit
  return (Part inner amount)

polyatomic :: Parser [Substance]
polyatomic = between (char '(') (char ')') (many1 part)

monoatomic :: Parser [Substance]
monoatomic = (:[]) . Atom <$> choice (map (try . string) atomstrings)

atomstrings :: [String]
atomstrings = words "He Li Be Ne Na Mg Al Ca H B C N O F"

This uses a few "advanced" tricks in Haskell (such as the <$>

and operators <*>

), so you might not be interested, OP, but I'm putting it in other people who might be more advanced Haskell users and Parsec learners.

This parser is about half a page as you can see, and that libraries like Parsec make them easy and fun to write parsers!

Analyzing chemical compounds in Haskell

More articles: