Parsing a chemical formula with mixtures of elements

I would like to use boost::spirit to extract the stoichiometry of compounds made of several elements from their chemical formula. Within a given compound, my parser should be able to distinguish three kinds of chemical element patterns:

  • natural element composed of a mixture of isotopes in natural abundance
  • pure isotope
  • a mixture of isotopes in unnatural abundance

These patterns are then used to parse compounds such as the following:

  • "C" → natural carbon, composed of C[12] and C[13] in natural abundance
  • "CH4" → methane made of natural carbon and hydrogen
  • "C2H{H[1](0.8)H[2](0.2)}6" → ethane made of natural C and non-natural H, the latter composed of 80% hydrogen and 20% deuterium
  • "U[235]" → pure uranium 235

Obviously, the chemical element patterns can appear in any order (for example, CH[1]4 and H[1]4C ...) and with any frequency.

I wrote my own parser that is pretty close to making this work, but I am still running into one problem.

Here is my code:

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator,isotopesMixture(),qi::locals<isotopesMixture,double>>
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {

        namespace phx = boost::phoenix;

        // Semantic action for handling the case of pure isotope    
        phx::function<PureIsotopeBuilder> const build_pure_isotope = PureIsotopeBuilder();
        // Semantic action for handling the case of pure isotope mixture   
        phx::function<IsotopesMixtureBuilder> const build_isotopes_mixture = IsotopesMixtureBuilder();
        // Semantic action for handling the case of natural element   
        phx::function<NaturalElementBuilder> const build_natural_element = NaturalElementBuilder();

        phx::function<UpdateElement> const update_element = UpdateElement();

        // XML database that stores all the isotopes of the periodic table
        ChemicalDatabaseManager<Isotope>* imgr=ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();
        // Loop over the database to fill the Spirit symbol tables for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.getProperty<std::string>("symbol"),isotope.second.getProperty<std::string>("symbol"));
        }

        _mixtureToken = "{" >> +(_isotopeNames >> "(" >> qi::double_ >> ")") >> "}";
        _isotopesMixtureToken = (_elementSymbols[qi::_a=qi::_1] >> _mixtureToken[qi::_b=qi::_1])[qi::_pass=build_isotopes_mixture(qi::_val,qi::_a,qi::_b)];

        _pureIsotopeToken = (_isotopeNames[qi::_a=qi::_1])[qi::_pass=build_pure_isotope(qi::_val,qi::_a)];
        _naturalElementToken = (_elementSymbols[qi::_a=qi::_1])[qi::_pass=build_natural_element(qi::_val,qi::_a)];

        _start = +( ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken)[qi::_a=qi::_1] >>
                      (qi::double_|qi::attr(1.0))[qi::_b=qi::_1])[qi::_pass=update_element(qi::_val,qi::_a,qi::_b)] );

    }

    //! Defines the rule for matching a prefix
    qi::symbols<char,std::string> _isotopeNames;
    qi::symbols<char,std::string> _elementSymbols;

    qi::rule<Iterator,isotopesMixture()> _mixtureToken;
    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string,isotopesMixture>> _isotopesMixtureToken;

    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string>> _pureIsotopeToken;
    qi::rule<Iterator,isotopesMixture(),qi::locals<std::string>> _naturalElementToken;

    qi::rule<Iterator,isotopesMixture(),qi::locals<isotopesMixture,double>> _start;
};

      

Basically, each element pattern on its own is parsed correctly by its semantic action, which produces a map between the isotopes that make up the compound and their corresponding stoichiometry. The problem starts when parsing the following compound:

CH{H[1](0.9)H[2](0.4)}

In this case, the semantic action build_isotopes_mixture returns false, because 0.9 + 0.4 makes no sense as a sum of ratios. So I would expect, and want, my parser to fail for this compound. However, because the _start rule uses the alternative operator over the three kinds of chemical element pattern, the parser still manages to parse it by 1) discarding the {H[1](0.9)H[2](0.4)} part, 2) keeping the preceding H, and 3) parsing it with _naturalElementToken.

Is my grammar not clear enough to be expressed as a parser? How should I use the alternative operator so that, when a match is found but its semantic action returns false, the parser stops?

1 answer


How should I use the alternative operator so that, when a match is found but its semantic action returns false, the parser stops?

In general, you achieve this by adding an expectation point to prevent backtracking.

In this case, you are effectively "conflating" several tasks:

  • matching input
  • interpreting matched input
  • validating matched input

Spirit excels at matching input and has great facilities for interpreting it (mostly in the sense of creating an AST). However, things get nasty with on-the-fly validation.

A piece of advice I often repeat is to separate the concerns whenever possible. I would consider:

  • building a direct AST representation of the input first,
  • transforming / normalizing / expanding / canonicalizing it into a more convenient or meaningful domain representation,
  • performing final validations on the result.

This gives you the most expressive code while keeping it highly maintainable.
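As a rough illustration of the last step only, here is a Spirit-independent sketch of a validation pass that runs after parsing has finished. The types AstElement and AstFormula, the function validate and the tolerance value are hypothetical and not taken from the question's code; the real domain model will certainly differ.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <utility>
#include <vector>

// Hypothetical AST node: a direct record of what the formula said.
struct AstElement {
    std::string symbol;                                   // e.g. "H"
    std::vector<std::pair<std::string, double>> mixture;  // e.g. {"H[1]", 0.8}; empty if not given
    double count;                                         // stoichiometric coefficient
};
using AstFormula = std::vector<AstElement>;

// Validation pass, run after parsing has finished.
// Returns an empty string if every explicit mixture sums to 1 (within tolerance),
// otherwise a message naming the first offending element.
std::string validate(AstFormula const& formula, double tolerance = 1e-5) {
    assert(tolerance > 0.0);
    for (auto const& element : formula) {
        if (element.mixture.empty())
            continue; // no explicit fractions: nothing to check
        double total = 0.0;
        for (auto const& component : element.mixture)
            total += component.second;
        if (std::abs(1.0 - total) > tolerance)
            return "fractions of '" + element.symbol + "' sum to " + std::to_string(total);
    }
    return std::string();
}
```

With such a split, CH{H[1](0.9)H[2](0.4)} would parse into an AST without complaint, and validate would then report the H mixture, so backtracking in the grammar can no longer hide the error.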

Because I don't know the problem domain well enough, and the sample code is not complete enough to infer it, I will not try to give a complete sample of what I have in mind. Instead, I'll do my best to sketch the expectation point approach that I mentioned at the outset.

Mock-up sample to compile

This took the most time. (Consider doing the legwork for the people who are going to help you.)

Live On Coliru

#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iostream>
#include <list>
#include <map>
#include <stdexcept>
#include <string>

namespace qi = boost::spirit::qi;

struct DummyBuilder {
    using result_type = bool;

    template <typename... Ts>
    bool operator()(Ts&&...) const { return true; }
};

struct PureIsotopeBuilder     : DummyBuilder {  };
struct IsotopesMixtureBuilder : DummyBuilder {  };
struct NaturalElementBuilder  : DummyBuilder {  };
struct UpdateElement          : DummyBuilder {  };

struct Isotope {
    std::string getName() const { return _name; }

    Isotope(std::string const& name = "unnamed", std::string const& symbol = "?") : _name(name), _symbol(symbol) { }

    template <typename T> std::string getProperty(std::string const& name) const {
        if (name == "symbol")
            return _symbol;
        throw std::domain_error("no such property (" + name + ")");
    }

  private:
    std::string _name, _symbol;
};

using MixComponent    = std::pair<Isotope, double>;
using isotopesMixture = std::list<MixComponent>;

template <typename Isotope>
struct ChemicalDatabaseManager {
    static ChemicalDatabaseManager* Instance() {
        static ChemicalDatabaseManager s_instance;
        return &s_instance;
    }

    auto& getDatabase() { return _db; }
  private:
    std::map<int, Isotope> _db {
        { 1, { "H[1]",   "H" } },
        { 2, { "H[2]",   "H" } },
        { 3, { "Carbon", "C" } },
        { 4, { "U[235]", "U" } },
    };
};

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator, isotopesMixture(), qi::locals<isotopesMixture, double> >
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {
        using namespace qi;
        namespace phx = boost::phoenix;

        phx::function<PureIsotopeBuilder>     build_pure_isotope;     // Semantic action for handling the case of pure isotope
        phx::function<IsotopesMixtureBuilder> build_isotopes_mixture; // Semantic action for handling the case of pure isotope mixture
        phx::function<NaturalElementBuilder>  build_natural_element;  // Semantic action for handling the case of natural element
        phx::function<UpdateElement>          update_element;

        // XML database that stores all the isotopes of the periodic table
        ChemicalDatabaseManager<Isotope>* imgr = ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();

        // Loop over the database to fill the Spirit symbol tables for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.template getProperty<std::string>("symbol"),isotope.second.template getProperty<std::string>("symbol"));
        }

        _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}";
        _isotopesMixtureToken = (_elementSymbols[_a=_1] >> _mixtureToken[_b=_1])[_pass=build_isotopes_mixture(_val,_a,_b)];

        _pureIsotopeToken     = (_isotopeNames[_a=_1])[_pass=build_pure_isotope(_val,_a)];
        _naturalElementToken  = (_elementSymbols[_a=_1])[_pass=build_natural_element(_val,_a)];

        _start = +( ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken)[_a=_1] >>
                    (double_|attr(1.0))[_b=_1]) [_pass=update_element(_val,_a,_b)] );
    }

  private:
    //! Defines the rule for matching a prefix
    qi::symbols<char, std::string> _isotopeNames;
    qi::symbols<char, std::string> _elementSymbols;

    qi::rule<Iterator, isotopesMixture()> _mixtureToken;
    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string, isotopesMixture> > _isotopesMixtureToken;

    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string> > _pureIsotopeToken;
    qi::rule<Iterator, isotopesMixture(), qi::locals<std::string> > _naturalElementToken;

    qi::rule<Iterator, isotopesMixture(), qi::locals<isotopesMixture, double> > _start;
};

int main() {
    using It = std::string::const_iterator;
    ChemicalFormulaParser<It> parser;
    for (std::string const input : {
            "C",                        // --> natural carbon made of C[12] and C[13] in natural abundance
            "CH4",                      // --> methane made of natural carbon and hydrogen
            "C2H{H[1](0.8)H[2](0.2)}6", // --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
            "C2H{H[1](0.9)H[2](0.2)}6", // --> invalid mixture (total is 110%?)
            "U[235]",                   // --> pure uranium 235
        })
    {
        std::cout << " ============= '" << input << "' ===========\n";
        It f = input.begin(), l = input.end();
        isotopesMixture mixture;
        bool ok = qi::parse(f, l, parser, mixture);

        if (ok)
            std::cout << "Parsed successfully\n";
        else
            std::cout << "Parse failure\n";

        if (f != l)
            std::cout << "Remaining input unparsed: '" << std::string(f, l) << "'\n";
    }
}

      

Which, as given, just prints

 ============= 'C' ===========
Parsed successfully
 ============= 'CH4' ===========
Parsed successfully
 ============= 'C2H{H[1](0.8)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'C2H{H[1](0.9)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'U[235]' ===========
Parsed successfully

      



General remarks:

  • no need for locals, just use regular placeholders:

    _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}";
    _isotopesMixtureToken = (_elementSymbols >> _mixtureToken) [ _pass=build_isotopes_mixture(_val, _1, _2) ];
    
    _pureIsotopeToken     = _isotopeNames [ _pass=build_pure_isotope(_val, _1) ];
    _naturalElementToken  = _elementSymbols [ _pass=build_natural_element(_val, _1) ];
    
    _start = +( 
            ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken) >>
              (double_|attr(1.0)) ) [ _pass=update_element(_val, _1, _2) ] 
        );
    
    // ....
    qi::rule<Iterator, isotopesMixture()> _mixtureToken;
    qi::rule<Iterator, isotopesMixture()> _isotopesMixtureToken;
    qi::rule<Iterator, isotopesMixture()> _pureIsotopeToken;
    qi::rule<Iterator, isotopesMixture()> _naturalElementToken;
    qi::rule<Iterator, isotopesMixture()> _start;
    
          

  • you will want to handle conflicts between names and symbols (perhaps simply by prioritizing one over the other)

  • conforming compilers will require the template qualifier, as in isotope.second.template getProperty<std::string>("symbol") (unless I totally misguessed your data structure, in which case I don't know what the template argument to ChemicalDatabaseManager was supposed to mean).

    Hint: MSVC is not a standards-conforming compiler.
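The name/symbol conflict mentioned in the remarks above arises because an element symbol such as H is a prefix of isotope names such as H[1]. One possible way of prioritizing, sketched here without Spirit, is to try the isotope names first and fall back to the element symbols, taking the longest match within each table. The Token struct and the functions longest_prefix and match_token are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: which table matched and how many characters.
struct Token {
    enum Kind { None, IsotopeName, ElementSymbol } kind;
    std::size_t length;
};

// Length of the longest entry of `table` that is a prefix of `input` at `pos`
// (0 if none matches). Requires pos <= input.size().
std::size_t longest_prefix(std::vector<std::string> const& table,
                           std::string const& input, std::size_t pos) {
    assert(pos <= input.size());
    std::size_t best = 0;
    for (auto const& entry : table)
        if (entry.size() > best && input.compare(pos, entry.size(), entry) == 0)
            best = entry.size();
    return best;
}

// Isotope names take priority over element symbols, so "H[1]" is never read as "H".
Token match_token(std::vector<std::string> const& isotopes,
                  std::vector<std::string> const& symbols,
                  std::string const& input, std::size_t pos) {
    if (std::size_t n = longest_prefix(isotopes, input, pos)) return {Token::IsotopeName, n};
    if (std::size_t n = longest_prefix(symbols, input, pos))  return {Token::ElementSymbol, n};
    return {Token::None, 0};
}
```

In the Spirit grammar the same priority is expressed by the order of the branches in the alternative: whichever parser is listed first gets the first attempt.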

Live On Coliru

Expectation point sketch

Assuming the "weights" need to add up to 100% inside the _mixtureToken rule, we could make build_isotopes_mixture "non-dummy" and add the validation there:

struct IsotopesMixtureBuilder {
    bool operator()(isotopesMixture&/* output*/, std::string const&/* elementSymbol*/, isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        // validate weights total only
        return std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
    }
};

      

However, as you noticed, backtracking thwarts this. Instead, you can assert that any complete mixture adds up to 100%:

_mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}" > eps(validate_weight_total(_val));

      

With something like

struct ValidateWeightTotal {
    bool operator()(isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        bool ok = std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
        return ok;
        // or perhaps throw a dedicated exception instead:
        // return ok? ok : throw InconsistentsWeights {};
    }

    struct InconsistentsWeights : virtual std::runtime_error {
        InconsistentsWeights() : std::runtime_error("InconsistentsWeights") {}
    };
};

      

Live On Coliru

#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/range/numeric.hpp>
#include <cmath>
#include <iostream>
#include <list>
#include <map>
#include <stdexcept>
#include <string>

namespace qi = boost::spirit::qi;

struct DummyBuilder {
    using result_type = bool;

    template <typename... Ts>
    bool operator()(Ts&&...) const { return true; }
};

struct PureIsotopeBuilder     : DummyBuilder {  };
struct NaturalElementBuilder  : DummyBuilder {  };
struct UpdateElement          : DummyBuilder {  };

struct Isotope {
    std::string getName() const { return _name; }

    Isotope(std::string const& name = "unnamed", std::string const& symbol = "?") : _name(name), _symbol(symbol) { }

    template <typename T> std::string getProperty(std::string const& name) const {
        if (name == "symbol")
            return _symbol;
        throw std::domain_error("no such property (" + name + ")");
    }

  private:
    std::string _name, _symbol;
};

using MixComponent    = std::pair<Isotope, double>;
using isotopesMixture = std::list<MixComponent>;

struct IsotopesMixtureBuilder {
    bool operator()(isotopesMixture&/* output*/, std::string const&/* elementSymbol*/, isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        // validate weights total only
        return std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
    }
};

struct ValidateWeightTotal {
    bool operator()(isotopesMixture const& mixture) const {
        using namespace boost::adaptors;

        bool ok = std::abs(1.0 - boost::accumulate(mixture | map_values, 0.0)) < 0.00001;
        return ok;
        // or perhaps throw a dedicated exception instead:
        // return ok? ok : throw InconsistentsWeights {};
    }

    struct InconsistentsWeights : virtual std::runtime_error {
        InconsistentsWeights() : std::runtime_error("InconsistentsWeights") {}
    };
};

template <typename Isotope>
struct ChemicalDatabaseManager {
    static ChemicalDatabaseManager* Instance() {
        static ChemicalDatabaseManager s_instance;
        return &s_instance;
    }

    auto& getDatabase() { return _db; }
  private:
    std::map<int, Isotope> _db {
        { 1, { "H[1]",   "H" } },
        { 2, { "H[2]",   "H" } },
        { 3, { "Carbon", "C" } },
        { 4, { "U[235]", "U" } },
    };
};

template <typename Iterator>
struct ChemicalFormulaParser : qi::grammar<Iterator, isotopesMixture()>
{
    ChemicalFormulaParser(): ChemicalFormulaParser::base_type(_start)
    {
        using namespace qi;
        namespace phx = boost::phoenix;

        phx::function<PureIsotopeBuilder>     build_pure_isotope;     // Semantic action for handling the case of pure isotope
        phx::function<IsotopesMixtureBuilder> build_isotopes_mixture; // Semantic action for handling the case of pure isotope mixture
        phx::function<NaturalElementBuilder>  build_natural_element;  // Semantic action for handling the case of natural element
        phx::function<UpdateElement>          update_element;
        phx::function<ValidateWeightTotal>    validate_weight_total;

        // XML database that stores all the isotopes of the periodic table
        ChemicalDatabaseManager<Isotope>* imgr = ChemicalDatabaseManager<Isotope>::Instance();
        const auto& isotopeDatabase=imgr->getDatabase();

        // Loop over the database to fill the Spirit symbol tables for the isotope names (e.g. H[1],C[14]) and the elements (e.g. H,C)
        for (const auto& isotope : isotopeDatabase) {
            _isotopeNames.add(isotope.second.getName(),isotope.second.getName());
            _elementSymbols.add(isotope.second.template getProperty<std::string>("symbol"), isotope.second.template getProperty<std::string>("symbol"));
        }

        _mixtureToken         = "{" >> +(_isotopeNames >> "(" >> double_ >> ")") >> "}" > eps(validate_weight_total(_val));
        _isotopesMixtureToken = (_elementSymbols >> _mixtureToken) [ _pass=build_isotopes_mixture(_val, _1, _2) ];

        _pureIsotopeToken     = _isotopeNames [ _pass=build_pure_isotope(_val, _1) ];
        _naturalElementToken  = _elementSymbols [ _pass=build_natural_element(_val, _1) ];

        _start = +( 
                ( (_isotopesMixtureToken | _pureIsotopeToken | _naturalElementToken) >>
                  (double_|attr(1.0)) ) [ _pass=update_element(_val, _1, _2) ] 
            );
    }

  private:
    //! Defines the rule for matching a prefix
    qi::symbols<char, std::string> _isotopeNames;
    qi::symbols<char, std::string> _elementSymbols;

    qi::rule<Iterator, isotopesMixture()> _mixtureToken;
    qi::rule<Iterator, isotopesMixture()> _isotopesMixtureToken;
    qi::rule<Iterator, isotopesMixture()> _pureIsotopeToken;
    qi::rule<Iterator, isotopesMixture()> _naturalElementToken;
    qi::rule<Iterator, isotopesMixture()> _start;
};

int main() {
    using It = std::string::const_iterator;
    ChemicalFormulaParser<It> parser;
    for (std::string const input : {
            "C",                        // --> natural carbon made of C[12] and C[13] in natural abundance
            "CH4",                      // --> methane made of natural carbon and hydrogen
            "C2H{H[1](0.8)H[2](0.2)}6", // --> ethane made of natural C and non-natural H made of 80% of hydrogen and 20% of deuterium
            "C2H{H[1](0.9)H[2](0.2)}6", // --> invalid mixture (total is 110%?)
            "U[235]",                   // --> pure uranium 235
        }) try 
    {
        std::cout << " ============= '" << input << "' ===========\n";
        It f = input.begin(), l = input.end();
        isotopesMixture mixture;
        bool ok = qi::parse(f, l, parser, mixture);

        if (ok)
            std::cout << "Parsed successfully\n";
        else
            std::cout << "Parse failure\n";

        if (f != l)
            std::cout << "Remaining input unparsed: '" << std::string(f, l) << "'\n";
    } catch(std::exception const& e) {
        std::cout << "Caught exception '" << e.what() << "'\n";
    }
}

      

Printing

 ============= 'C' ===========
Parsed successfully
 ============= 'CH4' ===========
Parsed successfully
 ============= 'C2H{H[1](0.8)H[2](0.2)}6' ===========
Parsed successfully
 ============= 'C2H{H[1](0.9)H[2](0.2)}6' ===========
Caught exception 'boost::spirit::qi::expectation_failure'
 ============= 'U[235]' ===========
Parsed successfully

      
