Internal Boost :: Spirit code segfaults when parsing complex grammar

Question

Internal Boost :: Spirit code segfaults when parsing complex grammar

I'm trying to use Spirit to parse form expressions Module1.Module2.value

(any number of separable dots with an id, then a dot, then an OCaml style lowercase id). My current definition of a parser looks like this:

using namespace boost::spirit::qi;

template <typename Iter=std::string::iterator>
struct value_path : grammar<Iter, boost::tuple<std::vector<std::string>, std::string>()> {
    value_path() :
        value_path::base_type(start)
    {
        start = -(module_path<Iter>() >> '.') >> value_name<Iter>();
    }
    rule<Iter, boost::tuple<std::vector<std::string>, std::string>()> start;
};

where module_path

and value_name

are similar template structures inheriting from qi::grammar

with a single field start

assigned some Spirit rule, possibly using other custom grammars (for example, value_name

depends on lowercase_ident

and operator_name

, which are defined similarly) in the constructor.

When trying parse_phrase()

with this grammar, the segfaults program is somewhere inside the internal organs of Spirit (according to gdb). An equivalent definition where the constructor value_path

looks like this (I basically unwrapped all the custom grammars it depended on, leaving only the inline parses and trying to make it readable, which in retrospect was a crazy errand):

start =
-((raw[upper >> *(alnum | char_('_') | char_('\''))] % '.') >> '.')
>> lexeme[((lower | char_('_')) >> *(alnum | char_('_') | char_('\'')))
         | char_('(') >>
             ( ( (char_('!') >> *char_("-+!$%&*./:<=>?@^|~")
                 | (char_("~?") >> +char_("-+!$%&*./:<=>?@^|~"))
                 | ( (char_("-+=<>@^|&*/$%") >> *char_("-+!$%&*./:<=>?@^|~"))
                   | string("mod")
                   | string("lor")
                   | string("lsl")
                   | string("lsr")
                   | string("asr")
                   | string("or")
                   | string("-.")
                   | string("!=")
                   | string("||")
                   | string("&&")
                   | string(":=")
                   | char_("*+=<>&-")
                   )
                 ) >> char_(')')
               )
             )
         ];

not segfault and it seems to work correctly, however I would rather avoid something so verbose and unreadable in my code. It also doesn't expand at all.

So far, I have tried various combinations .alias()

, and also saved value_name<Iter>()

, module_path<Iter>()

and all intermediate grammars along the dependency chain into their own fields. None of them worked. How can I keep the abstraction of the first example high? Is there a standard way of writing grammars in the Spirit that doesn't run into problems?

+3

c ++ boost parsing boost-spirit

More Axes Dec 28. 14 at 16:20

source to share

1 answer

sehe · Accepted Answer · 2014-12-28T20:30:22+0000

You're in trouble because expression templates keep internal references to temporary ones.

Just copy the instances of the subparallel:

template <typename Iter=std::string::iterator>
struct value_path : grammar<Iter, boost::tuple<std::vector<std::string>, std::string>()> {
    value_path() : value_path::base_type(start)
    {
        start = -(module_path_ >> '.') >> value_name_;
    }
  private:

    rule<Iter, boost::tuple<std::vector<std::string>, std::string>()> start;
    module_path<Iter> module_path_;
    value_name<Iter> value_name_;
};

Notes. I feel like it might be a design scent to use separate subgrams for such small items. While grammar decomposition is often a good idea to keep build times manageable and the code size is somewhat lower, it looks like you might be overdoing it from the description here.

Plastering parser expressions behind qi::rule

(effectively erasing erasure) has perhaps a significant portion of the runtime. If you subsequently instantiate more than one type of iterator, you can compound this with unnecessary binary growth.

UPDATE As for the idiomatic way of spelling out your grammars in Spirit, here I take:

Live On Coliru

using namespace ascii;
using qi::raw;

lowercase_ident  = raw[ (lower | '_') >> *(alnum | '_' | '\'') ];
module_path_item = raw[ upper >> *(alnum | '_' | '\'') ];
module_path_     = module_path_item % '.';

auto special_char = boost::proto::deep_copy(char_("-+!$%&*./:<=>?@^|~"));

operator_name = qi::raw [
          ('!' >> *special_char)                          /* branch 1     */
        | (char_("~?") >> +special_char)                  /* branch 2     */
        | (!char_(".:") >> special_char >> *special_char) /* branch 3     */
        | "mod"                                           /* branch 4     */
        | "lor" | "lsl" | "lsr" | "asr" | "or"            /* branch 5-9   */
        | "-."                                            /* branch 10    doesn't match because of branch 3   */
        | "!=" | "||" | "&&" | ":="                       /* branch 11-14 doesn't match because of branch 1,3 */
     // | (special_char - char_("!$%./:?@^|~"))           /* "*+=<>&-" cannot match because of branch 3 */
    ]
    ;

value_name_  = 
      lowercase_ident
    | '(' >> operator_name >> ')'
    ;

start = -(module_path_ >> '.') >> value_name_;

If the rules are fields declared as:

qi::rule<Iter, ast::value_path(),  Skipper> start;
qi::rule<Iter, ast::module_path(), Skipper> module_path_;

// lexeme: (no skipper)
qi::rule<Iter, std::string()> value_name_, module_path_item, lowercase_ident, operator_name;

Notes:

I added a skipper because since your grammar was value_path

not using one of them, any skipper you passed to qi::phrase_parse

was ignored
The tokens just remove the skipper from the rule declaration type, so you don't even need to specify qi::lexeme[]
In tokens, I copied your intent to just copy the processed text verbatim using qi::raw

. This allows us to write grammars more concisely (using '!'

instead of char_('!')

, "mod"

instead of qi::string("mod")

). Note that bare literals are implicitly transformed into "non-capturing" qi::lit(...)

nodes in the context of the Qi parser expression, but since we used it raw[]

anyway, the fact that the lit

attribute is not mapping is not a problem.

I think this leads to a completely contradictory definition of grammar that should satisfy your "high level" criteria. There are some wtf-y-ness with the grammar itself (regardless of its expression, any parser generator language):

I've simplified the rule operator_name

by removing the nesting of the alternate branches, which will have the same effect as the simplified simplified alternate list
I've reworked the "magic" special character lists into special_chars

In alternate thread 3, for example, I noticed negative assertion exceptions:

(!char_(".:") >> special_char >> *special_char) /* branch 3     */

The statement !char_(".:")

says when the input will not match '.'

or ':'

continue matching (any sequence of special characters). In fact, you could equivalently write it as:

((special_char - '.' - ':') >> *special_char) /* branch 3     */

or even, since I ended up writing it:

(!char_(".:") >> +special_char) /* branch 3     */

Simplifying the branches actually increases the level of abstraction! Now it becomes clear that some branches will never match, because the previous branches match input by definition:

   | "-."                                    /* branch 10    doesn't match because of branch 3   */
   | "!=" | "||" | "&&" | ":="               /* branch 11-14 doesn't match because of branch 1,3 */
// | (special_char - char_("!$%./:?@^|~"))   /* "*+=<>&-" cannot match because of branch 3 */

I hope you can understand why I qualify this part of the grammar as "a little wtf-y" :) I am guessing that now you get confused or something went wrong when you reduced it to one rule (your "crazy errand").

Some additional improvements should be noted:

I added the correct AST structure instead boost::tuple<>

to make the code clearer
I have added BOOST_SPIRIT_DEBUG * macros so you can debug high level grammar (rule level)
I removed the blanket using namespace

. This is usually a bad idea. And with Spirit, this is often a very bad idea (it can lead to ambiguity that is insoluble or very difficult to spot errors). As you can see, this does not necessarily result in very verbose code.

Full list

#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>

namespace qi    = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

namespace ast {
    using module_path = std::vector<std::string>;
    struct value_path {
        module_path module;
        std::string   value_expr;
    };
}

BOOST_FUSION_ADAPT_STRUCT(ast::value_path, (ast::module_path, module)(std::string,value_expr))

template <typename Iter, typename Skipper = ascii::space_type>
struct value_path : qi::grammar<Iter, ast::value_path(), Skipper> {
    value_path() : value_path::base_type(start)
    {
        using namespace ascii;
        using qi::raw;

        lowercase_ident  = raw[ (lower | '_') >> *(alnum | '_' | '\'') ];
        module_path_item = raw[ upper >> *(alnum | '_' | '\'') ];
        module_path_     = module_path_item % '.';

        auto special_char = boost::proto::deep_copy(char_("-+!$%&*./:<=>?@^|~"));

        operator_name = qi::raw [
                  ('!'          >> *special_char)         /* branch 1     */
                | (char_("~?")  >> +special_char)         /* branch 2     */
                | (!char_(".:") >> +special_char)         /* branch 3     */
                | "mod"                                   /* branch 4     */
                | "lor" | "lsl" | "lsr" | "asr" | "or"    /* branch 5-9   */
                | "-."                                    /* branch 10    doesn't match because of branch 3   */
                | "!=" | "||" | "&&" | ":="               /* branch 11-14 doesn't match because of branch 1,3 */
             // | (special_char - char_("!$%./:?@^|~"))   /* "*+=<>&-" cannot match because of branch 3 */
            ]
            ;

        value_name_  = 
              lowercase_ident
            | '(' >> operator_name >> ')'
            ;

        start = -(module_path_ >> '.') >> value_name_;

        BOOST_SPIRIT_DEBUG_NODES((start)(module_path_)(value_name_)(module_path_item)(lowercase_ident)(operator_name))
    }
  private:
    qi::rule<Iter, ast::value_path(),  Skipper> start;
    qi::rule<Iter, ast::module_path(), Skipper> module_path_;

    // lexeme: (no skipper)
    qi::rule<Iter, std::string()> value_name_, module_path_item, lowercase_ident, operator_name;
};

int main()
{
    for (std::string const input : { 
            "Some.Module.Package.ident",
            "ident",
            "A.B.C_.mod",    // as lowercase_ident
            "A.B.C_.(mod)",  // as operator_name (branch 4)
            "A.B.C_.(!=)",   // as operator_name (branch 1)
            "(!)"            // as operator_name (branch 1)
            })
    {
        std::cout << "--------------------------------------------------------------\n";
        std::cout << "Parsing '" << input << "'\n";

        using It = std::string::const_iterator;
        It f(input.begin()), l(input.end());

        value_path<It> g;
        ast::value_path data;
        bool ok = qi::phrase_parse(f, l, g, ascii::space, data);
        if (ok) {
            std::cout << "Parse succeeded\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f!=l)
            std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
    }
}

Debug output

--------------------------------------------------------------
Parsing 'Some.Module.Package.ident'
<start>
  <try>Some.Module.Package.</try>
  <module_path_>
    <try>Some.Module.Package.</try>
    <module_path_item>
      <try>Some.Module.Package.</try>
      <success>.Module.Package.iden</success>
      <attributes>[[S, o, m, e]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>Module.Package.ident</try>
      <success>.Package.ident</success>
      <attributes>[[M, o, d, u, l, e]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>Package.ident</try>
      <success>.ident</success>
      <attributes>[[P, a, c, k, a, g, e]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>ident</try>
      <fail/>
    </module_path_item>
    <success>.ident</success>
    <attributes>[[[S, o, m, e], [M, o, d, u, l, e], [P, a, c, k, a, g, e]]]</attributes>
  </module_path_>
  <value_name_>
    <try>ident</try>
    <lowercase_ident>
      <try>ident</try>
      <success></success>
      <attributes>[[i, d, e, n, t]]</attributes>
    </lowercase_ident>
    <success></success>
    <attributes>[[i, d, e, n, t]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[[S, o, m, e], [M, o, d, u, l, e], [P, a, c, k, a, g, e]], [i, d, e, n, t]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing 'ident'
<start>
  <try>ident</try>
  <module_path_>
    <try>ident</try>
    <module_path_item>
      <try>ident</try>
      <fail/>
    </module_path_item>
    <fail/>
  </module_path_>
  <value_name_>
    <try>ident</try>
    <lowercase_ident>
      <try>ident</try>
      <success></success>
      <attributes>[[i, d, e, n, t]]</attributes>
    </lowercase_ident>
    <success></success>
    <attributes>[[i, d, e, n, t]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[], [i, d, e, n, t]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing 'A.B.C_.mod'
<start>
  <try>A.B.C_.mod</try>
  <module_path_>
    <try>A.B.C_.mod</try>
    <module_path_item>
      <try>A.B.C_.mod</try>
      <success>.B.C_.mod</success>
      <attributes>[[A]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>B.C_.mod</try>
      <success>.C_.mod</success>
      <attributes>[[B]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>C_.mod</try>
      <success>.mod</success>
      <attributes>[[C, _]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>mod</try>
      <fail/>
    </module_path_item>
    <success>.mod</success>
    <attributes>[[[A], [B], [C, _]]]</attributes>
  </module_path_>
  <value_name_>
    <try>mod</try>
    <lowercase_ident>
      <try>mod</try>
      <success></success>
      <attributes>[[m, o, d]]</attributes>
    </lowercase_ident>
    <success></success>
    <attributes>[[m, o, d]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[[A], [B], [C, _]], [m, o, d]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing 'A.B.C_.(mod)'
<start>
  <try>A.B.C_.(mod)</try>
  <module_path_>
    <try>A.B.C_.(mod)</try>
    <module_path_item>
      <try>A.B.C_.(mod)</try>
      <success>.B.C_.(mod)</success>
      <attributes>[[A]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>B.C_.(mod)</try>
      <success>.C_.(mod)</success>
      <attributes>[[B]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>C_.(mod)</try>
      <success>.(mod)</success>
      <attributes>[[C, _]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>(mod)</try>
      <fail/>
    </module_path_item>
    <success>.(mod)</success>
    <attributes>[[[A], [B], [C, _]]]</attributes>
  </module_path_>
  <value_name_>
    <try>(mod)</try>
    <lowercase_ident>
      <try>(mod)</try>
      <fail/>
    </lowercase_ident>
    <operator_name>
      <try>mod)</try>
      <success>)</success>
      <attributes>[[m, o, d]]</attributes>
    </operator_name>
    <success></success>
    <attributes>[[m, o, d]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[[A], [B], [C, _]], [m, o, d]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing 'A.B.C_.(!=)'
<start>
  <try>A.B.C_.(!=)</try>
  <module_path_>
    <try>A.B.C_.(!=)</try>
    <module_path_item>
      <try>A.B.C_.(!=)</try>
      <success>.B.C_.(!=)</success>
      <attributes>[[A]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>B.C_.(!=)</try>
      <success>.C_.(!=)</success>
      <attributes>[[B]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>C_.(!=)</try>
      <success>.(!=)</success>
      <attributes>[[C, _]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>(!=)</try>
      <fail/>
    </module_path_item>
    <success>.(!=)</success>
    <attributes>[[[A], [B], [C, _]]]</attributes>
  </module_path_>
  <value_name_>
    <try>(!=)</try>
    <lowercase_ident>
      <try>(!=)</try>
      <fail/>
    </lowercase_ident>
    <operator_name>
      <try>!=)</try>
      <success>)</success>
      <attributes>[[!, =]]</attributes>
    </operator_name>
    <success></success>
    <attributes>[[!, =]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[[A], [B], [C, _]], [!, =]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing '(!)'
<start>
  <try>(!)</try>
  <module_path_>
    <try>(!)</try>
    <module_path_item>
      <try>(!)</try>
      <fail/>
    </module_path_item>
    <fail/>
  </module_path_>
  <value_name_>
    <try>(!)</try>
    <lowercase_ident>
      <try>(!)</try>
      <fail/>
    </lowercase_ident>
    <operator_name>
      <try>!)</try>
      <success>)</success>
      <attributes>[[!]]</attributes>
    </operator_name>
    <success></success>
    <attributes>[[!]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[], [!]]]</attributes>
</start>
Parse succeeded

Internal Boost :: Spirit code segfaults when parsing complex grammar

Full list

Debug output

More articles: