LALR ambiguity resolution

I recently wrapped my head around LALR to write a LALR generator , and I am trying to build a java or C # -style grammar for it (which starts here ).

I know it's an effort to write a parser generator, like reinventing the wheel for example (why not use Antlr?), But my goal is to load a hobby subsystem that can compile itself, depending on a third party toolchain. My problem, though not with the generator, but with the grammar.

I am working on reducing / reducing ambiguities with operators and expressions.

I know how to resolve certain types of ambiguities such as dangling-else, but these few are not intuitive to me and they puzzled me.

What's the best way to solve these problems? Also, is there a prototyping tool that I can use to visualize the solution? Or if I go back to square and try to implement a GLR parser generator for grammar?

These statements are legal:

Generic.List<int> myVar1 = x + 4, myVar2; // stmt -> var-decl ;
                                          // var-decl -> type-name var-decl-list

t = 99;                           // simple-stmt -> assign

i++;                              // simple-stmt -> incr-decr
                                  // incr-decr -> primary-expr ++

json.deserialize<list<int>>(obj); // simple-stmt -> call
                                  // call -> primary-expr ( params )
                                  // ...  -> primary-expr . basic-name ( params )
                                  // ...  -> basic-name . basic-name ( params )

      

This is how it is configured:

basic-name : ident < type-list >
           | ident

nested-name : nested-name . basic-name
            | basic-name

basic-type : int | bool | ...

type-name : nested-name
          | basic-type

stmt : var-decl ;
     | simple-stmt ;
     | ...

var-decl : type-name var-decl-list

var-decl-list : var-decl-list , var-init
              | var-init

var-init : ident assign-op expression
         | ident

simple-stmt : assign
            | call
            | incr-decr

expr : assign-expr

assign-expr : assign
            | cond-expr

assign : unary-expr assign-op expr

...
rel-expr : rel-expr < shift-expr
         ...
         | shift-expr

...
unary-expr : unary-op primary-expr
           | primary-expr

unary-op : + - ! ~ ++ --  // Prefix operators
         | ( type-name )  // Conversion operator

primary-expr : call
             | primary-expr . basic-name
             | primary-expr ++
             | ( expr )
             ...
             | basic-name

call : primary-expr ( params )

incr-decr : primary-expr ++
          | -- primary-expr
          | ...

      

So when the parser is expecting a statement, the core of the element set * LR (k) is method-body -> { * stmts-opt }

, and the complete element given for the state looks something like this:

method-body -> { * stmts-opt }
stmts-opt -> * stmts
stmts-opt -> *
stmts -> * stmts stmt
stmt -> * var-decl ;
stmt -> * simple-stmt ;
var-decl -> * type-name var-decl-list
simple-stmt -> * assign
simple-stmt -> * call
simple-stmt -> * incr-decr
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
incr-decr -> * primary-expr ++
...

      

When the identifier is shifted, the following state is:

basic-name -> ident *
basic-name -> ident * < type-list >

      

which is disassembled or reduced, and brings the following state:

nested-name -> basic-name *
primary-expr -> basic-name *

      

Potential conflict. In the parent state, lookaheads do not help as there is a dot in nested-name

and primary-expr

. Oh, okay, let's try decreasing by nested name:

type-name -> nested-name *
nested-name -> nested-name * . basic-name

      

Can't see anything here ... Now, how about shortening to primary-expr

:

unary-expr -> primary-expr *
primary-expr -> primary-expr * . basic-name
primary-expr -> primary-expr * ++
call -> primary-expr * ( params )
incr-decr -> primary-expr * ++
...

      

Now when we shift ++ we get:

primary-expr -> primary-expr ++ *
incr-decr -> primary-expr ++ *

      

... another conflict with decreasing decreasing.

Try moving (

instead ident

:

primary-expr -> ( * expr )
unary-op -> ( * type-name )
expr -> * assign-expr
assign-expr -> * assign
assign-expr -> * cond-expr
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
cond-expr -> * ...
...
rel-expr -> * rel-expr < shift-expr
rel-expr -> * shift-expr
...
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident

      

Same problems when you push ident

or (

stack.

These are just the ones I ran into. Since basic-name

takes precedence over rel-expr

, I'm guessing something x < n

would seem to be interpreted as basic-name -> ident < type-list *

errors if it were actually a relational expression.

My brain has reached the point where I can really help.

+3


source to share


1 answer


Your post has a couple of questions which makes it not ideal for SO. But I will try to give some thoughts on each of them. As I see it, you have three problems:

  • Differentiate expression expressions from expressions that are not instructions.

  • Parsing hierarchically named types in a declaration without conflicts with field access expressions in expressions

  • The difference between using <as a comparison operator and as a template parenthesis.


1. The difference between expression expressions and expressions that are not instructions.

As I understand it, the desire is to only allow operator expressions that have (or potentially have) some side effect: assignments, incremental mutations, and subroutine calls. Roughly speaking, this corresponds to Java, whose grammar includes:

StatementExpression:
  Assignment
  PreIncrementExpression
  PreDecrementExpression
  PostIncrementExpression
  PostDecrementExpression
  MethodInvocation
  ClassInstanceCreationExpression

      

Each of the alternatives listed for StatementExpression

also appears somewhere in the derivation tree for Expression

, where they were excluded from feature lists. Here's a very succinct sample:

Expression:
  LambdaExpression
  AssignmentExpression

AssignmentExpression:
  ConditionalExpression
  Assignment

Assignment:
  LeftHandSide AssignmentOperator Expression

...

UnaryExpression:
  PreIncrementExpression
  + UnaryExpression
  UnaryExpressionNotPlusMinus

PreIncrementExpression:
  ++ UnaryExpression

UnaryExpressionNotPlusMinus:
  PostfixExpression
  ~ UnaryExpression

PostfixExpression:
  Primary
  ExpressionName
  PostIncrementExpression

PostIncrementExpress:
  PostfixExpression ++

      

Notice how the non-terminals used on the right side ExpressionStatement

have special meanings at each priority level. In a C ++ grammar that doesn't restrict which expressions can be operators, there is no need for a separate Assignment

nonterminal:

assignment-expression:
  conditional-expression
  logical-or-expression assignment-operator initializer-clause

      



But in Java, this won't work. He needed to create additional non-terminal, which does not show ConditionalExpression

exactly that Assignment

was Statement

, and AssignmentExpression

equal Expression

.

2. Analysis of hierarchically named types in a declaration without conflicts with field access expressions in expressions

As above, hierarchical names (which must start with basic-name

) from other forms of field access expressions, which can start with any (other) , must be placed here primary-expr

. The former can be type names or primary expressions; the latter can only be type names. To make this distinction, we need to split production primary-expr

:

primary-expr : field-access-expr
             | nested-name

non-field-access-expr:
               call
             | post-increment-expression  // was primary-expr ++
             | ( expr )
             ...

field-access-expr :
               non-field-access-expr
             | field-access-expr . basic-name

      

3. Differences between using <as a comparison operator and as a template parenthesis.

Unlike the other two problems, it could actually be ambiguity in the language. For example, in C ++, template brackets are uniquely ambiguous; they can be resolved by knowing (or telling) whether a particular name is a template name or not. In Java, on the other hand, ambiguity is resolved by sometimes requiring type parameters to come before the common name. For example:

ConstructorDeclarator:
  [TypeParameters] SimpleTypeName ( [FormalParameterList] )

      

or

MethodInvocation:
  Primary . [TypeArguments] Identifier ( [ArgumentList] )

      

There is another rule in C # that requires looking at the character following the >one that potentially matches the opening <. The algorithm is described in section 7.6.4.2 of the C # manual; I have no idea how you would implement this in the LALR (1) parser.

+2


source







All Articles