LALR ambiguity resolution

Question

LALR ambiguity resolution

I recently wrapped my head around LALR to write a LALR generator , and I am trying to build a java or C # -style grammar for it (which starts here ).

I know it's an effort to write a parser generator, like reinventing the wheel for example (why not use Antlr?), But my goal is to load a hobby subsystem that can compile itself, depending on a third party toolchain. My problem, though not with the generator, but with the grammar.

I am working on reducing / reducing ambiguities with operators and expressions.

I know how to resolve certain types of ambiguities such as dangling-else, but these few are not intuitive to me and they puzzled me.

What's the best way to solve these problems? Also, is there a prototyping tool that I can use to visualize the solution? Or if I go back to square and try to implement a GLR parser generator for grammar?

These statements are legal:

Generic.List<int> myVar1 = x + 4, myVar2; // stmt -> var-decl ;
                                          // var-decl -> type-name var-decl-list

t = 99;                           // simple-stmt -> assign

i++;                              // simple-stmt -> incr-decr
                                  // incr-decr -> primary-expr ++

json.deserialize<list<int>>(obj); // simple-stmt -> call
                                  // call -> primary-expr ( params )
                                  // ...  -> primary-expr . basic-name ( params )
                                  // ...  -> basic-name . basic-name ( params )

This is how it is configured:

basic-name : ident < type-list >
           | ident

nested-name : nested-name . basic-name
            | basic-name

basic-type : int | bool | ...

type-name : nested-name
          | basic-type

stmt : var-decl ;
     | simple-stmt ;
     | ...

var-decl : type-name var-decl-list

var-decl-list : var-decl-list , var-init
              | var-init

var-init : ident assign-op expression
         | ident

simple-stmt : assign
            | call
            | incr-decr

expr : assign-expr

assign-expr : assign
            | cond-expr

assign : unary-expr assign-op expr

...
rel-expr : rel-expr < shift-expr
         ...
         | shift-expr

...
unary-expr : unary-op primary-expr
           | primary-expr

unary-op : + - ! ~ ++ --  // Prefix operators
         | ( type-name )  // Conversion operator

primary-expr : call
             | primary-expr . basic-name
             | primary-expr ++
             | ( expr )
             ...
             | basic-name

call : primary-expr ( params )

incr-decr : primary-expr ++
          | -- primary-expr
          | ...

So when the parser is expecting a statement, the core of the element set * LR (k) is method-body -> { * stmts-opt }

, and the complete element given for the state looks something like this:

method-body -> { * stmts-opt }
stmts-opt -> * stmts
stmts-opt -> *
stmts -> * stmts stmt
stmt -> * var-decl ;
stmt -> * simple-stmt ;
var-decl -> * type-name var-decl-list
simple-stmt -> * assign
simple-stmt -> * call
simple-stmt -> * incr-decr
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
incr-decr -> * primary-expr ++
...

When the identifier is shifted, the following state is:

basic-name -> ident *
basic-name -> ident * < type-list >

which is disassembled or reduced, and brings the following state:

nested-name -> basic-name *
primary-expr -> basic-name *

Potential conflict. In the parent state, lookaheads do not help as there is a dot in nested-name

and primary-expr

. Oh, okay, let's try decreasing by nested name:

type-name -> nested-name *
nested-name -> nested-name * . basic-name

Can't see anything here ... Now, how about shortening to primary-expr

:

unary-expr -> primary-expr *
primary-expr -> primary-expr * . basic-name
primary-expr -> primary-expr * ++
call -> primary-expr * ( params )
incr-decr -> primary-expr * ++
...

Now when we shift ++ we get:

primary-expr -> primary-expr ++ *
incr-decr -> primary-expr ++ *

... another conflict with decreasing decreasing.

Try moving (

instead ident

:

primary-expr -> ( * expr )
unary-op -> ( * type-name )
expr -> * assign-expr
assign-expr -> * assign
assign-expr -> * cond-expr
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
cond-expr -> * ...
...
rel-expr -> * rel-expr < shift-expr
rel-expr -> * shift-expr
...
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident

Same problems when you push ident

or (

stack.

These are just the ones I ran into. Since basic-name

takes precedence over rel-expr

, I'm guessing something x < n

would seem to be interpreted as basic-name -> ident < type-list *

errors if it were actually a relational expression.

My brain has reached the point where I can really help.

+3

compiler-construction parsing grammar lalr

Josh wyant 09 Aug At 23:54

source to share

1 answer

rici · Accepted Answer · 2015-08-10T20:54:20+0000

Your post has a couple of questions which makes it not ideal for SO. But I will try to give some thoughts on each of them. As I see it, you have three problems:

Differentiate expression expressions from expressions that are not instructions.
Parsing hierarchically named types in a declaration without conflicts with field access expressions in expressions
The difference between using <as a comparison operator and as a template parenthesis.

1. The difference between expression expressions and expressions that are not instructions.

As I understand it, the desire is to only allow operator expressions that have (or potentially have) some side effect: assignments, incremental mutations, and subroutine calls. Roughly speaking, this corresponds to Java, whose grammar includes:

StatementExpression:
  Assignment
  PreIncrementExpression
  PreDecrementExpression
  PostIncrementExpression
  PostDecrementExpression
  MethodInvocation
  ClassInstanceCreationExpression

Each of the alternatives listed for StatementExpression

also appears somewhere in the derivation tree for Expression

, where they were excluded from feature lists. Here's a very succinct sample:

Expression:
  LambdaExpression
  AssignmentExpression

AssignmentExpression:
  ConditionalExpression
  Assignment

Assignment:
  LeftHandSide AssignmentOperator Expression

...

UnaryExpression:
  PreIncrementExpression
  + UnaryExpression
  UnaryExpressionNotPlusMinus

PreIncrementExpression:
  ++ UnaryExpression

UnaryExpressionNotPlusMinus:
  PostfixExpression
  ~ UnaryExpression

PostfixExpression:
  Primary
  ExpressionName
  PostIncrementExpression

PostIncrementExpress:
  PostfixExpression ++

Notice how the non-terminals used on the right side ExpressionStatement

have special meanings at each priority level. In a C ++ grammar that doesn't restrict which expressions can be operators, there is no need for a separate Assignment

nonterminal:

assignment-expression:
  conditional-expression
  logical-or-expression assignment-operator initializer-clause

But in Java, this won't work. He needed to create additional non-terminal, which does not show ConditionalExpression

exactly that Assignment

was Statement

, and AssignmentExpression

equal Expression

.

2. Analysis of hierarchically named types in a declaration without conflicts with field access expressions in expressions

As above, hierarchical names (which must start with basic-name

) from other forms of field access expressions, which can start with any (other) , must be placed here primary-expr

. The former can be type names or primary expressions; the latter can only be type names. To make this distinction, we need to split production primary-expr

:

primary-expr : field-access-expr
             | nested-name

non-field-access-expr:
               call
             | post-increment-expression  // was primary-expr ++
             | ( expr )
             ...

field-access-expr :
               non-field-access-expr
             | field-access-expr . basic-name

3. Differences between using `<`as a comparison operator and as a template parenthesis.

Unlike the other two problems, it could actually be ambiguity in the language. For example, in C ++, template brackets are uniquely ambiguous; they can be resolved by knowing (or telling) whether a particular name is a template name or not. In Java, on the other hand, ambiguity is resolved by sometimes requiring type parameters to come before the common name. For example:

ConstructorDeclarator:
  [TypeParameters] SimpleTypeName ( [FormalParameterList] )

or

MethodInvocation:
  Primary . [TypeArguments] Identifier ( [ArgumentList] )

There is another rule in C # that requires looking at the character following the >one that potentially matches the opening <. The algorithm is described in section 7.6.4.2 of the C # manual; I have no idea how you would implement this in the LALR (1) parser.

LALR ambiguity resolution

1. The difference between expression expressions and expressions that are not instructions.

2. Analysis of hierarchically named types in a declaration without conflicts with field access expressions in expressions

3. Differences between using <as a comparison operator and as a template parenthesis.

More articles:

3. Differences between using `<`as a comparison operator and as a template parenthesis.