LALR ambiguity resolution
I recently wrapped my head around LALR to write a LALR generator , and I am trying to build a java or C # -style grammar for it (which starts here ).
I know it's an effort to write a parser generator, like reinventing the wheel for example (why not use Antlr?), But my goal is to load a hobby subsystem that can compile itself, depending on a third party toolchain. My problem, though not with the generator, but with the grammar.
I am working on reducing / reducing ambiguities with operators and expressions.
I know how to resolve certain types of ambiguities such as dangling-else, but these few are not intuitive to me and they puzzled me.
What's the best way to solve these problems? Also, is there a prototyping tool that I can use to visualize the solution? Or if I go back to square and try to implement a GLR parser generator for grammar?
These statements are legal:
Generic.List<int> myVar1 = x + 4, myVar2; // stmt -> var-decl ;
// var-decl -> type-name var-decl-list
t = 99; // simple-stmt -> assign
i++; // simple-stmt -> incr-decr
// incr-decr -> primary-expr ++
json.deserialize<list<int>>(obj); // simple-stmt -> call
// call -> primary-expr ( params )
// ... -> primary-expr . basic-name ( params )
// ... -> basic-name . basic-name ( params )
This is how it is configured:
basic-name : ident < type-list >
| ident
nested-name : nested-name . basic-name
| basic-name
basic-type : int | bool | ...
type-name : nested-name
| basic-type
stmt : var-decl ;
| simple-stmt ;
| ...
var-decl : type-name var-decl-list
var-decl-list : var-decl-list , var-init
| var-init
var-init : ident assign-op expression
| ident
simple-stmt : assign
| call
| incr-decr
expr : assign-expr
assign-expr : assign
| cond-expr
assign : unary-expr assign-op expr
...
rel-expr : rel-expr < shift-expr
...
| shift-expr
...
unary-expr : unary-op primary-expr
| primary-expr
unary-op : + - ! ~ ++ -- // Prefix operators
| ( type-name ) // Conversion operator
primary-expr : call
| primary-expr . basic-name
| primary-expr ++
| ( expr )
...
| basic-name
call : primary-expr ( params )
incr-decr : primary-expr ++
| -- primary-expr
| ...
So when the parser is expecting a statement, the core of the element set * LR (k) is method-body -> { * stmts-opt }
, and the complete element given for the state looks something like this:
method-body -> { * stmts-opt }
stmts-opt -> * stmts
stmts-opt -> *
stmts -> * stmts stmt
stmt -> * var-decl ;
stmt -> * simple-stmt ;
var-decl -> * type-name var-decl-list
simple-stmt -> * assign
simple-stmt -> * call
simple-stmt -> * incr-decr
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
incr-decr -> * primary-expr ++
...
When the identifier is shifted, the following state is:
basic-name -> ident *
basic-name -> ident * < type-list >
which is disassembled or reduced, and brings the following state:
nested-name -> basic-name * primary-expr -> basic-name *
Potential conflict. In the parent state, lookaheads do not help as there is a dot in nested-name
and primary-expr
. Oh, okay, let's try decreasing by nested name:
type-name -> nested-name *
nested-name -> nested-name * . basic-name
Can't see anything here ... Now, how about shortening to primary-expr
:
unary-expr -> primary-expr *
primary-expr -> primary-expr * . basic-name
primary-expr -> primary-expr * ++
call -> primary-expr * ( params )
incr-decr -> primary-expr * ++
...
Now when we shift ++ we get:
primary-expr -> primary-expr ++ * incr-decr -> primary-expr ++ *
... another conflict with decreasing decreasing.
Try moving (
instead ident
:
primary-expr -> ( * expr )
unary-op -> ( * type-name )
expr -> * assign-expr
assign-expr -> * assign
assign-expr -> * cond-expr
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
cond-expr -> * ...
...
rel-expr -> * rel-expr < shift-expr
rel-expr -> * shift-expr
...
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
Same problems when you push ident
or (
stack.
These are just the ones I ran into. Since basic-name
takes precedence over rel-expr
, I'm guessing something x < n
would seem to be interpreted as basic-name -> ident < type-list *
errors if it were actually a relational expression.
My brain has reached the point where I can really help.
source to share
Your post has a couple of questions which makes it not ideal for SO. But I will try to give some thoughts on each of them. As I see it, you have three problems:
-
Differentiate expression expressions from expressions that are not instructions.
-
Parsing hierarchically named types in a declaration without conflicts with field access expressions in expressions
-
The difference between using <as a comparison operator and as a template parenthesis.
1. The difference between expression expressions and expressions that are not instructions.
As I understand it, the desire is to only allow operator expressions that have (or potentially have) some side effect: assignments, incremental mutations, and subroutine calls. Roughly speaking, this corresponds to Java, whose grammar includes:
StatementExpression:
Assignment
PreIncrementExpression
PreDecrementExpression
PostIncrementExpression
PostDecrementExpression
MethodInvocation
ClassInstanceCreationExpression
Each of the alternatives listed for StatementExpression
also appears somewhere in the derivation tree for Expression
, where they were excluded from feature lists. Here's a very succinct sample:
Expression:
LambdaExpression
AssignmentExpression
AssignmentExpression:
ConditionalExpression
Assignment
Assignment:
LeftHandSide AssignmentOperator Expression
...
UnaryExpression:
PreIncrementExpression
+ UnaryExpression
UnaryExpressionNotPlusMinus
PreIncrementExpression:
++ UnaryExpression
UnaryExpressionNotPlusMinus:
PostfixExpression
~ UnaryExpression
PostfixExpression:
Primary
ExpressionName
PostIncrementExpression
PostIncrementExpress:
PostfixExpression ++
Notice how the non-terminals used on the right side ExpressionStatement
have special meanings at each priority level. In a C ++ grammar that doesn't restrict which expressions can be operators, there is no need for a separate Assignment
nonterminal:
assignment-expression:
conditional-expression
logical-or-expression assignment-operator initializer-clause
But in Java, this won't work. He needed to create additional non-terminal, which does not show ConditionalExpression
exactly that Assignment
was Statement
, and AssignmentExpression
equal Expression
.
2. Analysis of hierarchically named types in a declaration without conflicts with field access expressions in expressions
As above, hierarchical names (which must start with basic-name
) from other forms of field access expressions, which can start with any (other) , must be placed here primary-expr
. The former can be type names or primary expressions; the latter can only be type names. To make this distinction, we need to split production primary-expr
:
primary-expr : field-access-expr
| nested-name
non-field-access-expr:
call
| post-increment-expression // was primary-expr ++
| ( expr )
...
field-access-expr :
non-field-access-expr
| field-access-expr . basic-name
3. Differences between using <as a comparison operator and as a template parenthesis.
Unlike the other two problems, it could actually be ambiguity in the language. For example, in C ++, template brackets are uniquely ambiguous; they can be resolved by knowing (or telling) whether a particular name is a template name or not. In Java, on the other hand, ambiguity is resolved by sometimes requiring type parameters to come before the common name. For example:
ConstructorDeclarator:
[TypeParameters] SimpleTypeName ( [FormalParameterList] )
or
MethodInvocation:
Primary . [TypeArguments] Identifier ( [ArgumentList] )
There is another rule in C # that requires looking at the character following the >one that potentially matches the opening <. The algorithm is described in section 7.6.4.2 of the C # manual; I have no idea how you would implement this in the LALR (1) parser.
source to share