What's wrong with this Bison grammar?

Question

What's wrong with this Bison grammar?

I am trying to build a Bison grammar and it seems to be missing something. I've kept it still very simple, but I'm getting a syntax error and can't figure out why:

Here is my Bison Code:

%{

#include <stdlib.h>
#include <stdio.h>

int yylex(void);
int yyerror(char *s);

%}

// Define the types flex could return
%union {
    long lval;
    char *sval;
}

// Define the terminal symbol token types
%token <sval> IDENT;
%token <lval> NUM;

%%

Program: 
    Def ';' 
    ;

Def: 
    IDENT '=' Lambda { printf("Successfully parsed file"); }
    ;

Lambda: 
    "fun" IDENT "->" "end"
    ;

%%

main() {
    yyparse();
    return 0;
}

int yyerror(char *s)
{
  extern int yylineno;  // defined and maintained in flex.flex
  extern char *yytext;  // defined and maintained in flex.flex

  printf("ERROR: %s at symbol \"%s\" on line %i", s, yytext, yylineno); 
  exit(2);
}

Here is my Flex code

%{
#include <stdlib.h>
#include "bison.tab.h"
%}

ID [A-Za-z][A-Za-z0-9]*
NUM [0-9][0-9]*
HEX [$][A-Fa-f0-9]+
COMM [/][/].*$

%%

fun|if|then|else|let|in|not|head|tail|and|end|isnum|islist|isfun    {
    printf("Scanning a keyword\n");
}


{ID}    {
    printf("Scanning an IDENT\n");
    yylval.sval =  strdup( yytext );
    return IDENT;
}

{NUM}   {
    printf("Scanning a NUM\n");
    /* Convert into long to loose leading zeros */
    char *ptr = NULL;
    long num = strtol(yytext, &ptr, 10);
    if( errno == ERANGE ) {
            printf("Number was to big");
            exit(1);
    }

    yylval.lval = num;
    return NUM;
}

{HEX}   {
    printf("Scanning a NUM\n");
    char *ptr = NULL;
    /* convert hex into decimal using offset 1 because of the $ */
    long num = strtol(&yytext[1], &ptr, 16);
    if( errno == ERANGE ) {
            printf("Number was to big");
            exit(1);
    }

    yylval.lval = num;
    return NUM;
}


";"|"="|"+"|"-"|"*"|"."|"<"|"="|"("|")"|"->" {
    printf("Scanning an operator\n");
}

[ \t\n]+ /* eat up whitespace */


{COMM}* /* eat up one-line comments */

.   {
    printf("Unrecognized character: %s at linenumber %d\n", yytext, yylineno );
    exit(1);
}

%%

And here's my Makefile :

all:    parser

parser: bison flex
    gcc bison.tab.c lex.yy.c -o parser -lfl

bison:  bison.y
    bison -d bison.y

flex:   flex.flex
    flex flex.flex

clean:
    rm bison.tab.h
    rm bison.tab.c
    rm lex.yy.c
    rm parser

Everything compiles just fine, I don't get any errors makenin does.

Here is my test file

f = fun x -> end;

And here's the output:

./parser < a0.0
Scanning an IDENT
Scanning an operator
Scanning a keyword
Scanning an IDENT
ERROR: syntax error at symbol "x" on line 1

since x appears to be recognized as IDENT, the rule should be correct, but I am getting a syntax error.

I feel like I am missing something important, I hope someone can help me.

Thanks in advance!

EDIT:

I tried deleting IDENT

in the Lambda rule and test file, now it seems to run through the line but still throws

ERROR: syntax error at symbol "" on line 1

after EOF.

+3

c flex-lexer bison

Jakob abfalter Apr 20 15 at 13:35

source to share

1 answer

rici · Accepted Answer · 2015-04-20T16:08:43+0000

Your scanner recognizes the keywords (and prints out the debug line, but see below), but it doesn't bother reporting everything to the parser. Therefore, they are effectively ignored.

In your bison definition file, you use (for example) "fun" as the terminal, but you do not provide the terminal with a name that can be used in the scanner. The scanner needs this name because it needs to return the token ID to the parser.

To summarize, you need something like this:

In your grammar, before %%

:

token T_FUN "fun"
token T_IF "if"
token T_THEN "then"
 /* Etc. */

In your scanner definition:

fun { return T_FUN; }
if  { return T_IF; }
then { return T_THEN; }
 /* Etc. */

A few other notes:

Your scanner's rule for recognizing operators also returns nothing, so operators will be ignored as well. This is clearly undesirable. flex and bison allow for a simpler solution for single-character operators, allowing a character to be their own marker ID. This avoids creating a marker name. In the parser, a single-cable character is a token token whose value is a character; which is very different from the double quoted string, which is an alias for the declared token name. Therefore, you can do this:
```
"=" { return '='; }
 /* Etc. */

      

        
        
        
      

    
```
but it's easier to make all single-character tokens at once:
```
[;+*.<=()-]  { return yytext[0]; }

      

        
        
        
      

    
```
and it's even easier to use the default rule at the end:
```
. { return yytext[0]; }

      

        
        
        
      

    
```
which will have the effect of handling unrecognized characters, returning an unknown token id to the parser, resulting in a syntax error.

This will not work for "->" as it is not a single character token that needs to be treated in the same way as keywords.
Flex will generate debug output automatically if you use a flag -d

when creating a scanner. It's much easier than inserting your own debug printout because you can turn it off by simply removing the option -d

. (You can use this instead %option debug

if you don't want to change the flex call in your file.) This is also better because it provides consistent information, including location information.
Some minor points:
- Sample is [0-9][0-9]*
  
  easier to write[0-9]+
- The comment pattern "//".*
  
  does not require a $
  
  lookahead at the end, as .*
  
  it will always match the longest sequence of non-newline characters; therefore, the first unmatched character must be either a new line or EOF. $
  
  lookahead will not match if template terminated with EOF, which will cause odd errors if the file ends with a comment without a newline at the end.
- There is no point in using {COMM}*
  
  it because the comment pattern does not match the newline that ends the comment, so it is impossible to match two consecutive comments. But anyway, after matching the comment and the next newline, flex will continue to match the next comment, so it {COMM}
  
  suffices. (Personally, I wouldn't use an acronym COMM
  
  , it really doesn't add anything to readability, IMHO.)

What's wrong with this Bison grammar?

More articles: