Simple ANTLR preprocessor

I am trying to create a simple preprocessor in ANTLR. My grammar looks like this:

grammar simple_preprocessor;

ifdef_statement : POUND_IFDEF IDENTIFIER ;
else_statement : POUND_ELSE ;
endif_statement : POUND_ENDIF ;

preprocessor_statement :
    ifdef_statement
        code_block
    else_statement
        code_block
    endif_statement
    ;

code_file : (preprocessor_statement | code_block)+ EOF ;

code_block : TEXT ;

POUND_IFDEF : '#IFDEF';
POUND_ELSE : '#ELSE';
POUND_ENDIF : '#ENDIF';

IDENTIFIER : ID_START ID_CONTINUE* ;

TEXT : ~[\u000C]+ ;

fragment ID_START : '_' | [A-Z] | [a-z] ;
fragment ID_CONTINUE : ID_START | [0-9] ;

WS  :  [ \t\r\n\u000C]+ -> channel(HIDDEN) ;

      

Then I parse the following input using the code_file rule:

#IFDEF one
    print "1"
#ELSE
    print "2"
#ENDIF

      

The resulting parse tree looks like this:

(code_file (code_block \n#IFDEF one\n    print "1"\n#ELSE\n    print "2"\n#ENDIF\n) <EOF>)

      

This is not what I want: the preprocessor directives are lexed as plain text, so everything ends up under the code_block rule.

I read the chapter "Islands in the Stream" in the ANTLR book, and the XML example makes sense, but it defines TEXT by excluding two specific characters:

TEXT : ~[<&]+ ;

      

If I really have to, I suppose I could exclude the # character:

TEXT : ~[#]+ ;

      

But I hope there is a better way to tell ANTLR to recognize my preprocessor tokens so that it can distinguish them from ordinary code.

Thanks for any help.

1 answer


Use a lexer mode to separate the preprocessor directives from the plain-text definition of your base grammar. Use \n# and the following \n as the guards that enter and leave the mode.

PStart : '\n#' -> channel(HIDDEN), pushMode(PreProc) ;

mode PreProc ;

PIFDEF : 'IFDEF' PTEXT* ;
PELSE  : 'ELSE'  ;
PENDIF : 'ENDIF' ;
PTEXT  : [a-zA-Z0-9_-]+ ;
PEOL   : [\r\n]+       -> channel(HIDDEN), popMode ;
PWS    : [ \t]+        -> channel(HIDDEN) ;
// maybe PCOMMENT ?
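Lexer modes are only available in a lexer grammar, not in a combined grammar, so these rules have to live in their own file. Below is a minimal sketch of what that file could look like. The grammar name PreprocLexer and the default-mode NL and TEXT rules are assumptions for illustration, not part of the rules above, and the sketch assumes '\n' line endings in ordinary code.

```antlr
lexer grammar PreprocLexer ;   // hypothetical grammar name

// Default mode: ordinary code, one TEXT token per line.
PStart : '\n#' -> channel(HIDDEN), pushMode(PreProc) ;
NL     : '\n'  -> channel(HIDDEN) ;   // assumed rule: a newline not followed by '#'
TEXT   : ~[\n]+ ;                     // assumed rule: everything up to the next newline

mode PreProc ;

// Directive rules, copied from the block above.
PIFDEF : 'IFDEF' PTEXT* ;
PELSE  : 'ELSE'  ;
PENDIF : 'ENDIF' ;
PTEXT  : [a-zA-Z0-9_-]+ ;
PEOL   : [\r\n]+       -> channel(HIDDEN), popMode ;
PWS    : [ \t]+        -> channel(HIDDEN) ;
```

Because the lexer prefers the longest match, '\n#' wins over the bare NL whenever a line starts with '#'. Two limits of this guard are worth keeping in mind: a directive on the very first line of the file has no preceding '\n', and a directive that immediately follows another directive loses its '\n' to PEOL. In both cases the line would be lexed as TEXT.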

      

Update: to consolidate the full text of each directive into a single token:

PIFDEF : 'IFDEF' PTEXT* PEOL -> popMode ;
PELSE  : 'ELSE'  PEOL -> popMode ;
PENDIF : 'ENDIF' PEOL -> popMode ;

PTEXT  : [ \ta-zA-Z0-9_-]+ ;
PEOL   : [\r\n]  ;

      



This is usually not the direction you want to go: generally, you want more decomposition, not less. For example, the following might be better, although it still emits visible EOL tokens.

mode PreProc ;

PIFDEF : 'IFDEF' ;
PELSE  : 'ELSE'  ;
PENDIF : 'ENDIF' ;
PTEXT  : [a-zA-Z0-9_-]+ ;
PEOL   : '\r'? '\n'    -> popMode ;
PWS    : [ \t]+        -> channel(HIDDEN) ;
PCMT   : '//' ~[\r\n]* -> channel(HIDDEN) ;

      

This way the preproc tokens are discrete, and the sequence of one or more PTEXTs contains only the preproc ID. Emitting the PEOLs seems like overkill, but is not necessarily wrong. Parser rules for demonstration:

preproc : ifdef | elsedef | endif ;
ifdef   : PIFDEF PTEXT+ PEOL   ; // the rules are unambiguous
elsedef : PELSE  PEOL          ; // even without matching the PEOLs
endif   : PENDIF PEOL          ; // elsedef, not else: 'else' is a keyword in the default Java target
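To tie these rules back to the grammar in the question, they could sit in a parser grammar that reads its token types from the lexer grammar. The following is only a sketch: the grammar names PreprocParser and PreprocLexer are hypothetical, TEXT is assumed to be a default-mode lexer token holding a run of ordinary code, and the rule for #ELSE is called elsedef here because the ANTLR tool typically rejects rule names that collide with keywords of the target language (Java by default).

```antlr
parser grammar PreprocParser ;              // hypothetical grammar name
options { tokenVocab = PreprocLexer ; }     // hypothetical lexer grammar holding the token rules

code_file : (preprocessor_statement | code_block)+ EOF ;

// Same shape as the rule in the question, but allowing any number
// of code blocks in each branch.
preprocessor_statement
    : ifdef code_block* elsedef code_block* endif
    ;

ifdef   : PIFDEF PTEXT+ PEOL ;
elsedef : PELSE  PEOL ;
endif   : PENDIF PEOL ;

code_block : TEXT ;                         // assumed default-mode token
```

With this split, the parser only sees discrete directive tokens and TEXT tokens, so preprocessor_statement no longer competes with code_block for the same input.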

      
