Simple ANTLR preprocessor
I am trying to create a simple preprocessor in ANTLR. My grammar looks like this:
grammar simple_preprocessor;
ifdef_statement : POUND_IFDEF IDENTIFIER ;
else_statement : POUND_ELSE ;
endif_statement : POUND_ENDIF ;
preprocessor_statement :
ifdef_statement
code_block
else_statement
code_block
endif_statement
;
code_file : (preprocessor_statement | code_block)+ EOF ;
code_block : TEXT ;
POUND_IFDEF : '#IFDEF';
POUND_ELSE : '#ELSE';
POUND_ENDIF : '#ENDIF';
IDENTIFIER : ID_START ID_CONTINUE* ;
TEXT : ~[\u000C]+ ;
fragment ID_START : '_' | [A-Z] | [a-z] ;
fragment ID_CONTINUE : ID_START | [0-9] ;
WS : [ \t\r\n\u000C]+ -> channel(HIDDEN) ;
Then I parse the following input using the code_file() rule:
#IFDEF one
print "1"
#ELSE
print "2"
#ENDIF
The resulting parse tree looks like this:
(code_file (code_block \n#IFDEF one\n print "1"\n#ELSE\n print "2"\n#ENDIF\n) <EOF>)
This is not what I want: the preprocessor directives are lexed as plain text and end up inside the code_block rule.
I read the chapter "Islands in the Stream" in the ANTLR book, and the XML example makes sense, but its TEXT rule only has to exclude two specific characters:
TEXT : ~[<&]+ ;
If I really have to, I suppose I could exclude the # character:
TEXT : ~[#]+ ;
But I hope there is a better way to tell ANTLR about my preprocessor tokens so that it can distinguish them from ordinary code.
Thanks for any help.
Use lexer modes to separate the preprocessor directives from the plain-text definition of your base grammar. Use '\n#' and the next '\n' as the guards:
PStart : '\n#' -> channel(HIDDEN), pushMode(PreProc) ;
mode PreProc ;
PIFDEF : 'IFDEF' PTEXT* ;
PELSE : 'ELSE' ;
PENDIF : 'ENDIF' ;
PTEXT : [a-zA-Z0-9_-]+ ;
PEOL : [\r\n]+ -> channel(HIDDEN), popMode ;
PWS : [ \t]+ -> channel(HIDDEN) ;
// maybe PCOMMENT ?
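To show how these rules might sit next to the ordinary-text rules, here is a minimal sketch of a complete lexer. Modes are only allowed in a lexer grammar, so the rules have to move out of the combined grammar. The grammar name PreprocLexer and the TEXT and NL rules are assumptions for illustration, not part of the rules above. PEOL is narrowed to a single line ending so that it does not swallow the '\n' that guards a directive following a blank line.

```antlr
lexer grammar PreprocLexer;   // hypothetical name

// Default mode: ordinary code, one TEXT token per line.
// '\n#' is longer than a bare '\n', so PStart wins over NL by longest match.
PStart : '\n#' -> channel(HIDDEN), pushMode(PreProc) ;
TEXT   : ~[\n]+ ;
NL     : '\n' -> channel(HIDDEN) ;

mode PreProc ;
// Keywords are listed before PTEXT so they win when both match the same text.
PIFDEF : 'IFDEF' ;
PELSE  : 'ELSE' ;
PENDIF : 'ENDIF' ;
PTEXT  : [a-zA-Z0-9_-]+ ;
PEOL   : '\r'? '\n' -> channel(HIDDEN), popMode ;
PWS    : [ \t]+ -> channel(HIDDEN) ;
```

Two limitations of this sketch: a directive on the very first line of the input has no preceding '\n' and is lexed as TEXT, and a directive that immediately follows another directive line is also lexed as TEXT, because PEOL has already consumed the newline that PStart needs.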
Update: to consolidate the full text of each directive into a single token:
PIFDEF : 'IFDEF' PTEXT* PEOL -> popMode ;
PELSE : 'ELSE' PEOL -> popMode ;
PENDIF : 'ENDIF' PEOL -> popMode ;
PTEXT : [ \ta-zA-Z0-9_-]+ ;
PEOL : [\r\n] ;
This is usually not the direction you want to go; generally, you want more decomposition, not less. For example, the following might be better, although it still emits visible EOL tokens:
mode PreProc ;
PIFDEF : 'IFDEF' ;
PELSE : 'ELSE' ;
PENDIF : 'ENDIF' ;
PTEXT : [a-zA-Z0-9_-]+ ;
PEOL : '\r'? '\n' -> popMode ;
PWS : [ \t]+ -> channel(HIDDEN) ;
PCMT : '//' ~[\r\n]* -> channel(HIDDEN) ;
This way the preprocessor tokens are discrete, and the sequence of one or more PTEXT tokens carries only the directive's identifier. Emitting PEOL tokens may seem like overkill, but it is not necessarily wrong. Parser rules for demonstration:
// 'else' is a keyword in the default Java target, so that rule is named else_
preproc : ifdef | else_ | endif ;
ifdef : PIFDEF PTEXT+ PEOL ; // the rules are unambiguous
else_ : PELSE PEOL ; // even without matching the PEOLs
endif : PENDIF PEOL ;
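For completeness, here is a sketch of how the rules from the question might be adapted to these tokens. It assumes the lexer rules live in a separate lexer grammar, hypothetically named PreprocLexer, whose default mode emits one TEXT token per line of ordinary code and whose PreProc mode leaves PEOL on the default channel. The grammar names and the reshaped code_file rule are illustrative assumptions.

```antlr
parser grammar PreprocParser;             // hypothetical name
options { tokenVocab = PreprocLexer; }    // assumed lexer grammar

// Optional code before, between, and after directive blocks.
// Written this way so that a run of TEXT tokens always forms one code_block.
code_file : code_block? (preprocessor_statement code_block?)* EOF ;

preprocessor_statement
    : ifdef_statement
      code_block
      else_statement
      code_block
      endif_statement
    ;

ifdef_statement : PIFDEF PTEXT+ PEOL ;
else_statement  : PELSE PEOL ;
endif_statement : PENDIF PEOL ;

// A block of ordinary code is one or more lines of text.
code_block : TEXT+ ;
```

With this split, the directives in the sample input should reach the parser as PIFDEF, PELSE, and PENDIF tokens rather than being absorbed into a single TEXT token, which is what lets preprocessor_statement match.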