How to handle nested comments in antlr lexer
How do I handle nested comments in an ANTLR4 lexer? That is, I need to count the number of "/*" inside the token and close it only after the same number of "*/" has been seen. For example, the D language has nested comments like "/+ ... +/".
For example, the following lines should be treated as one comment block:
/* comment 1
comment 2
/* comment 3
comment 4
*/
// comment 5
comment 6
*/
My current code is as follows and it doesn't work on the above nested comment:
COMMENT : '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' -> channel(HIDDEN)
;
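To make the intended behaviour concrete, here is a minimal Python sketch of the counting approach described above, written outside ANTLR. The function name `nested_comment_end` is hypothetical, and the sketch deliberately ignores `//` line comments and string literals.

```python
def nested_comment_end(text, start=0):
    """Return the index just past the '*/' that balances the '/*' at `start`.

    Returns None if no comment starts at `start` or if it is never closed.
    """
    if not text.startswith("/*", start):
        return None
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # one more opener to balance
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # one opener balanced
            i += 2
            if depth == 0:
                return i        # outermost comment is closed
        else:
            i += 1
    return None                 # ran out of input with openers pending


sample = (
    "/* comment 1\ncomment 2\n/* comment 3\ncomment 4\n*/\n"
    "// comment 5\ncomment 6\n*/"
)
# The whole sample is consumed as a single comment block.
assert nested_comment_end(sample) == len(sample)
```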
Terence Parr has these two lexer lines in the Swift Antlr4 grammar for lexing out nested comments:
COMMENT : '/*' (COMMENT|.)*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT : '//' .*? '\n' -> channel(HIDDEN) ;
I use:
COMMENT: '/*' ('/'*? COMMENT | ('/'* | '*'*) ~[/*])*? '*'*? '*/' -> skip;
This forces any /* inside a comment to be treated as the start of a nested comment, and likewise any */ as the end of one. In other words, there is no way to recognize /* and */ other than at the beginning and end of the COMMENT rule.
So something like /* /* /* */ a */ will not be recognized in its entirety as a (bad) comment (mismatched /*s and */s), as it would be with COMMENT: '/*' (COMMENT|.)*? '*/' -> skip;, but as / followed by *, followed by the correctly nested comment /* /* */ a */.
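The outcome for that input can be illustrated with a small Python model of the strict rule. This is a sketch only, not ANTLR's actual matching algorithm, and `balanced_end` and `split_tokens` are hypothetical names. It tries to match a balanced comment at each position and otherwise falls back to a single character, roughly as a lexer with separate '/' and '*' tokens would.

```python
def balanced_end(text, start):
    """Index just past a balanced /* ... */ starting at `start`, else None."""
    if not text.startswith("/*", start):
        return None
    depth, i = 0, start
    while i < len(text):
        if text.startswith("/*", i):
            depth, i = depth + 1, i + 2
        elif text.startswith("*/", i):
            depth, i = depth - 1, i + 2
            if depth == 0:
                return i
        else:
            i += 1
    return None  # unbalanced: more '/*' than '*/'


def split_tokens(text):
    """Split text into balanced comments and single characters."""
    tokens, i = [], 0
    while i < len(text):
        end = balanced_end(text, i)
        if end is None:
            end = i + 1  # no balanced comment here, emit one character
        tokens.append(text[i:end])
        i = end
    return tokens


print(split_tokens("/* /* /* */ a */"))
# ['/', '*', ' ', '/* /* */ a */']
```

The first /* has no matching */, so it falls apart into / and *, and only the balanced remainder is taken as a comment.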
Works for Antlr3.
Allows nested comments and '*' in comments.
fragment
F_MultiLineCommentTerm
:
( {LA(1) == '*' && LA(2) != '/'}? => '*'
| {LA(1) == '/' && LA(2) == '*'}? => F_MultiLineComment
| ~('*')
)*
;
fragment
F_MultiLineComment
:
'/*'
F_MultiLineCommentTerm
'*/'
;
H_MultiLineComment
: r= F_MultiLineComment
{ $channel=HIDDEN;
fprintf(stderr,"F_MultiLineComment[\%s]",$r->getText($r)->chars);
}
;
I can give you an ANTLR3 solution that you can adapt to work in ANTLR4:
I think you can use a recursive rule call. Write a non-greedy comment rule for /* ... */ that calls itself. This should allow unlimited nesting without having to count the opening and closing comment markers:
COMMENT option { greedy = false; }:
('/*' ({LA(1) == '/' && LA(2) == '*'} => COMMENT | .) .* '*/') -> channel(HIDDEN)
;
or maybe even:
COMMENT option { greedy = false; }:
('/*' .* COMMENT? .* '*/') -> channel(HIDDEN)
;
I'm not sure whether ANTLR will choose the correct path when it has to decide between any character and a nested comment. Try it.
- This will handle '/*/*/' and '/*.../*/', where the comment bodies are '/' and '.../' respectively.
- Multiline comments do not nest inside single-line comments, so you cannot start or end a multiline comment within a single-line comment.
- This is an invalid comment: '/*//*/'.
- You need a newline to end the single-line comment before '*/' can be used to end the multiline comment.
- This is a valid comment: '/*//*/\n/*/'.
- Comment body: '//*/\n/'. As you can see, the complete single-line comment is included in the multiline comment body.
- Although '/*/' can end a multiline comment, if the preceding character is '*', the comment ends at the first '/' and the remaining '*/' must end a nested comment, otherwise there is an error. The shortest path wins; the rule is not greedy!
- This is an invalid comment: '/****/*/'.
- This is a valid comment: '/*/****/*/'. The comment body is '/****/', which is itself a nested comment.
- The prefix and suffix will never be matched inside the multiline comment body.
- If you want to implement this for the D language, change the '*' to '+'.
COMMENT_NEST
: '/*'
( ('/'|'*'+)? ~[*/] | COMMENT_NEST | COMMENT_INL )*?
('/'|'*'+?)?
'*/'
;
COMMENT_INL
: '//' ( COMMENT_INL | ~[\n\r] )*
;