Python regex for matching single-line and multi-line comments.
I am trying to write a Python regex for PLY that will match comments of the form
// some comment
and
/* comment
more comment */
So I tried
t_COMMENT = r'//.+ | /\*.+\*/'
but this does not match multi-line comments, and when I try to fix that with the "dot matches all" option, like
t_COMMENT = r'//.+ | (?s) /\*.+\*/'
the '//' comment type then matches across many lines. The same happens if I use two separate regexes, like
t_COMMENT = r'//.+'
t_COMMENT2 = r'(?s) /\*.+\*/'
The "//" comment type still matches multiple lines, as if the "dot matches all" option applied to that rule as well.
Does anyone know how to solve this?
The regex below matches both types of comment:
(?://[^\n]*|/\*(?:(?!\*/).)*\*/)
>>> import re
>>> s = """// some comment
...
... foo
... bar
... foobar
... /* comment
... more comment */ bar"""
>>> m = re.findall(r'(?://[^\n]*|/\*(?:(?!\*/).)*\*/)', s, re.DOTALL)
>>> m
['// some comment', '/* comment\nmore comment */']
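As for why the original attempts misbehave: in Python's re module a bare inline flag such as (?s) applies to the whole pattern no matter where it is written (and Python 3.11+ rejects it anywhere but the start). PLY also joins the token regexes into one master pattern, which would explain why moving the flag into a second rule changed nothing. Here is a minimal sketch of an alternative, assuming Python 3.6 or later, that scopes the flag to one group. The names COMMENT and text are only for illustration.

```python
import re

# (?s: ... ) turns on DOTALL only inside the group, so the '.' in the
# '//' branch still stops at the end of the line.
# The lazy '.*?' stops at the first '*/' instead of the last one.
COMMENT = re.compile(r'//.*|/\*(?s:.*?)\*/')

text = "// one\nx = 1\n/* two\nthree */ y"
print(COMMENT.findall(text))
# → ['// one', '/* two\nthree */']
```

The same pattern string should work as a PLY rule (t_COMMENT = r'//.*|/\*(?s:.*?)\*/'), though I have only checked it with plain re. Note that PLY compiles rules with re.VERBOSE, so the literal spaces around '|' in the original pattern are ignored rather than matched.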
According to the PLY documentation, this can be done with "conditional lexing". It can be more readable and easier to debug than a complex regular expression. The example they give is a little more complex because it keeps track of nesting levels and of the content inside the block. Your case is simpler because you don't need any of that information.
The code for a multi-line comment should be something like this:
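# I'd prefer 'multi_line_comment', but state names cannot contain
# underscores: PLY splits rule names on '_' to find the state prefix.
states = (
    ('multiLineComment', 'exclusive'),
)

# Runs in the INITIAL state. The name must not start with the state name,
# or PLY would attach this rule to the multiLineComment state instead.
def t_begin_multiLineComment(t):
    r'/\*'
    t.lexer.begin('multiLineComment')

def t_multiLineComment_end(t):
    r'\*/'
    t.lexer.begin('INITIAL')

def t_multiLineComment_newline(t):
    r'\n'
    t.lexer.lineno += 1  # keep line numbers accurate inside comments

# catch (and ignore) anything that isn't end-of-comment; a lone '*' is
# only reached here when the end rule above did not match
def t_multiLineComment_content(t):
    r'[^*\n]+|\*'
    pass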
Of course, for // comments you will need a separate rule in the regular (INITIAL) state.
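To show the idea without needing PLY installed, here is a rough stand-alone sketch of the same two-state scanner using only re. The names RULES and strip_comments are made up for this example and are not part of PLY. It also ignores string literals, which a real lexer would have to handle.

```python
import re

# Per-state rule tables, tried in order (hypothetical, not PLY's API).
RULES = {
    'INITIAL': [
        (re.compile(r'/\*'), 'begin_block'),       # enter comment state
        (re.compile(r'//[^\n]*'), 'line_comment'),  # single-line comment
        (re.compile(r'[^/]+|/'), 'code'),           # everything else
    ],
    'comment': [
        (re.compile(r'\*/'), 'end_block'),          # leave comment state
        (re.compile(r'[^*]+|\*'), 'skip'),          # comment body
    ],
}

def strip_comments(text):
    """Return text with // and /* */ comments removed."""
    state, pos, out = 'INITIAL', 0, []
    while pos < len(text):
        # The rules in each state cover every character, so one always matches.
        for pattern, action in RULES[state]:
            m = pattern.match(text, pos)
            if m:
                break
        if action == 'begin_block':
            state = 'comment'
        elif action == 'end_block':
            state = 'INITIAL'
        elif action == 'code':
            out.append(m.group())
        pos = m.end()
    return ''.join(out)

print(repr(strip_comments("a // x\nb /* c\nd */ e")))
# → 'a \nb  e'
```

Because the '//' rule only exists in the INITIAL state and the block-comment rules only in the comment state, neither pattern can leak into the other, which is the main appeal of the state-based approach.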