Python regex for matching single-line and multi-line comments.

I am trying to create a Python regex for PLY that will match comments of the form

// some comment

      

and

/* comment
   more comment */

      

So I tried

t_COMMENT = r'//.+ | /\*.+\*/'

      

but this does not allow multi-line comments, and when I try to solve this using the "dot matches all" option, like

t_COMMENT = r'//.+ | (?s) /\*.+\*/'

      

the '//' comment type then matches multiple lines as well. The same happens if I use two separate regexes, like

t_COMMENT = r'//.+' 
t_COMMENT2 = r'(?s) /\*.+\*/'

      

The "//" comment type still matches multiple lines, as if the "dot matches all" option applied to it too.

Does anyone know how to solve this?


4 answers


The regex below matches both types of comments. Compile it with re.DOTALL so that the dot in the block-comment branch can cross line breaks:

(?://[^\n]*|/\*(?:(?!\*/).)*\*/)

      



DEMO

>>> import re
>>> s = """// some comment
... 
... foo
... bar
... foobar
... /* comment
...    more comment */ bar"""
>>> m = re.findall(r'(?://[^\n]*|/\*(?:(?!\*/).)*\*/)', s, re.DOTALL)
>>> m
['// some comment', '/* comment\n   more comment */']
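As far as I know, a PLY rule string gives you no way to pass re.DOTALL for one rule only, so a spelling that needs no flag can be handy. The sketch below is a variant, not the pattern from this answer: it swaps the dot for [\s\S] (any character, newlines included) and uses a lazy quantifier instead of the lookahead. The names COMMENT and sample are made up for the illustration.

```python
import re

# Variant of the pattern above: [\s\S] matches any character including
# a newline, so no DOTALL flag has to be supplied from outside.
COMMENT = r'//[^\n]*|/\*[\s\S]*?\*/'

sample = "// some comment\nfoo\n/* comment\n   more comment */ bar"

comments = re.findall(COMMENT, sample)
print(comments)
# ['// some comment', '/* comment\n   more comment */']
```

The pattern contains no whitespace and no '#', so it should behave the same if the lexer compiles it in verbose mode.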

      


According to the PLY documentation, this can be accomplished with "conditional lexing". It can be more readable and easier to debug than a complex regular expression. The example they give is a little more complex, as it keeps track of nesting levels and of the content within the block. Your case is simpler because you don't need any of that information.

The code for a multi-line comment should be something like this:



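# I'd prefer 'multi_line_comment', but PLY splits rule names on
# underscores to find the state, so state names cannot contain them
states = (
    ('multiLineComment', 'exclusive'),
)

# This rule must live in the INITIAL state. A name starting with
# t_multiLineComment_ would attach it to the comment state instead.
def t_begin_multiLineComment(t):
    r'/\*'
    t.lexer.begin('multiLineComment')

def t_multiLineComment_end(t):
    r'\*/'
    t.lexer.begin('INITIAL')

def t_multiLineComment_newline(t):
    r'\n'
    t.lexer.lineno += 1  # keep line numbers accurate inside comments

# catch (and ignore) anything that isn't end-of-comment:
# a run of non-'*' characters, or a '*' not followed by '/'
def t_multiLineComment_content(t):
    r'[^*\n]+|\*(?!/)'
    pass

# an exclusive state does not inherit the INITIAL error rule
def t_multiLineComment_error(t):
    t.lexer.skip(1)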

      

Of course, for // comments you will need a separate rule in the regular (INITIAL) state.
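To make the state idea concrete without requiring PLY, here is a rough plain-Python imitation of what conditional lexing does: each state has its own rule list, and matching a comment delimiter switches the active list. This is only an illustration of the mechanism, not PLY's implementation; RULES, strip_comments and the rule names are all invented for the sketch.

```python
import re

# Hypothetical stand-in for lexer states: one ordered rule list per state.
RULES = {
    'INITIAL': [
        ('begin_comment', re.compile(r'/\*')),
        ('line_comment', re.compile(r'//[^\n]*')),
        ('text', re.compile(r'[^/]+|/')),
    ],
    'multiLineComment': [
        ('end_comment', re.compile(r'\*/')),
        ('content', re.compile(r'[^*]+|\*')),
    ],
}

def strip_comments(source):
    """Return source with // and /* */ comments removed."""
    state, pos, kept = 'INITIAL', 0, []
    while pos < len(source):
        # Try the rules of the current state in order; first match wins.
        for name, regex in RULES[state]:
            match = regex.match(source, pos)
            if match:
                break
        else:
            raise ValueError('no rule matched at position %d' % pos)
        if name == 'begin_comment':
            state = 'multiLineComment'
        elif name == 'end_comment':
            state = 'INITIAL'
        elif name == 'text':
            kept.append(match.group())
        # line_comment and content are matched but simply discarded
        pos = match.end()
    return ''.join(kept)

code = "a = 1; // one\nb = 2; /* two\n three */ c = a / b;"
print(repr(strip_comments(code)))
# 'a = 1; \nb = 2;  c = a / b;'
```

Because the // rule only exists in the INITIAL list, it can never swallow text inside a block comment, and neither rule needs a "dot matches all" flag.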


Here's a slight variation on Avinash's solution.

pat = re.compile(r'(?://.*?$)|(?:/\*.*?\*/)', re.M|re.S)
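A quick check of how that pattern behaves, as a sketch with a made-up input string. With re.M the $ matches at the end of each line, so the lazy .*? in the // branch stops there even though re.S lets the dot cross newlines.

```python
import re

pat = re.compile(r'(?://.*?$)|(?:/\*.*?\*/)', re.M | re.S)

text = "x = 1 // note\ny = 2 /* a\n b */ z"

found = pat.findall(text)
print(found)
# ['// note', '/* a\n b */']

# Removing the comments is a one-line substitution.
stripped = pat.sub('', text)
print(repr(stripped))
# 'x = 1 \ny = 2  z'
```

Note that a purely regex-based approach like this does not know about string literals, so a // inside a quoted string would also be treated as the start of a comment.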


This might be helpful

 (/\*(.|\n)*?\*/)|(//.*)
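A small usage sketch for this pattern, with the closing delimiter written as \*/ (the star has to be escaped, otherwise the regex does not compile). Since the pattern contains capturing groups, re.findall would return tuples, so the example collects group(0) from finditer instead. The input string is invented.

```python
import re

pat = re.compile(r'(/\*(.|\n)*?\*/)|(//.*)')

text = "a /* x\n y */ b // z\nc"

# group(0) is the whole match, regardless of which branch matched.
comments = [m.group(0) for m in pat.finditer(text)]
print(comments)
# ['/* x\n y */', '// z']
```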

      
