Regex to extract multi-line hash comments

Let's take the following example:

  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"


From the above, I would like to extract the code comments as groups rather than line by line: comment lines that immediately follow one another belong to the same group. Comments always start with a space followed by a #.


Capture group 1: Some information about field 1\n on multiple lines
Capture group 2: Some more info on a single line


I could step over the lines and evaluate the code, but it would be nice to use a regex if possible. If you feel that regex is not the right solution for this problem, please explain why.


You can use re.findall with the following regex:

>>> import re
>>> m = re.findall(r'\s*#(.*)\s*#(.*)|#(.*)[^#]*', s, re.MULTILINE)
>>> m
[(' Some information about field 1', ' on multiple lines', ''), ('', '', ' Some more info on a single line')]


And for printing, you can join the non-empty groups of each match:

>>> for i, groups in enumerate(m):
...     print('group {}:{}'.format(i, " & ".join(g for g in groups if g)))
group 0: Some information about field 1 &  on multiple lines
group 1: Some more info on a single line


But as a more general approach that handles more than two consecutive comment lines, you can use itertools.groupby. Given this input in s:


  "data": {
    # Some information about field 1
    # on multiple lines
    # threeeeeeeeecomment
    "field1": "XXXXXXXXXX"

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
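
Group the lines by whether they are comments and keep only the comment runs:

from itertools import groupby

def is_comment(line):
    return line.strip().startswith('#')

# groupby yields one (flag, run) pair per stretch of consecutive lines with the same flag
comments = [list(run) for flag, run in groupby(s.split('\n'), is_comment) if flag]

for i, block in enumerate(comments, 1):
    cleaned = [t.strip(' #') for t in block]
    print('group {} :{}'.format(i, ' & '.join(cleaned)))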



group 1 :Some information about field 1 & on multiple lines & threeeeeeeeecomment
group 2 :Some more info on a single line
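
If a single regular expression is still preferred, one option is to match whole runs of comment lines instead of a fixed pair. The following is only a sketch under a few assumptions: the names `sample`, `COMMENT_RUN` and `comment_blocks` are made up for illustration, the input is the snippet from the question, and a comment is taken to be any line whose first non-blank character is `#`. Because the pattern is anchored to the start of a line, a `#` inside a string value is not picked up.

```python
import re

# Assumed sample input, copied from the question.
sample = '''  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
'''

# One or more whole lines whose first non-blank character is '#'.
COMMENT_RUN = re.compile(r'(?:^[ \t]*#.*(?:\n|$))+', re.MULTILINE)


def comment_blocks(text):
    """Return one string per run of consecutive comment lines."""
    blocks = []
    for match in COMMENT_RUN.finditer(text):
        # Drop indentation, the leading '#', and surrounding spaces on each line.
        lines = [line.strip().lstrip('#').strip()
                 for line in match.group().splitlines()]
        blocks.append('\n'.join(lines))
    return blocks


for number, block in enumerate(comment_blocks(sample), 1):
    print('group {}: {!r}'.format(number, block))
```

A blank line between two comment lines ends the run, which matches the grouping rule described in the question.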




Suppose you want to extract specific data, such as hashtags, from each line of a multi-line string with a single regex:

#!/usr/bin/env python
# coding: utf-8

import re

# the regexp isn't 100% accurate, but you'll get the point
# groups followed by '?' match if repeated 0 or 1 times.
regexp = re.compile(r'^.*(#[a-z]*).*(#[a-z]*)?$')

multiline_string = '''
                     The awesomeness of #MotoGP is legendary. #Bikes rock!
                     Awesome racing car #HeroComesHome epic
'''

iterable_list = multiline_string.splitlines()

for line in iterable_list:
    # Keep in mind: if the group index is out of range,
    # execution will crash with an error.
    # You can prevent it with try/except blocks.
    fragments = regexp.match(line)
    if fragments is None:
        continue  # no '#' on this line (for example, a blank line)
    frag_in_str = fragments.group(1)

    # Example to prevent a potential IndexError:
    try:
        some_other_subpattern = fragments.group(3)
    except IndexError:
        some_other_subpattern = ''

    entire_match = fragments.group(0)


Each parenthesised group can be extracted this way.

A good example of negating patterns can be found here: How to cancel a specific word in a regex?
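
When the number of hashtags per line is not known in advance, a fixed number of capture groups becomes awkward. As a hedged alternative (not the approach of the snippet above), `re.findall` can return every match on a line. The names `HASHTAG`, `text` and `tags_per_line` are illustrative, and `#\w+` is only an assumed definition of what a hashtag looks like.

```python
import re

# Assumed definition of a hashtag: '#' followed by one or more word characters.
HASHTAG = re.compile(r'#\w+')

text = '''
The awesomeness of #MotoGP is legendary. #Bikes rock!
Awesome racing car #HeroComesHome epic
'''

# One list of hashtags per non-blank line.
tags_per_line = [HASHTAG.findall(line)
                 for line in text.splitlines() if line.strip()]

for tags in tags_per_line:
    print(tags)
```

Since `findall` returns an empty list when nothing matches, no try/except is needed for lines without hashtags.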



You can use a deque to keep the previous and current lines, and add some logic to separate the comments into blocks. With this input in src:

  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    # multiple line comments
    # supported
    # as well 
    "field3": "#this would be ignored"


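from collections import deque
d = deque([''], 2)  # previous and current line; seeded so d[0] always exists
blocks = []
for line in src.splitlines():
    d.append(line.strip())
    if d[-1].startswith('#'):
        comment = line.partition('#')[2]  # text after the '#'
        if d[0].startswith('#'):
            block.append(comment)         # previous line was a comment: extend
        else:
            block = [comment]             # start a new block
    elif d[0].startswith('#'):
        blocks.append(block)              # a run of comments just ended
if d[-1].startswith('#'):
    blocks.append(block)                  # input ended inside a comment run

for i, b in enumerate(blocks):
    print('block {}: \n{}'.format(i, '\n'.join(b)))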



block 0: 
 Some information about field 1
 on multiple lines
block 1: 
 Some more info on a single line
block 2: 
 multiple line comments
 supported
 as well 
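
To make the role of the deque clearer, here is a small illustration of how a bounded deque behaves as a two-item sliding window. This is a standalone sketch; `window`, `items` and `snapshots` are names chosen for the example only.

```python
from collections import deque

# A deque with maxlen=2 silently drops its oldest item when a third is appended,
# so it always holds at most the previous and the current element.
window = deque(maxlen=2)
items = ['line 1', 'line 2', 'line 3', 'line 4']

snapshots = []
for item in items:
    window.append(item)
    snapshots.append(list(window))  # copy the window contents at this step

for snapshot in snapshots:
    print(snapshot)
```

After the first iteration the window holds a single item, which is why code that compares the first and last entries has to account for the very first line.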




This can't be done cleanly with regexes, but you can get away with a one-liner:

import re

text = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX"
    # Some information about field 1
    # on multiple lines
    # Some information about field 1
    # on multiple lines
    "field3": "#this would be ignored"
"""

# Replace every run of non-comment lines and comment prefixes with a newline,
# so that separate comment blocks end up divided by a blank line.
rex = re.compile(r"(^(?!\s*#.*?[\r\n]+)(.*?)([\r\n]+|$)|[\r\n]*^\s*#\s*)+", re.MULTILINE)
print(rex.sub("\n", text).strip().split('\n\n'))



['Some information about field 1\non multiple lines', 'Some more info on a single line', 'Some information about field 1\non multiple lines\nSome information about field 1\non multiple lines']
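
For comparison, the line-by-line approach mentioned in the question is short and arguably easier to read than the substitution trick. The following is a minimal sketch, assuming that a comment is any line whose first non-blank character is `#`; `group_comments` and `sample` are hypothetical names, and the sample is the input used above.

```python
def group_comments(text):
    """Collect consecutive '#' comment lines into newline-joined blocks."""
    blocks, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('#'):
            # Still inside a run of comments: keep the text after the '#'.
            current.append(stripped.lstrip('#').strip())
        elif current:
            # First non-comment line after a run: close the block.
            blocks.append('\n'.join(current))
            current = []
    if current:
        # The input ended while a block was still open.
        blocks.append('\n'.join(current))
    return blocks


sample = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX"
    # Some information about field 1
    # on multiple lines
    # Some information about field 1
    # on multiple lines
    "field3": "#this would be ignored"
"""

print(group_comments(sample))
```

The explicit loop also makes it easy to change the grouping rule later, for example to let blank lines continue a block.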




