Regex to extract multi-line hash comments

Question

I have writer's block and am struggling to come up with an elegant solution to this problem.

Let's take the following example:

{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}

From the above, I would like to extract the code comments as groups rather than individually. Comment lines belong to the same group when one immediately follows another. A comment always starts with whitespace followed by a #.

Result:

Capture group 1: Some information about field 1\n on multiple lines
Capture group 2: Some more info on a single line

I could step over the lines and evaluate the code, but it would be nice to use a regex if possible. If you feel that regex is not the right solution for this problem, please explain why.

SUMMARY:

Thanks, everyone, for submitting various solutions to this problem. This is a prime example of how helpful the SO community is. I will spend an hour of my time responding to other tickets to make up for the collective time spent on this.

Hopefully this thread will also help others in the future.

+3

python regex

sleepycal 06 May '15 at 20:18

4 answers

Suppose you want to pull specific data, such as hashtags, out of each line of a multi-line string with a single regex:

#!/usr/bin/env python
# coding: utf-8

import re

# The regexp isn't 100% accurate, but you'll get the point.
# A group followed by '?' is optional: it matches 0 or 1 times.
regexp = re.compile(r'^.*?(#\w+)(?:.*?(#\w+))?.*$')

multiline_string = '''
                     The awesomeness of #MotoGP is legendary. #Bikes rock!
                     Awesome racing car #HeroComesHome epic
'''

for line in multiline_string.splitlines():
    fragments = regexp.match(line)
    if fragments is None:
        # match() returns None for lines without a hashtag (e.g. blank lines)
        continue

    first_tag = fragments.group(1)

    # An optional group that did not take part in the match returns None.
    # IndexError is raised only for a group number the pattern does not define.
    second_tag = fragments.group(2) or ''

    entire_match = fragments.group(0)
    print('{} {}'.format(first_tag, second_tag))

Each group within the parentheses can be extracted this way.

For a good example of negating a pattern, see: How to negate specific word in regex?
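When the number of hashtags per line is not known in advance, numbered groups become awkward. A minimal sketch of an alternative, assuming a hashtag is simply '#' followed by word characters, is to let re.findall collect every match on the line. The sample text and the names hashtag and tags_per_line are illustrative, not part of the answer above.

```python
import re

# Hypothetical sample text, mirroring the example above.
text = """
The awesomeness of #MotoGP is legendary. #Bikes rock!
Awesome racing car #HeroComesHome epic
A line without any tags
"""

# Assumption: a hashtag is '#' followed by one or more word characters.
hashtag = re.compile(r'#\w+')

# One list of tags per non-blank line; a line without tags gives an empty list.
tags_per_line = [hashtag.findall(line)
                 for line in text.splitlines() if line.strip()]

for tags in tags_per_line:
    print(tags)
```

Because findall returns an empty list when nothing matches, no None check or try/except is needed.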

+1

SebasSBM 06 May '15 at 20:43

You can use a deque to hold the previous and current lines, and add some logic to separate the comments into blocks:

src='''\
{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",


    # multiple line comments
    # supported
    # as well 
    "field3": "#this would be ignored"

  }
}
'''

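from collections import deque

d = deque([], 2)   # holds the previous and the current stripped line
blocks = []
block = []
for line in src.splitlines():
    d.append(line.strip())
    if d[-1].startswith('#'):
        comment = line.partition('#')[2]
        if len(d) == 2 and d[0].startswith('#'):
            block.append(comment)    # continues the current comment block
        else:
            block = [comment]        # starts a new comment block
    elif d[0].startswith('#'):
        blocks.append(block)         # a non-comment line ends the block

# flush a block that runs to the very last line of the source
if d and d[-1].startswith('#'):
    blocks.append(block)

for i, b in enumerate(blocks):
    print('block {}: \n{}'.format(i, '\n'.join(b)))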

This prints:

block 0: 
 Some information about field 1
 on multiple lines
block 1: 
 Some more info on a single line
block 2: 
 multiple line comments
 supported
 as well
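The same line-by-line idea can also be sketched without a deque, by keeping only the block currently being built. This is a hedged alternative rather than the answer's own method; the function name comment_blocks and the sample text are assumptions made for illustration.

```python
def comment_blocks(text):
    """Yield each run of consecutive '#' comment lines as a list of strings."""
    block = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('#'):
            # Drop the leading '#' and the whitespace around the comment text.
            block.append(stripped[1:].strip())
        elif block:
            # Any non-comment line (including a blank one) closes the block.
            yield block
            block = []
    if block:
        # A block that runs to the last line still has to be emitted.
        yield block


# Hypothetical sample: a '#' inside a string value and a trailing comment.
sample = '''\
{
    # first
    # second
    "a": "#not a comment",

    # third
}
# trailing'''

result = list(comment_blocks(sample))
print(result)
```

Only lines whose first non-blank character is '#' are treated as comments, so a '#' inside a quoted value is ignored.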

+1

dawg 06 May '15 at 21:07

You can't do this cleanly with regexes, but you can get away with a one-liner:

import re

# named `text` rather than `str` to avoid shadowing the built-in type
text = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX"
    # Some information about field 1
    # on multiple lines
    # Some information about field 1
    # on multiple lines
    "field3": "#this would be ignored"
  }
}"""

# replace every run of non-comment lines and comment markers with a newline
rex = re.compile(r"(^(?!\s*#.*?[\r\n]+)(.*?)([\r\n]+|$)|[\r\n]*^\s*#\s*)+", re.MULTILINE)
print(rex.sub("\n", text).strip().split('\n\n'))

Outputs:

['Some information about field 1\non multiple lines', 'Some more info on a single line', 'Some information about field 1\non multiple lines\nSome information about field 1\non multiple lines']

+1

SanD 06 May '15 at 21:51

Kasramvd · Accepted Answer · 2015-05-06T20:43:52+0000

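You can use re.findall with the following regex, where s holds the text from the question. Anchoring each alternative to the start of a line keeps a # inside a string value from being matched:

>>> import re
>>> m = re.findall(r'^[ \t]*#(.*)\n[ \t]*#(.*)|^[ \t]*#(.*)', s, re.MULTILINE)
>>> m
[(' Some information about field 1', ' on multiple lines', ''), ('', '', ' Some more info on a single line')]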

To print the groups:

>>> for i, groups in enumerate(m):
...   print('group {}:{}'.format(i, " & ".join([g for g in groups if g])))
... 
group 0: Some information about field 1 &  on multiple lines
group 1: Some more info on a single line

For a more general approach that handles more than two comment lines, you can use itertools.groupby:

s="""{
  "data": {
    # Some information about field 1
    # on multiple lines
    # threeeeeeeeecomment
    "field1": "XXXXXXXXXX"

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}"""
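from itertools import groupby

def is_comment(line):
    return line.strip().startswith('#')

# groupby yields runs of consecutive lines that share the same key;
# keep only the runs whose key is True (the comment lines)
comments = [list(run) for is_c, run in groupby(s.split('\n'), is_comment) if is_c]

for i, run in enumerate(comments, 1):
    cleaned = [t.strip(' #') for t in run]
    print('group {} :{}'.format(i, ' & '.join(cleaned)))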

result:

group 1 :Some information about field 1 & on multiple lines & threeeeeeeeecomment
group 2 :Some more info on a single line
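The same generalisation to any number of consecutive comment lines can also be sketched with a regex alone. The following is a minimal sketch, assuming that a comment line is one whose first non-blank character is '#'. The names block_re, marker_re and groups are illustrative and not part of the answer above.

```python
import re

# The same sample text as above.
s = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    # threeeeeeeeecomment
    "field1": "XXXXXXXXXX"

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}"""

# One or more consecutive lines whose first non-blank character is '#'.
block_re = re.compile(r'(?:^[ \t]*#.*(?:\n|$))+', re.MULTILINE)
# The '#' marker plus the whitespace around it, at the start of each line.
marker_re = re.compile(r'^[ \t]*#[ \t]*', re.MULTILINE)

# The pattern has no capturing groups, so findall returns whole blocks.
groups = [marker_re.sub('', block).rstrip('\n')
          for block in block_re.findall(s)]

for number, group in enumerate(groups, 1):
    print('group {}: {!r}'.format(number, group))
```

The first pass finds whole comment blocks and the second strips the markers, so each element of groups is one block with its lines joined by '\n', as requested in the question.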
