Extract items delimited by square brackets using Python regular expressions

I am trying to split a string on words / phrases enclosed in square brackets using Python regex. Each section of text that starts with an opening bracket and ends with a closing bracket should become its own element in the output, and the text in between should be kept as well.

This is what I have so far, but it doesn't work as expected:

import re
t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"
re.split(r"(\[)(.*)(\])+", t)

      

Output:

['word1 word2 3456 ',
'[',
'abc def] [ghi jkl] [1234] [-abcd',
']',
' word 2345']

      

I want the output to be something like:

['word1 word2 3456 ',
 '[abc def]',
 ' ',
 '[ghi jkl]',
 ' ',
 '[1234]',
 ' ',
 '[-abcd]',
 ' word 2345']

      

Note that only text enclosed by a matching pair of opening and closing brackets is split off.

I also tried this:

re.split(r"(\[.*\])+", t)

      

but it only splits at the first opening bracket and the last closing bracket:

['word1 word2 3456 ', '[abc def] [ghi jkl] [1234] [-abcd]', ' word 2345']

      

+3




3 answers


Use the non-greedy .+? instead of the greedy .* so that each match stops at the nearest closing bracket:

>>> re.split(r"(\[.+?\])", t)
['word1 word2 3456 ', '[abc def]', ' ', '[ghi jkl]', ' ', '[1234]', ' ', '[-abcd]', ' word 2345']
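
To make the difference concrete, here is a minimal side-by-side sketch of the greedy and non-greedy patterns on the string from the question. The variable names `greedy` and `lazy` are only illustrative.

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

# Greedy: .* runs to the last "]" in the string, so everything between
# the first "[" and the last "]" lands in a single element.
greedy = re.split(r"(\[.*\])", t)

# Non-greedy: .+? stops at the first "]" it can reach, so each bracket
# pair is matched on its own. The capture group keeps the matched
# delimiters in the result list.
lazy = re.split(r"(\[.+?\])", t)

print(greedy)
print(lazy)
```

Because .+? needs at least one character, an empty pair "[]" would not be matched on its own; a negated character class such as \[[^\]]*\] is one possible alternative if that case matters.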

      

+4




You can use this regex to split strings:

\s(?=\[)|(?<=\])\s

      


But since it splits on those spaces, it consumes them, and the output will be:



word1 word2 3456
[abc def]
[ghi jkl]
[1234]
[-abcd]
word 2345

      

So, as a workaround, you can use the above regex to replace each match with a custom token, for example ||| |||, to create something like:

word1 word2 3456||| |||[abc def]||| |||[ghi jkl]||| |||[1234]||| |||[-abcd]||| |||word 2345

      

Then you can split on the custom token ||| and the spaces are kept as separate elements:

'word1 word2 3456'
' '
'[abc def]'
' '
'[ghi jkl]'
' '
'[1234]'
' '
'[-abcd]'
' '
'word 2345'
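
The workaround can be sketched in Python roughly as follows. The name `TOKEN` is hypothetical, and the sketch assumes that the marker ||| never occurs in the input text.

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

# Hypothetical marker; assumed never to appear in the input itself.
TOKEN = "|||"

# Replace each space that sits directly before "[" or directly after "]"
# with TOKEN + space + TOKEN, so the space survives as its own piece.
marked = re.sub(r"\s(?=\[)|(?<=\])\s", TOKEN + " " + TOKEN, t)

# Splitting on the marker now yields the text, the spaces, and the
# bracketed sections as separate elements.
parts = marked.split(TOKEN)
print(parts)
```

Unlike the output requested in the question, the spaces around the bracketed sections always end up as separate elements here, so the first element has no trailing space.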

      

+3




Try this instead:

re.findall(r"[^\]\[]*|\[[^\]\[]*?\]", t)

      

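On Python versions before 3.7 this will return the list below (from 3.7 on, a non-empty match can start right after an empty one, so the matches differ):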

['word1 word2 3456 ', '', 'abc def', '', ' ', '', 'ghi jkl', '', ' ', '', '1234', '', ' ', '', '-abcd', '', ' word 2345', '']

      

To remove the empty strings, run:

list(filter(None, re.findall(r"[^\]\[]*|\[[^\]\[]*?\]", t)))

      

which returns

['word1 word2 3456 ', 
 'abc def',
 ' ',
 'ghi jkl',
 ' ',
 '1234',
 ' ',
 '-abcd',
 ' word 2345']

      

To explain the regex:

re.compile(r"""
    [^\]\[]*     # Zero or more characters that aren't [ or ]
    |            # OR
    \[           # a literal [
    [^\]\[]*?    # Zero or more characters that aren't [ or ]
    \]           # a literal ]""", re.X)
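
The empty strings appear because the first alternative can match zero characters, and in the output shown the brackets themselves are lost. One possible variant (a sketch, not the answerer's pattern) tries the bracketed alternative first and requires at least one character for the plain text, so no filtering is needed and the brackets are kept:

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

pattern = re.compile(r"""
    \[ [^\]\[]* \]   # a complete [ ... ] section with no nested brackets
    |                # OR
    [^\]\[]+         # one or more characters that aren't [ or ]
    """, re.X)

# Neither alternative can match the empty string, so findall returns
# only non-empty pieces, in order of appearance.
parts = pattern.findall(t)
print(parts)
```

A stray bracket without a partner, such as the "[" in "[xyz", matches neither alternative and is silently skipped, so this variant only suits input where that is acceptable.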

      

0

