Extract items separated by square brackets using Python regular expressions
I am trying to split a string with Python regex. Each section of text that starts and ends with square brackets should become its own element, and the text between those sections should be kept as separate elements.
This is what I have so far, but it doesn't work as expected:
import re
t="word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"
re.split(r"(\[)(.*)(\])+", t)
Output:
['word1 word2 3456 ',
'[',
'abc def] [ghi jkl] [1234] [-abcd',
']',
' word [xyz 2345']
I want the output to be something like:
['word1 word2 3456 ',
'[abc def]',
' ',
'[ghi jkl]',
' ',
'[1234]',
' ',
'[-abcd]',
' word [xyz 2345']
Note that only text enclosed by a matching pair of open and close brackets should be split off.
I also tried this:
re.split(r"(\[.*\])+", t)
but it only splits at the first opening bracket and the last closing bracket:
['word1 word2 3456 ', '[abc def] [ghi jkl] [1234] [-abcd]', ' word [xyz 2345']
You can use this regex to split strings:
\s(?=\[)|(?<=\])\s
But since it splits on those spaces, it consumes them, and the resulting output will be:
word1 word2 3456
[abc def]
[ghi jkl]
[1234]
[-abcd]
word 2345
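In Python, that split can be reproduced roughly like this. It is a minimal sketch using the string from the question, and the variable name parts is only illustrative:

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

# Split on a whitespace character that is followed by "[" or preceded by "]".
# The lookarounds keep the brackets themselves out of the match.
parts = re.split(r"\s(?=\[)|(?<=\])\s", t)
print(parts)
```

The separating spaces are part of each match, which is why they disappear from the result.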
So, as a workaround, you can use the above regex to replace each match with a space wrapped in a custom token, for example "||| |||", to create something like:
word1 word2 3456||| |||[abc def]||| |||[ghi jkl]||| |||[1234]||| |||[-abcd]||| |||word 2345
Then you can split on your custom token ||| and the spaces are kept as separate elements:
'word1 word2 3456'
' '
'[abc def]'
' '
'[ghi jkl]'
' '
'[1234]'
' '
'[-abcd]'
' '
'word 2345'
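A minimal Python sketch of this workaround, assuming the marker ||| never occurs in the input. TOKEN, marked and parts are illustrative names, not anything from the question:

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

# Hypothetical marker; assumed never to appear in the text being split.
TOKEN = "|||"
SEPARATOR = r"\s(?=\[)|(?<=\])\s"

# Wrap every separating whitespace character in markers, keeping the
# character itself, then split on the marker.
marked = re.sub(SEPARATOR, lambda m: TOKEN + m.group(0) + TOKEN, t)
parts = marked.split(TOKEN)
print(parts)

# The same list without a marker: a capturing group makes re.split
# keep the separators in its result.
kept = re.split("(" + SEPARATOR + ")", t)
print(kept)
```

The capturing-group form avoids having to pick a marker that is guaranteed not to appear in the data.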
Try this instead:
re.findall(r"[^\]\[]*|\[[^\]\[]*?\]", t)
On Python 3.7 and later, this will return
['word1 word2 3456 ', '', '[abc def]', ' ', '', '[ghi jkl]', ' ', '', '[1234]', ' ', '', '[-abcd]', ' word 2345', '']
(Before Python 3.7, findall stepped over the bracket after each empty match, so the items came back without their brackets.)
To remove the empty strings, run:
list(filter(None, re.findall(r"[^\]\[]*|\[[^\]\[]*?\]", t)))
which returns
['word1 word2 3456 ',
'[abc def]',
' ',
'[ghi jkl]',
' ',
'[1234]',
' ',
'[-abcd]',
' word 2345']
To explain the regex:
re.compile(r"""
[^\]\[]* # Zero or more characters that aren't [ or ]
| # OR
\[ # a literal [
[^\]\[]*? # Zero or more characters that aren't [ or ]
\] # a literal ]""", re.X)
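Two variations on this idea avoid the empty strings altogether. This is a sketch, not the pattern above: the first puts the bracketed alternative first and requires at least one character in the other alternative, and the second uses re.split with a capturing group. The string t_stray is a hypothetical input with an unmatched "[", included only to show how the two differ.

```python
import re

t = "word1 word2 3456 [abc def] [ghi jkl] [1234] [-abcd] word 2345"

# Variant 1: try the bracketed alternative first and require at least one
# character in the other alternative, so no empty matches are produced.
found = re.findall(r"\[[^\]\[]*\]|[^\]\[]+", t)

# Variant 2: split on bracketed groups; the capturing parentheses make
# re.split keep each group in the result.
pieces = re.split(r"(\[[^\]\[]*\])", t)

# Hypothetical input with an unmatched "[" (an assumption for illustration).
t_stray = "3456 [1234] [-abcd] word [xyz 2345"
stray_pieces = re.split(r"(\[[^\]\[]*\])", t_stray)

print(found)
print(pieces)
print(stray_pieces)
```

With re.split, an unmatched bracket simply stays inside the surrounding text, whereas the findall variant silently drops it. The split variant can still produce empty strings when the text starts or ends with a bracketed group, or when two groups are directly adjacent.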