Split a comma-, space-, or semicolon-separated string using regex
I am using the regex [,;\s]+ to split a string whose fields are separated by commas, spaces, or semicolons. This works fine if the string doesn't have a comma at the end:
>>> import re
>>> p = re.compile(r'[,;\s]+')
>>> mystring='a,,b,c'
>>> p.split(mystring)
['a', 'b', 'c']
When the line has a comma at the end:
>>> mystring='a,,b,c,'
>>> p.split(mystring)
['a', 'b', 'c', '']
I want the output in this case to be ['a', 'b', 'c'].
Any suggestions for regex?
Here's something very low tech that should still work:
mystring = 'a,,b,c'
for delim in ',;':
    mystring = mystring.replace(delim, ' ')
results = mystring.split()
PS: While regular expressions are very useful, I strongly recommend thinking twice about whether they are the right tool for the job here. I'm not sure what the exact runtime of the compiled regex is (I think no more than O(n^2)), but it is definitely not faster than O(n), which is the runtime of string.replace. Therefore, unless there is another reason why you need a regex, you should use this solution.
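The replace-then-split approach above can be wrapped in a small helper. This is only a minimal sketch: the function name split_fields and its delims parameter are made up for illustration and are not part of the original answer.

```python
def split_fields(text, delims=',;'):
    """Split text on whitespace and on any character in delims.

    Hypothetical helper, named here only for illustration.
    """
    for delim in delims:
        # Turn every delimiter into a space so str.split() can handle it.
        text = text.replace(delim, ' ')
    # str.split() with no arguments collapses runs of whitespace and
    # drops empty fields at the start and end.
    return text.split()


print(split_fields('a,,b,c,'))  # ['a', 'b', 'c']
```

Because str.split() with no arguments never returns empty strings, a trailing or leading delimiter needs no special handling here.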
Well, the split technically worked. In a,,b,c it splits on ,, and , leaving "a", "b" and "c". In a,,b,c, it splits on ,, then , and then the last , (because they all match the regular expression!). The strings "around" these delimiters are "a", "b", "c" and "" (between the last comma and the end of the string).
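To see the behaviour described above, here is a short sketch that runs the same pattern on strings with and without a delimiter at the edges. The variable names are illustrative only.

```python
import re

p = re.compile(r'[,;\s]+')

no_edge = p.split('a,,b,c')     # delimiters only between fields
trailing = p.split('a,,b,c,')   # delimiter at the end
leading = p.split(',a,,b,c')    # delimiter at the start

print(no_edge)   # ['a', 'b', 'c']
print(trailing)  # ['a', 'b', 'c', '']
print(leading)   # ['', 'a', 'b', 'c']
```

The empty string shows up on whichever side has nothing between the delimiter and the edge of the input.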
There are several ways to get around this.
- An empty string will only appear if there is a delimiter at the beginning or end of the string, so trim any characters matching [,;\s] before splitting, using str.strip:
  p.split(mystring.strip(',; \t\r\n'))
- Remove the empty strings after splitting, using whichever method you like:
  res = p.split(mystring)
  [r for r in res if r != '']  # another option: filter(None, res)
- Better yet, since you know you will only get an empty string as the first or last part of the split result (as with ,a,b,c or a,b,c,), don't iterate over the entire list:
  res = p.split(mystring)
  # This relies on coercing booleans to numbers:
  # if res[0] is '', the slice is 1:X, otherwise 0:X,
  # where X is len(res) if res[-1] is not '', and len(res)-1 otherwise.
  res[res[0] == '':(len(res) - (res[-1] == ''))]
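The three options above can be put side by side as follows. This is a rough sketch; the function names split_strip, split_filter and split_slice are invented for the comparison and do not come from the answer.

```python
import re

p = re.compile(r'[,;\s]+')


def split_strip(s):
    # Option 1: trim delimiters from both ends, then split.
    return p.split(s.strip(',; \t\r\n'))


def split_filter(s):
    # Option 2: split first, then drop every empty string.
    return [r for r in p.split(s) if r != '']


def split_slice(s):
    # Option 3: split, then slice off an empty first and/or last item.
    # The comparisons yield booleans, which Python accepts as 0/1 indices.
    res = p.split(s)
    return res[res[0] == '':len(res) - (res[-1] == '')]


for s in ['a,,b,c', 'a,,b,c,', ',a,b,c']:
    print(split_strip(s), split_filter(s), split_slice(s))
```

All three return ['a', 'b', 'c'] for these inputs. One difference worth keeping in mind: for an input that is empty or consists only of delimiters, the strip variant returns [''] while the other two return [].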