Split a comma-, space-, or semicolon-separated string using regex

I am using the regex [,;\s]+ to split a comma-, space-, or semicolon-separated string. This works fine if the string doesn't have a comma at the end:

>>> import re
>>> p = re.compile(r'[,;\s]+')
>>> mystring='a,,b,c'
>>> p.split(mystring)
['a', 'b', 'c']

      

When the string has a comma at the end:

>>> mystring='a,,b,c,'
>>> p.split(mystring)
['a', 'b', 'c', '']

      

I want the output in this case to be ['a', 'b', 'c'].

Any suggestions for regex?

+3




3 answers


Try:



import re
mystring = 'a,,b,c,'  # renamed from str to avoid shadowing the built-in
re.findall(r'[^,;\s]+', mystring)
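
To see how this behaves on the inputs from the question, here is a minimal sketch. The helper name `split_fields` is made up for illustration. Because `findall` collects runs of non-delimiter characters instead of splitting on the delimiters, a leading or trailing delimiter never produces an empty string.

```python
import re

# Matches one or more characters that are NOT a comma, semicolon, or whitespace.
FIELD = re.compile(r'[^,;\s]+')

def split_fields(text):
    """Return the non-empty fields of text (hypothetical helper name)."""
    return FIELD.findall(text)

print(split_fields('a,,b,c'))      # ['a', 'b', 'c']
print(split_fields('a,,b,c,'))     # ['a', 'b', 'c']
print(split_fields(' ;a, b;c ,'))  # ['a', 'b', 'c']
print(split_fields(',;'))          # []
```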

      

+5




Here's something very low tech that should still work:

mystring='a,,b,c'
for delim in ',;':
    mystring = mystring.replace(delim, ' ')
results = mystring.split()

      



PS: While regular expressions are very useful, I strongly recommend thinking twice about whether they are the right tool for the job here. I'm not sure what the exact running time of the compiled regex is (I think no worse than O(n^2)), but it is definitely not faster than O(n), which is the running time of string.replace. So unless there is another reason why you need a regex, this solution should serve you well.
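
Rather than guessing at running times, you can measure both approaches on your own data. The sketch below is only a rough benchmark: the function names and the sample string are made up for illustration, and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

PATTERN = re.compile(r'[,;\s]+')

def split_with_replace(text):
    # Turn every delimiter into a space, then let str.split() drop empty fields.
    for delim in ',;':
        text = text.replace(delim, ' ')
    return text.split()

def split_with_regex(text):
    # re.split keeps empty strings at the edges, so filter them out.
    return [part for part in PATTERN.split(text) if part]

sample = 'a,,b;c d,' * 1000  # made-up test data

# Both approaches should agree before their speed is compared.
assert split_with_replace(sample) == split_with_regex(sample)

for func in (split_with_replace, split_with_regex):
    seconds = timeit.timeit(lambda: func(sample), number=100)
    print(f'{func.__name__}: {seconds:.4f} s for 100 runs')
```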

+7




Well, the split technically worked. In a,,b,c it splits on ,, and , leaving "a", "b" and "c". In a,,b,c, it splits on ,, then , and then the final , (because they all match the regular expression!). The strings "around" these delimiters are "a", "b", "c" and "" (between the last comma and the end of the string).
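
One way to make this visible is to wrap the pattern in a capturing group, which makes re.split return the matched delimiters along with the pieces between them. This is just an illustrative sketch; the variable names are made up.

```python
import re

# With a capturing group, re.split interleaves the matched delimiters
# with the pieces of text found between them.
p_capturing = re.compile(r'([,;\s]+)')

without_trailing = p_capturing.split('a,,b,c')
with_trailing = p_capturing.split('a,,b,c,')

print(without_trailing)  # ['a', ',,', 'b', ',', 'c']
print(with_trailing)     # ['a', ',,', 'b', ',', 'c', ',', '']
```

The final '' in the second result is the text between the last comma and the end of the string.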

There are several ways to get around this.

  • An empty string will only appear if there is a delimiter at the beginning or end of the string, so trim any of those [,;\s] characters before splitting, using str.strip:

    p.split(mystring.strip(',; \t\r\n'))
    
          

  • Remove the empty strings after splitting, using whichever method you like:

    res = p.split(mystring)
    [r for r in res if r != '']
    # another option (in Python 3, filter returns an iterator, so wrap it in list)
    list(filter(None, res))
    
          

  • Better yet, since you know an empty string can only appear as the first or last element of the split result (for input like ,a,b,c or a,b,c,), don't iterate over the whole list:

    res = p.split(mystring)
    # this one relies on booleans being treated as integers:
    # if res[0] is '' the slice is 1:X, otherwise it is 0:X,
    #  where X is len(res) if res[-1] is not '', and len(res)-1 otherwise.
    res[res[0] == '':len(res) - (res[-1] == '')]
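
The three options above can be compared side by side. This is a minimal sketch with made-up function names, assuming the same pattern p as in the question. Note one difference: when the input consists only of delimiters, the strip-based option returns [''] while the other two return an empty list.

```python
import re

p = re.compile(r'[,;\s]+')

def strip_then_split(text):
    # Option 1: trim delimiters from both ends before splitting.
    return p.split(text.strip(',; \t\r\n'))

def split_then_filter(text):
    # Option 2: split first, then drop every empty string.
    return [r for r in p.split(text) if r != '']

def split_then_slice(text):
    # Option 3: only the first and last elements can be empty,
    # so slice them off instead of scanning the whole list.
    res = p.split(text)
    return res[res[0] == '':len(res) - (res[-1] == '')]

for text in ('a,,b,c', 'a,,b,c,', ',a,b,c', ' a;b c ;'):
    results = [f(text) for f in (strip_then_split, split_then_filter, split_then_slice)]
    print(repr(text), results[0], results[0] == results[1] == results[2])

# Edge case: input made up only of delimiters.
print(strip_then_split(','))   # ['']
print(split_then_filter(','))  # []
print(split_then_slice(','))   # []
```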
    
          

+3

