Split string using list of strings as pattern
Consider the input string:
mystr = "just some stupid string to illustrate my question"
and a list of strings indicating where to split the input string:
splitters = ["some", "illustrate"]
The result should look like
result = ["just ", "some stupid string to ", "illustrate my question"]
I have written some code that implements the following approach: for each of the strings in splitters, I find its occurrences in the input string and insert a token that I know for sure won't be part of the input (like '!!'). Then I split the string on the token I just inserted:
import re

for s in splitters:
    mystr = re.sub(r'(%s)' % s, r'!!\1', mystr)
result = re.split('!!', mystr)
This solution seems ugly, is there a better way to do this?
Splitting with re.split will always remove the matched substring from the output (NB: this is not entirely true, see below). Therefore, you have to use a positive lookahead expression ((?=...)) to match without removing the match. However, re.split ignores empty matches, so using just a lookahead expression doesn't work. Instead, you will lose at least one character at every split (even trying to trick re with a "boundary" (\b) match does not work). If you don't mind losing one space / non-word character at the end of each element (assuming you always split at non-word characters), you can use something like
re.split(r"\W(?=some|illustrate)", mystr)
which would give
["just", "some stupid string to", "illustrate my question"]
(note that there are no spaces after just and to). You can then generate these regular expressions programmatically with str.join. Note that each of the splitters is escaped with re.escape, so that special characters in the elements of splitters do not affect the meaning of the regular expression in any undesirable way (imagine, for example, a ) in one of the strings, which would otherwise produce a syntax error in the regular expression):
the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
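Putting these pieces together, a minimal runnable sketch of the lookahead approach (using the question's mystr and splitters) could look like this:

```python
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# Build the lookahead pattern, escaping each splitter so that any regex
# metacharacters in it are treated literally.
the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
result = re.split(the_regex, mystr)
print(result)  # ['just', 'some stupid string to', 'illustrate my question']
```

As described above, the non-word character before each splitter is consumed by the split, so the trailing spaces are lost.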
Edit (HT to @Arkadiy): Grouping the actual match, that is, using (\W) instead of \W, returns the removed separator characters to the list as separate items. Joining every two consecutive items then produces the list as desired. You can also drop the requirement of a non-word character by using (.) instead of \W:
the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest(the_split[::2], the_split[1::2], fillvalue='')]
Since normal text and separator characters alternate, the_split[::2] contains the normal split text and the_split[1::2] the separator characters. itertools.izip_longest is then used to pair each text element with the corresponding removed separator character, pairing the last element (which has no matching separator) with the fillvalue, i.e. ''. Each of these tuples is then concatenated with "".join(x). Note that this requires importing itertools (you could of course do this with a simple loop, but itertools provides very clean solutions for these things). Also note that itertools.izip_longest is called itertools.zip_longest in Python 3.
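In Python 3 (using the zip_longest name mentioned above), the whole sequence of steps can be sketched like this:

```python
import itertools
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
# the_split alternates text pieces and the captured separator characters:
# ['just', ' ', 'some stupid string to', ' ', 'illustrate my question']
the_actual_split = ["".join(x) for x in itertools.zip_longest(
    the_split[::2], the_split[1::2], fillvalue='')]
print(the_actual_split)  # ['just ', 'some stupid string to ', 'illustrate my question']
```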
This allows a further simplification of the regex: instead of capturing an auxiliary character, the lookahead can be replaced with a plain matching group ((some|illustrate) instead of (.)(?=some|illustrate)):
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
Note that here the slice indices into the_raw_split are swapped, because now the matched splitters (the odd-numbered elements) have to be prepended to the element that follows them instead of appended to the one before. Also notice the [""] + part, which is needed to pair the first text element with "" and keep the pairing aligned.
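Again in Python 3 terms, this simplified variant can be sketched as:

```python
import itertools
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
# ['just ', 'some', ' stupid string to ', 'illustrate', ' my question']
the_actual_split = ["".join(x) for x in itertools.zip_longest(
    [""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
print(the_actual_split)  # ['just ', 'some stupid string to ', 'illustrate my question']
```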
(end of editing)
Alternatively, you can (if you like) use str.replace instead of re.sub for each splitter (I think this is a matter of preference in your case, but in general it is probably more efficient):
for s in splitters:
    mystr = mystr.replace(s, "!!" + s)
Also, since you are splitting on a fixed token, you don't need re.split but can use str.split instead:
result = mystr.split("!!")
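Put together, the replace-then-split variant runs like this (a sketch with the question's data):

```python
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# Insert the token before every splitter, then split on the token.
for s in splitters:
    mystr = mystr.replace(s, "!!" + s)
result = mystr.split("!!")
print(result)  # ['just ', 'some stupid string to ', 'illustrate my question']
```

This of course still relies on '!!' not occurring anywhere in the input.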
What you can also do (instead of relying on the replacement token not occurring anywhere in the input, or on every split position being preceded by a non-word character) is find the split strings in the input using str.find and use string slicing to extract the pieces:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Here, [i for i in (string.find(s) for s in splitters) if i > 0] generates a list of positions at which splitters can be found, for all splitters that occur in the string (i < 0 is excluded because find returns -1 for splitters that are not found), excluding the very beginning of the string (where we (possibly) just split, hence i == 0 is excluded as well). If there is anywhere left to split, we yield (this is a generator function) everything up to (but excluding) the first splitter (at min(split_positions)) and replace the string with the remaining part. If there is nowhere left, we yield the last part of the string and exit the function. Since it uses yield, it is a generator function, so you need to use list to turn the result into an actual list.
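For example (repeating the generator here so the snippet is self-contained):

```python
def split(string, splitters):
    while True:
        # Positions of all splitters still in the string, excluding position 0
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            next_split = min(split_positions)
            yield string[:next_split]
            string = string[next_split:]
        else:
            yield string
            break

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
result = list(split(mystr, splitters))
print(result)  # ['just ', 'some stupid string to ', 'illustrate my question']
```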
Note that you could also replace yield whatever with a call to some_list.append (having defined some_list earlier) and return some_list at the very end; I don't think that would be very good code style, though.
TL;DR
If you are fine with using regular expressions, use
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
otherwise, the same can be achieved with str.find and the following split function:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Not particularly elegant, but avoiding regex:
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
indexes = [0] + [mystr.index(s) for s in splitters] + [len(mystr)]
indexes = sorted(set(indexes))
print [mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])]
# ['just ', 'some stupid string to ', 'illustrate my question']
I must admit that this requires a little more work if a word in splitters occurs more than once, since str.index only finds the position of the first occurrence of the word...