Split string using list of strings as pattern
Consider the input string:
mystr = "just some stupid string to illustrate my question"
and a list of strings indicating where to split the input string:
splitters = ["some", "illustrate"]
The result should look like
result = ["just ", "some stupid string to ", "illustrate my question"]
I have written some code that implements the following approach: for each of the strings in splitters, I find its occurrences in the input string and insert a token that I know for sure won't be part of the input (like '!!'). Then I split the string on the token I just inserted:
import re

for s in splitters:
    mystr = re.sub(r'(%s)' % s, r'!!\1', mystr)
result = re.split('!!', mystr)
This solution seems ugly, is there a better way to do this?
Splitting with re.split will always remove the matched substring from the output (NB: this is not entirely true, see below). Therefore, you have to use a positive lookahead expression ((?=...)) to match without removing the match. However, re.split ignores empty matches, so using just a lookahead expression doesn't work. Instead, you will lose at least one character at every split (even trying to trick re with a "boundary" (\b) match does not work). If you don't mind losing one space / non-word character at the end of each element (assuming you always split at non-word characters), you can use something like
re.split(r"\W(?=some|illustrate)", mystr)
which would give
["just", "some stupid string to", "illustrate my question"]
(note that there are no spaces after just and to). You can then generate these regular expressions programmatically with str.join. Note that each of the splitters is escaped with re.escape, so that special characters in the elements of splitters do not affect the meaning of the regular expression in any undesirable way (imagine, for example, a ) in one of the strings, which would otherwise produce a syntax error in the regular expression):
the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
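Putting these pieces together, a minimal runnable sketch of the lookahead approach (using the question's mystr and splitters) could look like this:

```python
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# Build the lookahead pattern, escaping each splitter so that any regex
# metacharacters in it are treated literally.
the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
result = re.split(the_regex, mystr)
print(result)  # ['just', 'some stupid string to', 'illustrate my question']
```

As described above, the non-word character before each splitter is consumed by the split, so the trailing spaces are lost.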
Edit (HT to @Arkadiy): Grouping the actual match, that is, using (\W) instead of \W, returns the removed separator characters to the list as separate items. Joining every two consecutive items then produces the list as desired. You can also drop the requirement of a non-word character by using (.) instead of \W:
the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest(the_split[::2], the_split[1::2], fillvalue='')]
Since normal text and separator characters alternate, the_split[::2] contains the normal split text and the_split[1::2] the separator characters. itertools.izip_longest is then used to pair each text element with the corresponding removed separator character, pairing the last element (which has no matching separator) with the fillvalue, i.e. ''. Each of these tuples is then concatenated with "".join(x). Note that this requires importing itertools (you could of course do this with a simple loop, but itertools provides very clean solutions for these things). Also note that itertools.izip_longest is called itertools.zip_longest in Python 3.
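In Python 3 (using the zip_longest name mentioned above), the whole sequence of steps can be sketched like this:

```python
import itertools
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
# the_split alternates text pieces and the captured separator characters:
# ['just', ' ', 'some stupid string to', ' ', 'illustrate my question']
the_actual_split = ["".join(x) for x in itertools.zip_longest(
    the_split[::2], the_split[1::2], fillvalue='')]
print(the_actual_split)  # ['just ', 'some stupid string to ', 'illustrate my question']
```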
This allows a further simplification of the regex: instead of capturing an auxiliary character, the lookahead can be replaced with a plain matching group ((some|illustrate) instead of (.)(?=some|illustrate)):
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
Note that here the slice indices into the_raw_split are swapped, because now the matched splitters (the odd-numbered elements) have to be prepended to the element that follows them instead of appended to the one before. Also notice the [""] + part, which is needed to pair the first text element with "" and keep the pairing aligned.
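Again in Python 3 terms, this simplified variant can be sketched as:

```python
import itertools
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
# ['just ', 'some', ' stupid string to ', 'illustrate', ' my question']
the_actual_split = ["".join(x) for x in itertools.zip_longest(
    [""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
print(the_actual_split)  # ['just ', 'some stupid string to ', 'illustrate my question']
```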
(end of editing)
Alternatively, you can (if you like) use str.replace instead of re.sub for each splitter (I think this is a matter of preference in your case, but in general it is probably more efficient):
for s in splitters:
    mystr = mystr.replace(s, "!!" + s)
Also, since you are splitting on a fixed token, you don't need re.split but can use str.split instead:
result = mystr.split("!!")
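Put together, the replace-then-split variant runs like this (a sketch with the question's data):

```python
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# Insert the token before every splitter, then split on the token.
for s in splitters:
    mystr = mystr.replace(s, "!!" + s)
result = mystr.split("!!")
print(result)  # ['just ', 'some stupid string to ', 'illustrate my question']
```

This of course still relies on '!!' not occurring anywhere in the input.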
What you can also do (instead of relying on the replacement token not occurring anywhere in the input, or on every split position being preceded by a non-word character) is find the split strings in the input using str.find and use string slicing to extract the pieces:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Here, [i for i in (string.find(s) for s in splitters) if i > 0] generates a list of positions at which splitters can be found, for all splitters that occur in the string (i < 0 is excluded because find returns -1 for splitters that are not found), excluding the very beginning of the string (where we (possibly) just split, hence i == 0 is excluded as well). If there is anywhere left to split, we yield (this is a generator function) everything up to (but excluding) the first splitter (at min(split_positions)) and replace the string with the remaining part. If there is nowhere left, we yield the last part of the string and exit the function. Since it uses yield, it is a generator function, so you need to use list to turn the result into an actual list.
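For example (repeating the generator here so the snippet is self-contained):

```python
def split(string, splitters):
    while True:
        # Positions of all splitters still in the string, excluding position 0
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            next_split = min(split_positions)
            yield string[:next_split]
            string = string[next_split:]
        else:
            yield string
            break

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
result = list(split(mystr, splitters))
print(result)  # ['just ', 'some stupid string to ', 'illustrate my question']
```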
Note that you could also replace yield whatever with a call to some_list.append (having defined some_list earlier) and return some_list at the very end; I don't think that would be very good code style, though.
TL;DR
If you are fine with using regular expressions, use
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
otherwise, the same can be achieved with str.find and the following split function:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Not particularly elegant, but avoiding regex:
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
indexes = [0] + [mystr.index(s) for s in splitters] + [len(mystr)]
indexes = sorted(set(indexes))
print [mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])]
# ['just ', 'some stupid string to ', 'illustrate my question']
I must admit that this requires a little more work if a word in splitters occurs more than once, since str.index only finds the position of the first occurrence of the word...