How can I find the position of a list of substrings from a string?
Given the string:
"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
And the list of substrings:
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
Desired output:
>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
(34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
(68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
(110, 119), (120, 122), (123, 131), (131, 132)]
To explain the output: the first substring, "The", can be recovered from the string s with its (start, end) index pair, here (0, 3).
So if we loop through all the tuples of integers in the desired output (call it out), we get back the list of substrings, i.e.
>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
I tried:
def find_offset(tokens, s):
    index = 0
    offsets = []
    for token in tokens:
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets
Is there any other way to find the position of a list of substrings from a string?
If we know nothing about the substrings, there is no way other than rescanning the entire text for each of them.
If, as the data suggests, we know that they are sequential fragments of the text, given in text order, it is easy to scan only the rest of the text after each match. There is no need to slice the text every time, though.
def spans(text, fragments):
    result = []
    point = 0  # Where we are in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result
Test:
>>> spans('foo in bar', ['foo', 'in', 'bar'])
[(0, 3), (4, 6), (7, 10)]
This assumes that every fragment is present in the text, in the right place. Your output format does not show what a mismatch should look like. Using .find instead of .index could help with that, though only partially.
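As a rough sketch of the .find variant: the function name spans_or_none and the choice of None as the marker for a missing fragment are my own assumptions, since the question does not say what a mismatch should produce.

```python
def spans_or_none(text, fragments):
    """Like spans(), but records None for a fragment that is not found."""
    result = []
    point = 0  # Where the next search starts.
    for fragment in fragments:
        found_start = text.find(fragment, point)  # str.find returns -1 when absent
        if found_start == -1:
            result.append(None)  # Assumed convention for a mismatch.
            continue  # Keep point where it was and try the next fragment.
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result


print(spans_or_none('foo in bar', ['foo', 'xx', 'bar']))
# [(0, 3), None, (7, 10)]
```

This only helps partially: a fragment that is missing at the expected place but occurs later in the text is still matched there, and the search then skips everything in between.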
First solution:
# Use a list comprehension and the str.index method.
[(s.index(e), s.index(e) + len(e)) for e in tokens]
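The weakness of this one-liner shows up as soon as a token occurs more than once. A small check, using the short test string from further down in this answer (the name first_try is only for illustration):

```python
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']

# str.index always returns the first match, so both 'plane' tokens
# end up with the same span.
first_try = [(s.index(e), s.index(e) + len(e)) for e in t]
print(first_try)
# [(0, 3), (4, 9), (9, 10), (4, 9)]
```

The last tuple should be (11, 16), which is why the search has to continue from the end of the previous match.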
Second solution, which fixes the problems with the first one:
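def find_offsets(tokens, s):
    tid = [list(e) for e in tokens]
    i = 0
    for id_token, token in enumerate(tid):
        # Skip ahead to the next character equal to the token's first
        # character (only the first character is compared).
        while token[0] != s[i]:
            i += 1
        tid[id_token] = (i, i + len(token))
        i += len(token)
    return tid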
find_offsets(tokens, s)
Out[201]:
[(0, 3),
(4, 9),
(9, 10),
(11, 16),
(17, 20),
(21, 23),
(24, 34),
(34, 35),
(36, 43),
(44, 46),
(47, 52),
(52, 54),
(55, 60),
(61, 67),
(68, 72),
(73, 75),
(76, 83),
(84, 89),
(90, 98),
(99, 103),
(104, 109),
(110, 119),
(120, 122),
(123, 131),
(131, 132)]
#another test
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']
find_offsets(t,s)
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]
import re
s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
for token in tokens:
    pattern = re.compile(re.escape(token))
    print(pattern.search(s).span())
RESULT
(0, 3) (4, 9) (9, 10) (11, 16) (17, 20) (21, 23) (24, 34) (9, 10) (36, 43) (44, 46) (47, 52) (52, 54) (55, 60) (61, 67) (68, 72) (73, 75) (76, 83) (84, 89) (90, 98) (99, 103) (104, 109) (110, 119) (120, 122) (123, 131) (131, 132)
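Note that the second ',' comes out as (9, 10) again, because every search starts from the beginning of s. One possible way around this, sketched here under the assumption that the tokens are given in text order, is to pass a start position to the compiled pattern's search method. The function name find_offsets_re and the ValueError for a missing token are my own choices, not part of the question.

```python
import re


def find_offsets_re(tokens, s):
    """Return (start, end) spans, searching each token after the previous one."""
    offsets = []
    pos = 0  # Each search resumes where the previous match ended.
    for token in tokens:
        # A compiled pattern's search() accepts an optional start position.
        match = re.compile(re.escape(token)).search(s, pos)
        if match is None:
            raise ValueError("token %r not found from position %d" % (token, pos))
        offsets.append(match.span())
        pos = match.end()
    return offsets


print(find_offsets_re(['The', 'plane', ',', 'plane'], 'The plane, plane'))
# [(0, 3), (4, 9), (9, 10), (11, 16)]
```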