How can I find the position of a list of substrings from a string?

How can I find the position of a list of substrings from a string?

Given the line:

"A plane bound for St. Petersburg crashed in the Sinai desert in Egypt just 23 minutes after taking off from Sharm el Sheikh on Saturday."

And the substring list:

['', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', 's',' Sinai ',' desert ',' just ',' 23 ',' minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

Desired output:

>>> s = "The plane, bound for St Petersburg, crashed in Egypt Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
        (34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
        (68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
        (110, 119), (120, 122), (123, 131), (131, 132)]

      

Explanation of the output, the first substring "The" can be found by index (start, end)

using the string s

. So from the desired exit.

So, if we skip all tuples of integers from the desired result, we will return a list of substrings, i.e.

>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

      

I tried:

def find_offset(tokens, s):
    index = 0
    offsets = []
    for token in tokens:
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets

      

Is there any other way to find the position of a list of substrings from a string?

+3


source to share


3 answers


If we have no idea about substrings, there is no way other than to re-scan the entire text for each of them.

If, as it seems from the data, we know that these are sequential chunks of text, given in text order, it is easy to scan the rest of the text after each match. However, it makes no sense to cut the text every time.

def spans(text, fragments):
    result = []
    point = 0  # Where we're in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result

      



Test:

>>> spans('foo in bar', ['foo', 'in', 'bar'])
[(0, 3), (4, 6), (7, 10)]

      

This assumes that each fragment is present in the text in the right place. The output format is not an example of a mismatch. Using .find

instead .index

can help, albeit somewhat.

+1


source


First solution:

#use list comprehension and list.index function.
[tuple((s.index(e),s.index(e)+len(e))) for e in t]

      



Second solution to fix problems in first solution:

def find_offsets(tokens, s):
    tid = [list(e) for e in tokens]
    i = 0
    for id_token,token in enumerate(tid):
        while (token[0]!=s[i]):            
            i+=1
        tid[id_token] = tuple((i,i+len(token)))
        i+=len(token)

    return tid


find_offsets(tokens, s)
Out[201]: 
[(0, 3),
 (4, 9),
 (9, 10),
 (11, 16),
 (17, 20),
 (21, 23),
 (24, 34),
 (34, 35),
 (36, 43),
 (44, 46),
 (47, 52),
 (52, 54),
 (55, 60),
 (61, 67),
 (68, 72),
 (73, 75),
 (76, 83),
 (84, 89),
 (90, 98),
 (99, 103),
 (104, 109),
 (110, 119),
 (120, 122),
 (123, 131),
 (131, 132)]   

#another test
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']
find_offsets(t,s)
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]

      

+4


source


import re

s = "The plane, bound for St Petersburg, crashed in Egypt Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']


for token in tokens:
  pattern = re.compile(re.escape(token))
  print(pattern.search(s).span())

      

RESULT

(0, 3)
(4, 9)
(9, 10)
(11, 16)
(17, 20)
(21, 23)
(24, 34)
(9, 10)
(36, 43)
(44, 46)
(47, 52)
(52, 54)
(55, 60)
(61, 67)
(68, 72)
(73, 75)
(76, 83)
(84, 89)
(90, 98)
(99, 103)
(104, 109)
(110, 119)
(120, 122)
(123, 131)
(131, 132)

      

+1


source







All Articles