Find content between untagged lines

I'm using Python to extract data from some old code. The interesting content isn't between neat HTML tags but between delimiter strings made up of letters and punctuation marks. Instead of getting each chunk of content, though, I get everything between the first instance of the starting delimiter and the last instance of the ending delimiter. For example:

>>> q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
>>> start1 = '"text:"'
>>> end1 = '",body'
>>> print(q[q.find(start1)+len(start1):q.rfind(end1)])
content_of_interest_1",body, code code "text:":content_of_interest_2

      

Instead, I want to get each content instance delimited by start1 and end1, i.e.:

content_of_interest_1, content_of_interest_2

      

How can I change my code to get each string-delimited instance separately, rather than everything between the first start and the last end as above?



2 answers


For the first substring, locate end1 with q.find instead of q.rfind; for the last substring, use rfind for both delimiters:

>>> q[q.find(start1)+len(start1):q.find(end1)]
'content_of_interest_1'
>>> q[q.rfind(start1)+len(start1):q.rfind(end1)]
':content_of_interest_2'

      



But find will only give you the index of the first occurrence of start and end (and rfind only the last), so this does not extend to more than two instances. A more suitable approach for such tasks is a regular expression:

>>> import re
>>> re.findall(r':"(.*?)"',q)
['content_of_interest_1', ':content_of_interest_2']
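
If a regular expression is not wanted, find can also be called in a loop, since it accepts a start offset as its second argument. The following is only a minimal sketch of that idea; the helper name extract_between is made up for illustration and is not part of the question's code.

```python
def extract_between(text, start, end):
    """Return every substring of text that sits between start and end."""
    results = []
    pos = 0
    while True:
        # Look for the next start delimiter at or after pos.
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        # Look for the first end delimiter after that start delimiter.
        finish = text.find(end, begin)
        if finish == -1:
            break
        results.append(text[begin:finish])
        # Continue searching after the end delimiter just consumed.
        pos = finish + len(end)
    return results


q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
print(extract_between(q, '"text:"', '",body'))
# ['content_of_interest_1', ':content_of_interest_2']
```

The leading colon in the second result comes from the sample string itself, which has an extra ':' after the second "text:" delimiter.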

      



You can use a regex with a positive lookbehind:



import re
re.findall(r'(?<="text:"):?\w+', q)
# ['content_of_interest_1', ':content_of_interest_2']
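
The lookbehind above matches only word characters after the start delimiter and never checks for end1. A more general pattern, sketched below, can be built from both delimiters with re.escape and a non-greedy group. The helper name find_delimited is hypothetical, and stripping the leading colon is an assumption about what the asker wants, based on the desired output in the question.

```python
import re


def find_delimited(text, start, end):
    """Return all non-greedy matches between the literal start and end strings."""
    # re.escape makes punctuation in the delimiters match literally.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # Note: '.' does not match newlines unless re.DOTALL is passed as flags.
    return re.findall(pattern, text)


q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
matches = find_delimited(q, '"text:"', '",body')
print(matches)
# ['content_of_interest_1', ':content_of_interest_2']

# Assumption: the stray leading ':' in the sample data is unwanted.
cleaned = [m.lstrip(':') for m in matches]
print(cleaned)
# ['content_of_interest_1', 'content_of_interest_2']
```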

      
