Find content between untagged lines
I'm using Python to extract data from some old code. The interesting content isn't between neat HTML tags but between arbitrary strings of characters, including punctuation marks and letters. Instead of getting each chunk of content, though, I get everything between the first instance of the starting delimiter and the last instance of the ending delimiter. For example:
>>> q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
>>> start1 = '"text:"'
>>> end1 = '",body'
>>> print q[q.find(start1)+len(start1):q.rfind(end1)]
content_of_interest_1",body, code code "text:":content_of_interest_2
Instead, I want to get each content instance delimited by start1 and end1, i.e.:
content_of_interest_1, content_of_interest_2
How can I rephrase my code to get each instance of string-delimited content, not all of the constrained content as above?
For the first substring, you need to use q.find instead of q.rfind to locate end1; for the last substring, use rfind for both delimiters:
>>> q[q.find(start1)+len(start1):q.find(end1)]
'content_of_interest_1'
>>> q[q.rfind(start1)+len(start1):q.rfind(end1)]
':content_of_interest_2'
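To collect every occurrence rather than only the first and last, the same find call can be repeated with a start offset. The following is a minimal sketch, not part of the original answer; the helper name extract_between is hypothetical, and it assumes the delimiters never nest or overlap.

```python
def extract_between(text, start, end):
    """Return every substring of text that lies between start and the next end."""
    results = []
    pos = 0
    while True:
        # Look for the next start delimiter at or after the current position.
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        # Look for the first end delimiter after that start delimiter.
        finish = text.find(end, begin)
        if finish == -1:
            break
        results.append(text[begin:finish])
        # Resume the search after the end delimiter just consumed.
        pos = finish + len(end)
    return results


q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
chunks = extract_between(q, '"text:"', '",body')
print(chunks)  # ['content_of_interest_1', ':content_of_interest_2']
```

The second chunk keeps its leading colon because that colon sits after the start delimiter in the sample string.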
A single call to find only gives you the index of the first occurrence of the start and end delimiters, though. A more suitable approach for tasks like this is a regular expression:
>>> import re
>>> re.findall(r':"(.*?)"', q)
['content_of_interest_1', ':content_of_interest_2']
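The pattern above is tied to the quote and colon characters in this particular string. As a hedged sketch of a more general form, the delimiters from the question can be passed through re.escape and joined around a non-greedy group; the helper name find_delimited is an assumption made for illustration.

```python
import re


def find_delimited(text, start, end):
    """Return all non-greedy matches of the text between start and end."""
    # re.escape makes sure punctuation in the delimiters is matched literally.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    return re.findall(pattern, text)


q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
matches = find_delimited(q, '"text:"', '",body')
print(matches)  # ['content_of_interest_1', ':content_of_interest_2']
```

By default the dot does not match newlines, so content spanning several lines would likely need the re.DOTALL flag passed to re.findall.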
You can use a regex with a positive lookbehind:
import re
re.findall(r'(?<="text:"):?\w+', q)
# ['content_of_interest_1', ':content_of_interest_2']
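Since \w+ stops at the first non-word character, content containing spaces or punctuation would be cut short. One possible variation, sketched here under the assumption that the delimiters are the start1 and end1 strings from the question, pairs the lookbehind with a lookahead for the end delimiter and then strips the stray leading colon.

```python
import re

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
start1 = '"text:"'
end1 = '",body'

# Lookbehind for the start delimiter, lookahead for the end delimiter.
# Both delimiters are fixed strings, so the lookbehind has a fixed width.
pattern = '(?<=' + re.escape(start1) + ').*?(?=' + re.escape(end1) + ')'
found = re.findall(pattern, q)

# Drop the leading colon that follows the second start delimiter in the sample.
cleaned = [item.lstrip(':') for item in found]
print(cleaned)  # ['content_of_interest_1', 'content_of_interest_2']
```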