Find content between untagged lines

Question

I'm using Python to extract data from some old code, and the interesting content isn't between neat HTML tags but between arbitrary strings of characters, including punctuation marks and letters. Instead of getting each chunk of content, though, I get everything between the first instance of the starting delimiter and the last instance of the ending delimiter. For example:

>>> q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'

>>> start1 = '"text:"'

>>> end1 = '",body'

>>> print q[q.find(start1)+len(start1):q.rfind(end1)]
content_of_interest_1",body, code code "text:":content_of_interest_2

Instead, I want to get each content instance delimited by start1 and end1, i.e.:

content_of_interest_1, content_of_interest_2

How can I rewrite my code to get each instance of string-delimited content, rather than the whole span between the outermost delimiters as above?

+3

python search

kill3rTcell 02 june 15 at 10:35

2 answers

You can use a regex with a positive lookbehind:

import re
re.findall(r'(?<="text:"):?\w+', q)
#['content_of_interest_1', ':content_of_interest_2']

+1

styvane 02 june 15 at 10:49

Kasramvd · Accepted Answer · 2015-06-02T10:41:35+0000

To get the first substring, use q.find for end1 instead of rfind; to get the last, use rfind for both start1 and end1:

>>> q[q.find(start1)+len(start1):q.find(end1)]
'content_of_interest_1'
>>> q[q.rfind(start1)+len(start1):q.rfind(end1)]
':content_of_interest_2'
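
The two slices above only reach the first and the last occurrence. One way to walk every occurrence with str.find is to pass it a start offset and advance past each match. This is a minimal sketch, not part of the original answer; the helper name find_all_between is hypothetical:

```python
def find_all_between(text, start, end):
    """Collect every substring of text enclosed by start and end."""
    results = []
    pos = 0
    while True:
        # Look for the next start delimiter at or after pos.
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        # Look for the first end delimiter after that start delimiter.
        stop = text.find(end, begin)
        if stop == -1:
            break
        results.append(text[begin:stop])
        # Resume searching after the end delimiter just consumed.
        pos = stop + len(end)
    return results


q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
found = find_all_between(q, '"text:"', '",body')
print(found)  # ['content_of_interest_1', ':content_of_interest_2']
```

The leading colon in the second item comes from the sample string itself, where the second "text:" is followed by an extra ":".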

But a plain find call will only give you the index of the first occurrence of start1 and end1. A more suitable way to handle such tasks is simply to use a regular expression:

>>> import re
>>> re.findall(r':"(.*?)"',q)
['content_of_interest_1', ':content_of_interest_2']
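
The pattern above is tied to the sample string. If the delimiters are held in variables, as start1 and end1 are in the question, a more general version could escape them with re.escape and put a non-greedy group between them. This is a sketch under that assumption; extract_between is a hypothetical helper name, and the lstrip step is only there to drop the stray colon present in the sample data:

```python
import re


def extract_between(text, start, end):
    """Return every non-greedy match found between start and end."""
    # re.escape keeps punctuation in the delimiters from being read as regex syntax.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    return re.findall(pattern, text)


q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
chunks = extract_between(q, '"text:"', '",body')
print(chunks)  # ['content_of_interest_1', ':content_of_interest_2']

# Optional cleanup for this particular sample: remove the leading colon.
cleaned = [c.lstrip(':') for c in chunks]
print(cleaned)  # ['content_of_interest_1', 'content_of_interest_2']
```

Note that "." does not match newlines by default, so content spanning several lines would need the re.DOTALL flag passed to re.findall.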
