Parsing plain text with a section in Python
I have text that looks like this:
bla bla bla
bla some on wanted text....
****************************************************************************
List of 12 base pairs
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
****************************************************************************
another unwanted text ...
another unwanted text
I would like to extract the section that starts with List of xxx base pairs
and ends at the first ***** line it comes across.
There are times when this section is not present at all. If that happens, the output should just be "NONE".
How can I do this using Python?
I tried this, but it doesn't output anything at all:
import sys
import re
import csv

def main():
    """docstring for main"""
    infile = "myfile.txt"
    if len(sys.argv) > 1:
        infile = sys.argv[1]
    regex = re.compile(r"""List of (\d+) base pairs$""",re.VERBOSE)
    with open(infile, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')
        for row in tabreader:
            if row:
                line = row[0]
                match = regex.match(line)
                if match:
                    print(line)

if __name__ == '__main__':
    main()
At the end of the code, I was hoping it would just print this:
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
Or simply
NONE
[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})
Try this regex with re.findall. See the demo:
https://regex101.com/r/eZ0yP4/20
import re
p = re.compile(r'[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})')
test_str = " bla bla bla \n bla some on wanted text....\n\n****************************************************************************\nList of 12 base pairs\n nt1 nt2 bp name Saenger LW DSSR\n 1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W\n 2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W\n 3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W\n\n****************************************************************************\nanother unwanted text ...\nanother unwanted text "
print(re.findall(p, test_str))
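re.findall returns a list with one string per captured section, so the "NONE" case from the question can be handled by checking whether that list is empty. A minimal sketch, assuming the same pattern; the helper name extract_base_pairs and the shortened sample text are made up for illustration:

```python
import re

# Same pattern as above: capture everything between the "List of N base pairs"
# header and the next run of five or more asterisks.
PATTERN = re.compile(r'[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})')


def extract_base_pairs(text):
    """Return the first captured section, or 'NONE' if the header is absent.

    Hypothetical helper, not part of the original answer.
    """
    sections = PATTERN.findall(text)
    return sections[0] if sections else 'NONE'


# Shortened, made-up sample in the same shape as the question's input.
sample = (
    "unwanted\n"
    "*****\n"
    "List of 2 base pairs\n"
    "nt1 nt2 bp\n"
    "1 Q.C0 Q.G22 C-G\n"
    "*****\n"
    "more unwanted\n"
)

print(extract_base_pairs(sample))
print(extract_base_pairs("no such section here\n"))
```

Only the first section is returned here; if a file could contain several such lists, the full list from findall would be the thing to keep.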
There are several problems. The regex is too strict. The loop does not treat the regex match as a starting point. And nothing stops the loop at the ending ******* line.
Here's some working code to get you started:
import re

text = '''
bla bla bla
bla some on wanted text....
****************************************************************************
List of 12 base pairs
nt1 nt2 bp name Saenger LW DSSR
1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W
2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W
3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W
****************************************************************************
another unwanted text ...
another unwanted text
'''

regex = re.compile(r"List of (\d+) base pairs")
started = False
for line in text.splitlines():
    if started:
        # Stop at the first line of asterisks after the header
        if line.startswith('*******'):
            break
        print(line)
    elif regex.search(line):
        # Header found: start printing from the next line
        started = True
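The loop above prints nothing when the header is missing, whereas the question asks for "NONE" in that case. One way to cover it is to collect the lines instead of printing them right away. This is only a sketch: the function name extract_section is hypothetical, and io.StringIO stands in for an open file such as myfile.txt.

```python
import io
import re

HEADER = re.compile(r"List of (\d+) base pairs")


def extract_section(lines):
    """Return the lines between the header and the next asterisk line.

    Returns 'NONE' if the header never appears. Hypothetical helper name.
    """
    collected = None  # stays None until the header line is seen
    for line in lines:
        line = line.rstrip("\n")
        if collected is not None:
            if line.startswith("*****"):
                break
            collected.append(line)
        elif HEADER.search(line):
            collected = []
    if collected is None:
        return "NONE"
    return "\n".join(collected)


# io.StringIO stands in for an open file here; the contents are made up.
with_section = io.StringIO(
    "bla\n*****\nList of 2 base pairs\nnt1 nt2\n1 Q.C0 Q.G22\n*****\nrest\n"
)
without_section = io.StringIO("nothing relevant\n")

print(extract_section(with_section))
print(extract_section(without_section))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly and the file is read one line at a time.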
You can use the MULTILINE and DOTALL flags of the re module.
#!/usr/bin/python
import re

with open('myfile.txt', 'r') as infile:
    f = infile.read()

pat = re.compile(r"""
    List\ of\ \d+\ base\ pairs$   # The start of the match
    (.*?)                         # Note ? to make it nongreedy
    ^[*]+$                        # The ending line
    """, re.MULTILINE | re.DOTALL | re.VERBOSE)
mat = pat.search(f)
if mat:
    print(mat.group(1).strip())
else:
    print('NONE')
Notes:
- You need ? after .* to make it non-greedy in case there are multiple lines of stars in the file.
- The spaces in the pattern must be escaped (List\ of\ ...) because re.VERBOSE is used. Otherwise those spaces are ignored and no match will be found!
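Both notes can be checked on a small input. The sketch below uses a made-up text with two lines of stars and compiles three variants of the pattern on one line each (the variable names are illustrative only):

```python
import re

# Made-up input with two lines of asterisks after the header.
text = (
    "List of 2 base pairs\n"
    "row 1\n"
    "*****\n"
    "other text\n"
    "*****\n"
)
flags = re.MULTILINE | re.DOTALL | re.VERBOSE

# Non-greedy: stops at the first line of asterisks.
lazy = re.compile(r"List\ of\ \d+\ base\ pairs$ (.*?) ^[*]+$", flags)
# Greedy: runs on to the last line of asterisks.
greedy = re.compile(r"List\ of\ \d+\ base\ pairs$ (.*) ^[*]+$", flags)
# Unescaped spaces are dropped by re.VERBOSE, so this looks for "Listof...".
unescaped = re.compile(r"List of \d+ base pairs$ (.*?) ^[*]+$", flags)

print(repr(lazy.search(text).group(1).strip()))    # 'row 1'
print(repr(greedy.search(text).group(1).strip()))  # 'row 1\n*****\nother text'
print(unescaped.search(text))                      # None
```

The third case is also the likely reason the code in the question prints nothing: it compiles an unescaped pattern with re.VERBOSE.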