Parsing a section out of plain text in Python

I have text that looks like this:

    bla bla bla 
    bla some on wanted text....

****************************************************************************
List of 12 base pairs
      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

****************************************************************************
another unwanted text ...
another unwanted text 

      

I would like to extract the section that starts with List of xxx base pairs and ends at the first ***** line that follows it.

Sometimes this section is missing entirely. In that case the output should just be "NONE".

How can I do this using Python?

I tried the following, but it produces no output at all.

import csv
import re
import sys

def main():
    """docstring for main"""
    infile = "myfile.txt"
    if len(sys.argv) > 1:
        infile = sys.argv[1]

    regex = re.compile(r"""List of (\d+) base pairs$""",re.VERBOSE)

    with open(infile, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')

        for row in tabreader:
            if row:
                line = row[0]
                match = regex.match(line)
                if match:
                    print line



if __name__ == '__main__':
    main()

      

At the end of the code, I was hoping it would just print this:

      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

      

Or simply

NONE

      



4 answers


[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})

      

Try this regex with re.findall. See the demo:



https://regex101.com/r/eZ0yP4/20

import re
p = re.compile(r'[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})')
test_str = " bla bla bla \n bla some on wanted text....\n\n****************************************************************************\nList of 12 base pairs\n nt1 nt2 bp name Saenger LW DSSR\n 1 Q.C0 Q.G22 C-G WC 19-XIX cWW cW-W\n 2 Q.C1 Q.G21 C-G WC 19-XIX cWW cW-W\n 3 Q.U2 Q.A20 U-A WC 20-XX cWW cW-W\n\n****************************************************************************\nanother unwanted text ...\nanother unwanted text "

print(re.findall(p, test_str))
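
To cover the NONE case from the question, the result of re.findall can be checked before printing. This is a minimal sketch in Python 3 syntax that reuses the pattern above; the helper name extract_section and the shortened sample strings are made up for illustration and are not part of the answer.

```python
import re

# Same pattern as in the answer above
PATTERN = re.compile(r'[ ]*List of \d+ base pairs\n*([\s\S]*?)(?=\n*\*{5,})')

def extract_section(text):
    """Return the first captured section, or 'NONE' if nothing matched."""
    matches = PATTERN.findall(text)
    return matches[0] if matches else 'NONE'

# Shortened, hypothetical inputs: one with the section, one without
with_section = (
    "junk\n*****\nList of 2 base pairs\n"
    "   1 Q.C0 Q.G22\n   2 Q.C1 Q.G21\n\n*****\nmore junk\n"
)
without_section = "junk\n*****\nmore junk\n"

print(extract_section(with_section))
print(extract_section(without_section))
```

Note that this also returns NONE when the header is present but no line of asterisks follows it, because the lookahead then fails.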

      



"At the end of the code, I was hoping it would just print this:"

There are several problems. The regex is too strict. The loop does not treat the regex match as a starting point. And nothing stops the loop at the ending line of asterisks (*******).
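
One likely reason the original pattern never matches can be shown with a small sketch (Python 3 syntax; the sample line is the header from the question). With re.VERBOSE, unescaped whitespace inside the pattern is ignored, so the regex effectively looks for Listof12basepairs.

```python
import re

line = "List of 12 base pairs"

# The pattern from the question: re.VERBOSE drops the unescaped spaces,
# so it behaves like r"Listof(\d+)basepairs$".
verbose_regex = re.compile(r"""List of (\d+) base pairs$""", re.VERBOSE)

# The same text without re.VERBOSE keeps its spaces.
plain_regex = re.compile(r"List of (\d+) base pairs")

print(verbose_regex.match(line))          # None
print(plain_regex.search(line).group(1))  # 12
```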



Here's some working code to get you started:

import re

text = '''
    bla bla bla 
    bla some on wanted text....

****************************************************************************
List of 12 base pairs
      nt1              nt2             bp  name         Saenger     LW  DSSR
   1 Q.C0             Q.G22            C-G WC           19-XIX     cWW  cW-W
   2 Q.C1             Q.G21            C-G WC           19-XIX     cWW  cW-W
   3 Q.U2             Q.A20            U-A WC           20-XX      cWW  cW-W

****************************************************************************
another unwanted text ...
another unwanted text
'''

regex = re.compile(r"List of (\d+) base pairs")

started = False
for line in text.splitlines():
    if started:
        if line.startswith('*******'):
            break
        print(line)
    elif regex.search(line):
        started = True
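
The loop above prints nothing when the header is missing. One possible extension, sketched here in Python 3 syntax, collects the lines instead of printing them and falls back to NONE. The function name extract_base_pairs and the shortened sample are hypothetical and not from the answer.

```python
import re

HEADER = re.compile(r"List of (\d+) base pairs")

def extract_base_pairs(lines):
    """Return the lines between the header and the next row of asterisks."""
    section = []
    started = False
    for line in lines:
        line = line.rstrip("\n")
        if started:
            if line.startswith("*****"):
                break              # first row of asterisks ends the section
            section.append(line)
        elif HEADER.search(line):
            started = True         # header found; collect from the next line
    # Drop blank lines at the end of the section
    while section and not section[-1].strip():
        section.pop()
    # Report NONE only if the header never appeared
    return "\n".join(section) if started else "NONE"

sample = """intro
*****
List of 2 base pairs
      nt1    nt2
   1 Q.C0   Q.G22

*****
trailer
"""
print(extract_base_pairs(sample.splitlines()))
print(extract_base_pairs(["no section here"]))
```

Because the function only iterates over its argument, an open file object should work as well, for example the handle from with open("myfile.txt") as handle.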

      



You can use the MULTILINE and DOTALL flags of the re module.

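#!/usr/bin/python

import re

# Read the whole file; the with-block closes it afterwards
with open('myfile.txt', 'r') as infile:
    text = infile.read()

# Raw string so the backslash escapes reach the regex engine unchanged
pat = re.compile(r"""
    List\ of\ \d+\ base\ pairs$  # The start of the match
    (.*?)                        # Note ? to make it nongreedy
    ^[*]+$                       # The ending line
    """, re.MULTILINE | re.DOTALL | re.VERBOSE)

mat = pat.search(text)

if mat:
    # Strip only newlines so the header line keeps its indentation
    print(mat.group(1).strip('\n'))
else:
    print('NONE')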

      

Notes:

  • You need ? after .* to make it non-greedy. Otherwise, if the file contains several lines of stars, the match runs on to the last one.
  • The spaces in the pattern must be escaped (List\ of\ ...) because re.VERBOSE is used. Otherwise the spaces are ignored and no match is found!
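
The first note can be checked with a short sketch (Python 3 syntax). It uses the same pattern written without re.VERBOSE, so the spaces need no escaping, and a made-up sample that contains two lines of stars.

```python
import re

# Hypothetical sample with two lines of stars after the header
text = "List of 1 base pairs\nrow A\n*****\nother text\n*****\n"

flags = re.MULTILINE | re.DOTALL
lazy = re.compile(r"List of \d+ base pairs$(.*?)^[*]+$", flags)
greedy = re.compile(r"List of \d+ base pairs$(.*)^[*]+$", flags)

# The non-greedy version stops at the first line of stars
print(repr(lazy.search(text).group(1)))    # '\nrow A\n'
# The greedy version runs on to the last line of stars
print(repr(greedy.search(text).group(1)))  # '\nrow A\n*****\nother text\n'
```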


Another regex to try:

import re
f = open(my_file).read()  # my_file holds the path to the input file
print(''.join(re.findall(r'\s+nt1[^\n]+\n|\s+\d+\sQ\.[^\n]+\n', f, re.M)))

      

It matches any line that starts with nt1, or with a number followed by Q., as written in the pattern passed to re.findall.
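
A small sketch of how this pattern behaves, with the NONE fallback from the question added (Python 3 syntax). The helper name table_rows and the shortened sample are hypothetical.

```python
import re

# Same pattern as in the answer above, written as a raw string
ROW_PATTERN = r'\s+nt1[^\n]+\n|\s+\d+\sQ\.[^\n]+\n'

def table_rows(text):
    """Join every matching line, or return 'NONE' when nothing matches."""
    rows = ''.join(re.findall(ROW_PATTERN, text, re.M))
    # \s+ also swallows the newline before the header, so trim newlines only
    return rows.strip('\n') if rows else 'NONE'

sample = (
    "List of 2 base pairs\n"
    "      nt1   nt2\n"
    "   1 Q.C0  Q.G22\n"
    "   2 Q.C1  Q.G21\n"
    "\n*****\n"
)
print(table_rows(sample))
print(table_rows("no table here\n"))
```

Unlike the other answers, this approach does not look for the header or the line of stars, so it would presumably also pick up similarly shaped rows elsewhere in the file.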
