Complex regular expressions in Python

I have a txt file containing the following data:

chrI

ATGCCTTGGGCAACGGT ... (multiple lines)

chrII

AGGTTGGCCAAGGTT ... (multiple lines)

I want to find "chrI" first and then iterate over the ATGC lines that follow it until I reach the xth character. Then I want to print the xth through yth characters. I am using regex, but once I have found the line that contains chrI, I have no idea how to keep iterating to find the xth character.

Here is my code:

for i, line in enumerate(sacc_gff):
    for match in re.finditer(chromo_val, line):
        print(line)
        for match in re.finditer(r"[ATGC]{%d},{%d}\Z" % (int(amino_start), int(amino_end)), line):
            print(match.group())

      

What the variables mean:

chromo_val = chrI
amino_start = (some starting point found by my program)
amino_end = (some endpoint found by my program)

Note: amino_start and amino_end must be in variable form.

Please let me know if I can clarify anything for you, thanks.



2 answers


It looks like you are working with FASTA data, so I will give an answer for that format; if you are not, you can still use the subsequence-selection part.

fasta_data = {}  # maps sequence ID -> full sequence string
seq_id = None
with open(fasta_file, 'r') as fh:
    for line in fh:
        if line.startswith('>'):
            seq_id = line.rstrip()[1:]  # strip the newline and the leading '>'
            fasta_data[seq_id] = ''
        elif seq_id is not None:
            fasta_data[seq_id] += line.rstrip()  # join the wrapped sequence lines

# amino_start and amino_end must be integers; Python slices are 0-based and exclude the end index
# substring of chromosome 'chrI' from index amino_start up to but not including amino_end
sequence_string1 = fasta_data['chrI'][amino_start:amino_end]
# substring of chromosome 'chrII' from index amino_start up to and including amino_end
sequence_string2 = fasta_data['chrII'][amino_start:amino_end+1]

      



FASTA format:

>chr1
ATTTATATATAT
ATGGCGCGATCG
>chr2
AATCGCTGCTGC
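
For comparison with the regex attempt in the question: a quantifier has the form {m} or {m,n}, so a pattern can skip the first start-1 bases and capture the next end-start+1. The sketch below is only an illustration, assuming 1-based inclusive coordinates and a sequence whose lines have already been joined into one string; the names seq, start and end are made up for the example. It shows that the regex and the plain slice pick out the same bases, which is why slicing is usually the simpler choice.

```python
import re

# chr1 from the FASTA example above, with its two lines joined (illustrative data)
seq = "ATTTATATATAT" + "ATGGCGCGATCG"
start, end = 11, 14  # hypothetical 1-based, inclusive coordinates

# Skip the first start-1 bases, then capture the next end-start+1 bases.
pattern = re.compile(r"[ATGC]{%d}([ATGC]{%d})" % (start - 1, end - start + 1))
match = pattern.match(seq)
by_regex = match.group(1) if match else None

# The same range by slicing: shift the start down by one, keep the end as is.
by_slice = seq[start - 1:end]

print(by_regex, by_slice)
```

Note that the regex version returns None if any character in the range falls outside ATGC (for example N), while the slice returns whatever is there.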

      



Since you are working with FASTA files, which are formatted like this:

>Chr1
ATCGACTACAAATTT
>Chr2
ACCTGCCGTAAAAATTTCC

and you are a bioinformatics major, I assume you will be manipulating sequences often, so I recommend installing a Perl package called FAST. Once it is installed, to get characters 2-14 of each sequence you would run:



fascut 2..14 fasta_file.fa

      

There is a recent publication on FAST, and its GitHub repository contains a whole suite of tools for manipulating molecular sequence data on the command line.
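
If installing a Perl package is not an option, roughly the same per-record cut can be sketched in Python. This is not FAST's implementation; it assumes that 2..14 means characters 2 through 14 counted from 1, inclusive, and the function names iter_fasta and cut_records are invented for the example.

```python
import io

def iter_fasta(handle):
    """Yield (header, sequence) pairs from FASTA text, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def cut_records(handle, start, end):
    """Return [(header, bases start..end)], positions assumed 1-based and inclusive."""
    return [(header, seq[start - 1:end]) for header, seq in iter_fasta(handle)]

# The two-record example from this answer, held in memory instead of a file.
demo = io.StringIO(">Chr1\nATCGACTACAAATTT\n>Chr2\nACCTGCCGTAAAAATTTCC\n")
for header, piece in cut_records(demo, 2, 14):
    print(">" + header)
    print(piece)
```

For a real file, pass an open file handle in place of the io.StringIO object.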
