Opening and editing multiple files in a folder using Python

I am trying to modify the headers in my .fasta files. The files currently look like this:

>YP_009208724.1 hypothetical protein ADP65_00072 [Achromobacter phage phiAxp-3]
MSNVLLKQ...

>YP_009220341.1 terminase large subunit [Achromobacter phage phiAxp-1]
MRTPSKSE...

>YP_009226430.1 DNA packaging protein [Achromobacter phage phiAxp-2]
MMNSDAVI...

      

and I want them to look like this:

>Achromobacter phage phiAxp-3
MSNVLLKQ...

>Achromobacter phage phiAxp-1
MRTPSKSE...

>Achromobacter phage phiAxp-2
MMNSDAVI...

      

Now I already have a script that can do this in a single file:

with open('Achromobacter.fasta', 'r') as fasta_file:
    out_file = open('./fastas3/Achromobacter.fasta', 'w')
    for line in fasta_file:
        line = line.rstrip()
        if '[' in line:
            line = line.split('[')[-1]
            out_file.write('>' + line[:-1] + "\n")
        else:
            out_file.write(str(line) + "\n")

      

but I cannot automate the process for all 120 files in my folder.

I tried using glob.glob, but I cannot get it to work:

import glob

for fasta_file in glob.glob('*.fasta'):
    outfile = open('./fastas3/'+fasta_file, 'w')
    with open(fasta_file, 'r'):
        for line in fasta_file:
            line = line.rstrip()
            if '[' in line:
                line2 = line.split('[')[-1]
                outfile.write('>' + line2[:-1] + "\n")
            else:
                outfile.write(str(line) + "\n")

      

It gives me this result:

A
c
i
n
e
t
o
b
a
c
t
e
r
.
f
a
s
t
a

      

I was able to get a list of all files in a folder, but I cannot open certain files using the object in the list.

import os


file_list = []
for file in os.listdir("./fastas2/"):
    if file.endswith(".fasta"):
        file_list.append(file)

      

+3




2 answers


Since you can already change the contents of a single file, you only need to automate the process. I wrapped your single-file code in a function that opens each file only once, instead of using two separate file handles, and modifies it in place.

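def file_changer(filename):
    # Rewrite the FASTA headers of a single file in place.
    data_to_put = ''
    with open(filename, 'r+') as fasta_file:
        for line in fasta_file.readlines():
            line = line.rstrip()
            if '[' in line:
                # Keep only the text inside the last [...] of the header.
                line = line.split('[')[-1]
                data_to_put += '>' + str(line[:-1]) + "\n"
            else:
                data_to_put += str(line) + "\n"
        # Jump back to the start; otherwise the new text is appended after the old.
        fasta_file.seek(0)
        fasta_file.write(data_to_put)
        # The new content is shorter, so cut off what remains of the old content.
        fasta_file.truncate()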

      



Now we need to iterate over all your files, so let's use the glob module for that. Note that this modifies each file in place, so keep a backup copy:

import glob
for file in glob.glob('*.fasta'):
    file_changer(file)
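To see why the position of the file pointer matters when a file is opened in 'r+' mode, here is a minimal self-contained sketch. The names rewrite_in_place, shorten and demo are hypothetical helpers made up for this illustration, and the file contents are invented. After all lines have been read, the pointer sits at the end of the file, so the sketch seeks back to the start before writing and truncates afterwards.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Replace every line of the file at `path` with transform(line)."""
    with open(path, 'r+') as handle:
        new_lines = [transform(line.rstrip('\n')) + '\n' for line in handle]
        handle.seek(0)        # back to the start, otherwise the write appends
        handle.writelines(new_lines)
        handle.truncate()     # cut off whatever is left of the old content


def shorten(line):
    """Same header rule as in the question: keep the text in the last [...]."""
    if '[' in line:
        return '>' + line.split('[')[-1][:-1]
    return line


def demo():
    """Run the rewrite on a throwaway file and return the resulting text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'demo.fasta')
        with open(path, 'w') as handle:
            handle.write('>long header [short]\nMSNV\n')
        rewrite_in_place(path, shorten)
        with open(path) as handle:
            return handle.read()


print(repr(demo()))
```

Without the seek(0) call the rewritten lines would end up after the original content instead of replacing it.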

      

+2




Your loop iterates over the filename, which is a string, so it gives you the characters of the name instead of the lines of the file. You need to iterate over the open file object instead. Here is the corrected version of the code:

import glob

for fasta_file_name in glob.glob('*.fasta'):
    with open(fasta_file_name, 'r') as fasta_file, \
            open('./fastas3/' + fasta_file_name, 'w') as outfile:
        for line in fasta_file:
            line = line.rstrip()
            if '[' in line:
                line2 = line.split('[')[-1]
                outfile.write('>' + line2[:-1] + "\n")
            else:
                outfile.write(str(line) + "\n")
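
The difference can be checked in isolation with a short sketch. The file name demo.fasta and its contents are made up for the demonstration and are written to a temporary directory.

```python
import os
import tempfile

# Hypothetical file name and contents, used only for this demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.fasta')
    with open(path, 'w') as handle:
        handle.write('>header one\nMSNV\n')

    # Iterating over the name (a str) yields single characters.
    chars = [item for item in 'demo.fasta']

    # Iterating over the open file object yields lines.
    with open(path, 'r') as handle:
        lines = [line.rstrip() for line in handle]

print(chars[:4])  # ['d', 'e', 'm', 'o']
print(lines)      # ['>header one', 'MSNV']
```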

      

As an alternative to a Python script, you can simply use sed from the command line:



sed -i 's/^>.*\[\(.*\)\].*$/>\1/' *.fasta

      

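This will change all files in place, so copy them first. The command is written for GNU sed; on BSD/macOS sed, the -i option requires a backup-suffix argument (for example -i '').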
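
For readers who want to check what the substitution does before running it on real data, here is a rough Python equivalent of the same regular expression. HEADER_PATTERN and shorten_header are hypothetical names chosen for this sketch; the sample header is taken from the question.

```python
import re

# Same idea as the sed expression: a line starting with '>' that contains
# [...] is reduced to '>' plus the bracket contents. The greedy leading .*
# makes the match start at the last '[' on the line.
HEADER_PATTERN = re.compile(r'^>.*\[(.*)\].*$')


def shorten_header(line):
    """Return the shortened header; lines that do not match are unchanged."""
    return HEADER_PATTERN.sub(r'>\1', line)


header = '>YP_009208724.1 hypothetical protein ADP65_00072 [Achromobacter phage phiAxp-3]'
print(shorten_header(header))      # >Achromobacter phage phiAxp-3
print(shorten_header('MSNVLLKQ'))  # MSNVLLKQ
```

Unlike the split('[') approach, this pattern only touches lines that start with '>', so sequence lines are left alone even if they happen to contain a bracket.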

+1








