Opening and editing multiple files in a folder using python
I am trying to modify the header lines of my .fasta files. The input looks like this:
>YP_009208724.1 hypothetical protein ADP65_00072 [Achromobacter phage phiAxp-3]
MSNVLLKQ...
>YP_009220341.1 terminase large subunit [Achromobacter phage phiAxp-1]
MRTPSKSE...
>YP_009226430.1 DNA packaging protein [Achromobacter phage phiAxp-2]
MMNSDAVI...
and the output should look like this:
>Achromobacter phage phiAxp-3
MSNVLLKQ...
>Achromobacter phage phiAxp-1
MRTPSKSE...
>Achromobacter phage phiAxp-2
MMNSDAVI...
I already have a script that does this for a single file:
with open('Achromobacter.fasta', 'r') as fasta_file:
    out_file = open('./fastas3/Achromobacter.fasta', 'w')
    for line in fasta_file:
        line = line.rstrip()
        if '[' in line:
            line = line.split('[')[-1]
            out_file.write('>' + line[:-1] + "\n")
        else:
            out_file.write(str(line) + "\n")
but I cannot automate the process for all 120 files in my folder.
I tried using glob.glob, but I cannot get it to work:
import glob
for fasta_file in glob.glob('*.fasta'):
    outfile = open('./fastas3/'+fasta_file, 'w')
    with open(fasta_file, 'r'):
        for line in fasta_file:
            line = line.rstrip()
            if '[' in line:
                line2 = line.split('[')[-1]
                outfile.write('>' + line2[:-1] + "\n")
            else:
                outfile.write(str(line) + "\n")
It gives me this result:
A
c
i
n
e
t
o
b
a
c
t
e
r
.
f
a
s
t
a
I was able to get a list of all files in a folder, but I cannot open certain files using the object in the list.
import os
file_list = []
for file in os.listdir("./fastas2/"):
    if file.endswith(".fasta"):
        file_list.append(file)
Given that you can already change the contents of a single file, you only need to automate the process. I turned the single-file code into a function and removed the second file handle, so each file is opened once and modified in place.
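def file_changer(filename):
    """Rewrite the FASTA headers of `filename` in place."""
    data_to_put = ''
    with open(filename, 'r+') as fasta_file:
        for line in fasta_file:
            line = line.rstrip()
            if '[' in line:
                # Keep only the text after the last '[' and drop the closing ']'
                line = line.split('[')[-1]
                data_to_put += '>' + line[:-1] + "\n"
            else:
                data_to_put += line + "\n"
        # Reading leaves the position at the end of the file, so rewind first;
        # otherwise the new content would be appended after the old content
        fasta_file.seek(0)
        fasta_file.write(data_to_put)
        fasta_file.truncate()  # drop leftover bytes from the longer original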
Now we need to iterate over all your files, so let's use the glob module for that:
import glob
for file in glob.glob('*.fasta'):
    file_changer(file)
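Since file_changer overwrites each file in place, an interrupted run could leave a file half-written. A more cautious variant, sketched here under the assumption that a temporary file in the same folder is acceptable, writes the new content elsewhere and only swaps it in once it is complete. The name rewrite_headers_safely is hypothetical and not part of the answer above:

```python
import os
import tempfile


def rewrite_headers_safely(filename):
    """Rewrite FASTA headers via a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(filename))
    # Create the temp file next to the original so os.replace stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as tmp, open(filename) as fasta_file:
            for line in fasta_file:
                line = line.rstrip()
                if '[' in line:
                    # Keep only the text after the last '[' and drop the closing ']'
                    line = '>' + line.split('[')[-1][:-1]
                tmp.write(line + '\n')
        os.replace(tmp_name, filename)  # the original is replaced only on success
    except BaseException:
        os.remove(tmp_name)  # clean up the partial temp file, then re-raise
        raise
```

It can be dropped into the same glob loop in place of file_changer. Note that the replaced file takes on the permissions of the temporary file, which may differ from the original's.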
Your loop iterates over the file name (a string) rather than the opened file object, which gives you the characters of the name instead of the lines of the file. Here is the corrected version of the code:
import glob
for fasta_file_name in glob.glob('*.fasta'):
    with open(fasta_file_name, 'r') as fasta_file, \
            open('./fastas3/' + fasta_file_name, 'w') as outfile:
        for line in fasta_file:
            line = line.rstrip()
            if '[' in line:
                line2 = line.split('[')[-1]
                outfile.write('>' + line2[:-1] + "\n")
            else:
                outfile.write(str(line) + "\n")
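The loop above assumes that the fastas3 folder already exists and that every line containing '[' is a header ending in ']'. As a rough sketch of how those assumptions could be made explicit, the version below factors the header rewrite into a helper and creates the output folder first. The names clean_header and convert_folder are hypothetical, chosen only for this illustration:

```python
from pathlib import Path


def clean_header(line):
    """Return '>organism' for a header ending in '[organism]'; else the line unchanged."""
    line = line.rstrip()
    if line.startswith('>') and '[' in line and line.endswith(']'):
        # Text after the last '[', minus the closing ']'
        return '>' + line.rsplit('[', 1)[-1][:-1]
    return line


def convert_folder(src_dir, dst_dir):
    """Write a cleaned copy of every *.fasta file in src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
    for fasta_path in sorted(src.glob('*.fasta')):
        with open(fasta_path) as fasta_file, \
                open(dst / fasta_path.name, 'w') as outfile:
            for line in fasta_file:
                outfile.write(clean_header(line) + '\n')
```

Calling convert_folder('.', './fastas3') would then mirror the loop above while leaving the original files untouched.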
As an alternative to a Python script, you can simply use sed from the command line:
sed -i 's/^>.*\[\(.*\)\].*$/>\1/' *.fasta
This will change all files in place, so copy them first.
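To see what the sed expression matches, here is roughly the same substitution written with Python's re module. This is only an illustration of the pattern, not a replacement for the command; the name strip_header is hypothetical:

```python
import re

# Python spelling of the sed pattern: the \( \) groups become ( )
HEADER_RE = re.compile(r'^>.*\[(.*)\].*$')


def strip_header(line):
    """Replace a '>... [organism]' header with '>organism'; leave other lines alone."""
    return HEADER_RE.sub(r'>\1', line)


print(strip_header('>YP_009220341.1 terminase large subunit [Achromobacter phage phiAxp-1]'))
# prints: >Achromobacter phage phiAxp-1
print(strip_header('MRTPSKSE'))
# prints: MRTPSKSE
```

Because the first .* is greedy, the capture group holds the text inside the last pair of brackets on the line, which matches what the Python answers do with split('[')[-1].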