Numerical coding of mutated residues and positions
I am writing a Python program to compute a numeric encoding of the mutated residues and their positions in a set of protein sequences. The sequences are stored in a FASTA file, separated by commas, and their lengths can differ between proteins. I tried to find the positions and residues that are mutated with the following code:
a = 'AGFESPKLH'
b = 'KGFEHMKLH'
# Report every position where the two sequences differ
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
However, I want to read the sequences from an input file. The example below shows what I am trying to do: the first block is the alignment of the sequences in the input file, and the last block is the output file. How can I do this in Python? Thanks, everyone, for your time.
Example:
Input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD

Protein sequence alignment (left), then discard the non-variable region (right):
positions          1 2 3 4 5 6     1 2 3 4 5 6
protein sequence1  M T A Q D D       T A   D
protein sequence2  M T A Q D D       T A   D
protein sequence3  M T S Q E D       T S   E
protein sequence4  M T A Q D D       T A   D
protein sequence5  M K A Q H D       K A   H

Each mutated residue is split into a separate column:
positions          2 2 3 3 5 5 5
protein sequence1  T   A   D
protein sequence2  T   A   D
protein sequence3  T     S   E
protein sequence4  T   A   D
protein sequence5    K A       H
The output file should look like this:
position+residue  2T 2K 3A 3S 5D 5E 5H
sequence1          1  0  1  0  1  0  0
sequence2          1  0  1  0  1  0  0
sequence3          1  0  0  1  0  1  0
sequence4          1  0  1  0  1  0  0
sequence5          0  1  1  0  0  0  1
(Residues are coded 1 if present, 0 if absent.)
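If you want to work with tabular data, consider pandas:
import pandas as pd

data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'
# One row per sequence, one column per (0-based) position
df = pd.DataFrame([list(row) for row in data.split(',')])
# One 0/1 column per (position, residue) pair
encoded = pd.DataFrame({str(col) + val: (df[col] == val).astype(int)
                        for col in df.columns for val in sorted(set(df[col]))})
print(encoded)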
Output:
   0M  1K  1T  2A  2S  3Q  4D  4E  4H  5D
0   1   0   1   1   0   1   1   0   0   1
1   1   0   1   1   0   1   1   0   0   1
2   1   0   1   0   1   1   0   1   0   1
3   1   0   1   1   0   1   1   0   0   1
4   1   1   0   1   0   1   0   0   1   1
If you want to remove the columns that are all 1 (the non-variable positions), apply this to the encoded frame:
print(encoded.loc[:, ~encoded.all()])
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
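The labels above use 0-based positions and sorted residues, whereas the question asks for 1-based positions (2T, 2K, ...) and rows named sequence1, sequence2, and so on. A minimal sketch of one way to get that layout follows; it assumes all sequences have the same length, and the names `columns` and `encoded` are only illustrative.

```python
import pandas as pd

data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'
df = pd.DataFrame([list(row) for row in data.split(',')])
df.columns = df.columns + 1                                # 1-based positions
df.index = [f"sequence{i + 1}" for i in range(len(df))]   # row names as in the question

columns = {}
for pos in df.columns:
    residues = df[pos].unique()          # residues in order of first appearance
    if len(residues) > 1:                # keep only variable positions
        for residue in residues:
            columns[f"{pos}{residue}"] = (df[pos] == residue).astype(int)

encoded = pd.DataFrame(columns)
print(encoded)
```

Because `unique()` keeps the order of first appearance, the columns come out as 2T, 2K, 3A, 3S, 5D, 5E, 5H, which is the order shown in the question.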
protein_sequence = "MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN"
# Parse the input
proteins = protein_sequence.split(",")
# For each protein sequence, remove duplicate residues
# (note: this discards the positions of the residues)
proteins = ["".join(set(p)) for p in proteins]
# Build the result: one (residues, presence flags) pair per protein
result = []
key_set = ['T', 'K', 'A', 'S', 'D', 'E', 'K', 'R', 'D', 'N', 'E', 'Y', 'M', 'L', 'P', 'N', 'Q']
for protein in proteins:
    # Start with every residue in key_set marked absent
    local_dict = dict.fromkeys(key_set, 0)
    # Mark each amino acid that occurs in the protein as present
    for amino_acid in protein:
        local_dict[amino_acid] = 1
    result.append((protein, local_dict))
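The code above records which residues occur in each sequence but not where they occur. As a rough sketch of a position-aware variant that needs only the standard library, the function below (a hypothetical helper, not part of the answer above) keeps the alignment columns that hold more than one residue and emits one 0/1 flag per position and residue. It assumes the sequences are already aligned and compares only up to the length of the shortest one.

```python
def one_hot_variable_positions(sequences):
    """Return (labels, rows) for alignment columns holding more than one residue."""
    # Assumption: only the positions shared by all sequences are compared.
    length = min(len(seq) for seq in sequences)
    keys = []
    for pos in range(length):
        residues = []
        for seq in sequences:
            if seq[pos] not in residues:
                residues.append(seq[pos])      # keep order of first appearance
        if len(residues) > 1:                  # variable position
            keys.extend((pos, residue) for residue in residues)
    # One row of 0/1 flags per sequence, one flag per (position, residue) pair
    rows = [[1 if seq[pos] == residue else 0 for pos, residue in keys]
            for seq in sequences]
    labels = [f"{pos + 1}{residue}" for pos, residue in keys]  # 1-based labels
    return labels, rows


labels, rows = one_hot_variable_positions(
    "MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD".split(","))
print(" ".join(labels))
for row in rows:
    print(" ".join(str(flag) for flag in row))
```

For the example from the question, the labels are 2T 2K 3A 3S 5D 5E 5H and the first row is 1 0 1 0 1 0 0.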
You can use the pandas function get_dummies to do most of the heavy lifting:
In [11]: s  # a pandas Series (DataFrame column)
Out[11]:
0    T
1    T
2    T
3    T
4    K
Name: 1
In [12]: pd.get_dummies(s, prefix=str(s.name), prefix_sep='', dtype=int)
Out[12]:
   1K  1T
0   0   1
1   0   1
2   0   1
3   0   1
4   1   0
To put the data into a DataFrame, you can use:
df = pd.DataFrame([list(seq) for seq in 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')])
In [20]: df
Out[20]:
   0  1  2  3  4  5
0  M  T  A  Q  D  D
1  M  T  A  Q  D  D
2  M  T  S  Q  E  D
3  M  T  A  Q  D  D
4  M  K  A  Q  H  D
And to find the columns that contain differing values (compare every row with the first one):
In [21]: (df.iloc[0] != df).any()
Out[21]:
0    False
1     True
2     True
3    False
4     True
5    False
Putting it all together:
In [31]: I = df.columns[(df.iloc[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=str(i), prefix_sep='', dtype=int) for i in I]
In [33]: pd.concat(J, axis=1)
Out[33]:
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
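Since the question asks for file input and file output, the steps above could be wrapped roughly as follows. This is only a sketch: `encode_file` and both paths are hypothetical names, the input is assumed to be a single comma-separated text file as in the question's example, and all sequences are assumed to have the same length. It applies get_dummies to the whole frame of variable columns at once instead of column by column.

```python
import pandas as pd


def encode_file(in_path, out_path):
    """Read comma-separated sequences, write the 0/1 table as CSV, and return it."""
    with open(in_path) as handle:
        sequences = [s.strip() for s in handle.read().split(',') if s.strip()]
    df = pd.DataFrame([list(seq) for seq in sequences])
    df.columns = [str(c + 1) for c in df.columns]    # 1-based position labels
    variable = df.loc[:, df.nunique() > 1]           # drop non-variable positions
    # The column name (position) becomes the prefix of each dummy column.
    encoded = pd.get_dummies(variable, prefix_sep='', dtype=int)
    encoded.index = [f"sequence{i + 1}" for i in range(len(encoded))]
    encoded.to_csv(out_path)
    return encoded
```

A call such as `encode_file('sequences.txt', 'encoded.csv')` (both file names are placeholders) would then write the table to disk. With get_dummies the residues within each position are sorted, so the columns come out as 2K, 2T, 3A, 3S, 5D, 5E, 5H.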