Protein sequence encoding

Question

Protein sequence encoding

I am working on a python program to calculate the numeric encoding of mutated residues and positions of a set of strings (protein sequence) stored in fasta file, each protein sequence is comma separated. I am trying to find the position and sequences that are mutating.

My fasta file looks like this:

MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN

Example:
The following figure (based on another set of fasta files) will explain the algorithm behind this. In this figure, the first field is the alignment of the input file sequences. The last field represents the output file. How can I do this with a fasta file in Python?

example input file:

MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD


        positions  1  2  3  4  5  6                  1  2  3  4  5  6    
protein sequence1  M  T  A  Q  D  D                     T  A     D
protein sequence2  M  T  A  Q  D  D                     T  A     D    
protein sequence3  M  T  S  Q  E  D                     T  S     E    
protein sequence4  M  T  A  Q  D  D                     T  A     D    
protein sequence5  M  K  A  Q  H  D                     K  A     H

     PROTEIN SEQUENCE ALIGNMENT                   DISCARD NON-VARIABLE REGION    

        positions  2  2  3  3  5  5  5    
protein sequence1  T     A     D       
protein sequence2  T     A     D       
protein sequence3  T        S     E    
protein sequence4  T     A     D       
protein sequence5     K  A           H

MISSING RESIDENT SPLITED SEPARATE COLUMN

The output file should look like this:

position+residue   2T  2K  3A  3S  5D  5E  5H    
       sequence1   1   0   1   0   1   0   0    
       sequence2   1   0   1   0   1   0   0    
       sequence3   1   0   0   1   0   1   0    
       sequence4   1   0   1   0   1   0   0    
       sequence5   0   1   1   0   0   0   1

    (RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)

Here are the two ways I tried to do:

ls= 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
a=set().union(*pos)
alle = sorted(set().union(*pos))
print '\t'.join(str(x) + y for x, y in alle)
for p in pos:
    print '\t'.join('1' if key in p else '0' for key in alle)

(here I am getting columns of mutated and non-mutated residuals, but I only need columns for mutated residuals)

from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'  
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val:(df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print df.select(lambda x: not df[x].all(), axis = 1)

(here it gives output, but not okay, i.e. first 2K, then 2T, then 3A, like this.)

How should I do it?

+3

python pandas

user2020442 01 Feb At 4:38 am

source to share

1 answer

Andy Hayden · Accepted Answer · 2013-02-01T06:57:08+0000

The function get_dummies

will take you most of the way:

In [11]: s
Out[11]: 
0    T
1    T
2    T
3    T
4    K
Name: 1

In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]: 
   1K  1T
0   0   1
1   0   1
2   0   1
3   0   1
4   1   0

And those columns that have different meanings:

In [21]: (df.ix[0] != df).any()
Out[21]: 
0    False
1     True
2     True
3    False
4     True
5    False

Combining them:

In [31]: I = df.columns[(df.ix[0] != df).any()]

In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]

In [33]: df[[]].join(J)
Out[33]: 
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1

Note. I created the original DataFrame as follows, however it might be more efficient depending on your situation:

df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))

Protein sequence encoding

More articles: