Numerical coding of mutated residues and positions

I am writing a python program that is supposed to calculate the numeric encoding of mutated residuals and positions of a rowset. These lines represent sequences of proteins. These sequences are stored in the fasta file and each protein sequence is separated by a comma. Sequence lengths can differ for different proteins. In this I tried to find the position and sequence that are mutated. I used the following code to get this.

a = 'AGFESPKLH'
b = 'KGFEHMKLH'
for i in range(len(a)):
  if a[i] != b[i]:
     print i, a[i], b[i]

      

But I want the sequence file as an input file. The following figure will show you my project. In this figure, the first field is the alignment of the input file sequences. The last field represents the output file. How can I do this in Python? please, help. Thanks everyone for your time.

Example:

input file

MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD




        positions  1  2  3  4  5  6                         1  2  3  4  5  6

protein sequence1  M  T  A  Q  D  D                            T  A     D

protein sequence2  M  T  A  Q  D  D                            T  A     D

protein sequence3  M  T  S  Q  E  D                            T  S     E

protein sequence4  M  T  A  Q  D  D                            T  A     D

protein sequence5  M  K  A  Q  H  D                            K  A     H


     PROTEIN SEQUENCE ALIGNMENT                          DISCARD NON-VARIABLE REGION

      


        positions  2  2  3  3  5  5  5

protein sequence1  T     A     D   

protein sequence2  T     A     D   

protein sequence3  T        S     E

protein sequence4  T     A     D   

protein sequence5     K  A           H

   MUTATED RESIDUE IS SPLITED TO SEPARATE COLUMN

      


The output file should look like this:

position+residue   2T  2K  3A  3S  5D  5E  5H

       sequence1   1   0   1   0   1   0   0

       sequence2   1   0   1   0   1   0   0

       sequence3   1   0   0   1   0   1   0

       sequence4   1   0   1   0   1   0   0

       sequence5   0   1   1   0   0   0   1

    (RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)

      

+3


source to share


4 answers


If you want to work with tabular data, consider pandas :

from pandas import *

data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'

df = DataFrame([list(row) for row in data.split(',')])

print DataFrame({str(col)+val:(df[col]==val).apply(int) 
        for col in df.columns for val in set(df[col])})

      

output:

  0M  1K  1T  2A  2S  3Q  4D  4E  4H  5D
0   1   0   1   1   0   1   1   0   0   1
1   1   0   1   1   0   1   1   0   0   1
2   1   0   1   0   1   1   0   1   0   1
3   1   0   1   1   0   1   1   0   0   1
4   1   1   0   1   0   1   0   0   1   1

      




If you want to remove columns with all:

print df.select(lambda x: not df[x].all(), axis = 1)    

   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1

      

+1


source


Something like that?



ls = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')

pos = [set(enumerate(x, 1)) for x in ls]
alle = sorted(set().union(*pos))

print '\t'.join(str(x) + y for x, y in alle)
for p in pos:
    print '\t'.join('1' if key in p else '0' for key in alle)

      

+1


source


protein_sequence = "MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN"

#Parse the file
proteins = protein_sequence.split(",")
#For each protein sequence remove the duplicates
proteins = map(lambda x:"".join(set(list(x))), proteins)

#Create result
result = []
key_set = ['T', 'K', 'A', 'S', 'D', 'E', 'K', 'R', 'D', 'N', 'E', 'Y', 'M', 'L', 'P', 'N', 'Q']
for protein in proteins:
    local_dict = dict(zip(key_set, [0] * len(key_set)))
    #Split the protein in amino acid
    components = list(protein)
    for amino_acid in components:
        local_dict[amino_acid] = 1
    result.append((protein, local_dict))

      

0


source


You can use the pandas function get_dummies

to do most of the heavy lifting:

In [11]: s # a pandas Series (DataFrame column)
Out[11]: 
0    T
1    T
2    T
3    T
4    K
Name: 1

In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]: 
   1K  1T
0   0   1
1   0   1
2   0   1
3   0   1
4   1   0

      

To put data into a DataFrame, you can use:

df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))

In [20]: df
Out[20]: 
   0  1  2  3  4  5
0  M  T  A  Q  D  D
1  M  T  A  Q  D  D
2  M  T  S  Q  E  D
3  M  T  A  Q  D  D
4  M  K  A  Q  H  D

      

And to find those columns that have different meanings:

In [21]: (df.ix[0] != df).any()
Out[21]: 
0    False
1     True
2     True
3    False
4     True
5    False

      

Putting it all together:

In [31]: I = df.columns[(df.ix[0] != df).any()]

In [32]: J = (pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I)

In [33]: df[[]].join(J)
Out[33]: 
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1

      

0


source







All Articles