Numerical coding of mutated residues and positions

Question

I am writing a Python program that should compute a numeric encoding of the mutated residues and their positions for a set of protein sequences. The sequences are stored in a FASTA file, with each protein sequence separated by a comma. Sequence lengths can differ between proteins. As a first step, I tried to find the positions and residues that are mutated, using the following code:

a = 'AGFESPKLH'
b = 'KGFEHMKLH'
# print every position where the two sequences differ
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])

But I want to read the sequences from an input file. The figure below illustrates what I am trying to do: the first panel is the alignment of the input sequences, and the last panel shows the desired output file. How can I do this in Python? Please help. Thanks, everyone, for your time.

Example:

input file

MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD




        positions  1  2  3  4  5  6                         1  2  3  4  5  6

protein sequence1  M  T  A  Q  D  D                            T  A     D

protein sequence2  M  T  A  Q  D  D                            T  A     D

protein sequence3  M  T  S  Q  E  D                            T  S     E

protein sequence4  M  T  A  Q  D  D                            T  A     D

protein sequence5  M  K  A  Q  H  D                            K  A     H


     PROTEIN SEQUENCE ALIGNMENT                          DISCARD NON-VARIABLE REGION

        positions  2  2  3  3  5  5  5

protein sequence1  T     A     D   

protein sequence2  T     A     D   

protein sequence3  T        S     E

protein sequence4  T     A     D   

protein sequence5     K  A           H

   EACH MUTATED RESIDUE IS SPLIT INTO A SEPARATE COLUMN

The output file should look like this:

position+residue   2T  2K  3A  3S  5D  5E  5H

       sequence1   1   0   1   0   1   0   0

       sequence2   1   0   1   0   1   0   0

       sequence3   1   0   0   1   0   1   0

       sequence4   1   0   1   0   1   0   0

       sequence5   0   1   1   0   0   0   1

    (RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)

+3

python alignment

naz Jan 24 At 10:11


4 answers

Something like this?

ls = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')

pos = [set(enumerate(x, 1)) for x in ls]
alle = sorted(set().union(*pos))

print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))
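
Note that the snippet above emits a column for every position, including the conserved ones (1M, 4Q, 6D). A minimal sketch of the same idea restricted to the variable positions is given here. The function name `encode_variable_positions` is made up for illustration, and the sketch assumes the sequences are already aligned to the same length. Residues within a position come out in alphabetical order (2K before 2T), which differs slightly from the column order in the question.

```python
def encode_variable_positions(sequences):
    """Return (header, rows) for a 0/1 coding of the variable positions.

    Assumes the sequences are aligned, i.e. all have the same length.
    """
    keys = []
    # zip(*sequences) yields one tuple of residues per alignment column
    for pos, column in enumerate(zip(*sequences), 1):
        residues = sorted(set(column))
        if len(residues) > 1:  # skip conserved positions
            keys.extend((pos, res) for res in residues)
    # 1 if the sequence carries that residue at that position, else 0
    rows = [[1 if seq[pos - 1] == res else 0 for pos, res in keys]
            for seq in sequences]
    return ["%d%s" % key for key in keys], rows


seqs = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
header, rows = encode_variable_positions(seqs)
print('\t'.join(header))
for row in rows:
    print('\t'.join(map(str, row)))
```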

+1

georg Jan 24 At 10:28


protein_sequence = "MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN"

# Split the input into individual sequences
proteins = protein_sequence.split(",")
# For each protein sequence, remove the duplicate residues
proteins = ["".join(set(x)) for x in proteins]

# Create the result
result = []
key_set = ['T', 'K', 'A', 'S', 'D', 'E', 'K', 'R', 'D', 'N', 'E', 'Y', 'M', 'L', 'P', 'N', 'Q']
for protein in proteins:
    local_dict = dict(zip(key_set, [0] * len(key_set)))
    # Mark each amino acid that occurs in the protein
    for amino_acid in protein:
        local_dict[amino_acid] = 1
    result.append((protein, local_dict))

0

Ketouem Jan 30 13 at 12:31


You can use the pandas function get_dummies to do most of the heavy lifting:

In [11]: s # a pandas Series (DataFrame column)
Out[11]: 
0    T
1    T
2    T
3    T
4    K
Name: 1

In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]: 
   1K  1T
0   0   1
1   0   1
2   0   1
3   0   1
4   1   0

To put the data into a DataFrame, you can use:

df = pd.DataFrame([list(s) for s in 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')])

In [20]: df
Out[20]: 
   0  1  2  3  4  5
0  M  T  A  Q  D  D
1  M  T  A  Q  D  D
2  M  T  S  Q  E  D
3  M  T  A  Q  D  D
4  M  K  A  Q  H  D

And to find the columns whose values differ between sequences:

In [21]: (df != df.iloc[0]).any()
Out[21]: 
0    False
1     True
2     True
3    False
4     True
5    False

Putting it all together:

In [31]: I = df.columns[(df != df.iloc[0]).any()]

In [32]: J = (pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I)

In [33]: df[[]].join(J)
Out[33]: 
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
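
On recent pandas versions, `get_dummies` can also encode several columns in one call. The following is only a sketch, assuming pandas 0.23 or later (for the `dtype` argument). The 1-based string column labels are an assumption made here so that the output matches the position numbering used in the question.

```python
import pandas as pd

seqs = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
df = pd.DataFrame([list(s) for s in seqs])
# label columns with 1-based positions, as strings so they can act as prefixes
df.columns = [str(i + 1) for i in range(df.shape[1])]

# keep only the positions where more than one residue occurs
variable = df.loc[:, df.nunique() > 1]
# one 0/1 column per (position, residue) pair
coded = pd.get_dummies(variable, prefix_sep='', dtype=int)
print(coded)
```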

0

Andy Hayden 01 Feb At 15:36


root · Accepted Answer · 2013-01-24T11:23:21+0000

If you want to work with tabular data, consider pandas:

from pandas import DataFrame

data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'

# one row per sequence, one column per alignment position
df = DataFrame([list(row) for row in data.split(',')])

# one 0/1 column per (position, residue) pair
coded = DataFrame({str(col) + val: (df[col] == val).astype(int)
                   for col in df.columns for val in sorted(set(df[col]))})
print(coded)

output:

  0M  1K  1T  2A  2S  3Q  4D  4E  4H  5D
0   1   0   1   1   0   1   1   0   0   1
1   1   0   1   1   0   1   1   0   0   1
2   1   0   1   0   1   1   0   1   0   1
3   1   0   1   1   0   1   1   0   0   1
4   1   1   0   1   0   1   0   0   1   1

If you want to remove the columns that are all ones (the positions where no sequence varies):

# keep only the columns that contain at least one 0
print(coded.loc[:, ~coded.all()])

   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
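
Since the question asks for file input and output, here is a rough standard-library sketch of that part. The helper name `encode_file` and the file names `sequences.txt` and `coded.tsv` are hypothetical. The sketch assumes one comma-separated line of aligned sequences; `zip` silently truncates to the shortest sequence, so sequences of different lengths would need to be aligned first.

```python
import csv
import os
import tempfile


def encode_file(in_path, out_path):
    """Read comma-separated aligned sequences and write a 0/1 table (TSV)."""
    with open(in_path) as handle:
        seqs = [s.strip() for s in handle.read().split(',') if s.strip()]
    # (position, residue) pairs for every alignment column that varies
    keys = [(pos, res)
            for pos, column in enumerate(zip(*seqs), 1)
            if len(set(column)) > 1
            for res in sorted(set(column))]
    with open(out_path, 'w', newline='') as handle:
        writer = csv.writer(handle, delimiter='\t')
        writer.writerow(['sequence'] + ['%d%s' % key for key in keys])
        for number, seq in enumerate(seqs, 1):
            writer.writerow(['sequence%d' % number] +
                            [int(seq[pos - 1] == res) for pos, res in keys])


# demo with the sample data from the question, using throwaway files
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'sequences.txt')
out_path = os.path.join(workdir, 'coded.tsv')
with open(in_path, 'w') as handle:
    handle.write('MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD\n')
encode_file(in_path, out_path)
with open(out_path) as handle:
    print(handle.read())
```

In real use you would call `encode_file` directly with your own input and output paths instead of the temporary files.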
