Increment counter as dictionary value in loop

I have a list of several hundred amino acid sequences called aa_seq. It looks like this: ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPNIPTM']. Each sequence is 27 letters long. I have to determine the most commonly used amino acid at each position (1-27) and its frequency.

So far I have:

   count_dict = {} 
   counter = count_dict.values()
   aa_list = ['A', 'C', 'D', 'E' ,'F' ,'G' ,'H' ,'I' ,'K' ,'L' ,    #one-letter code for amino acids
       'M' ,'N' ,'P' ,'Q' ,'R' ,'S' ,'T' ,'V' ,'W' ,'Y']
   for p in range(0,26):                       #first round:looks at the first position in each sequence
        for s in range(0,len(aa_seq)):          #goes through all sequences of the list 
             for item in aa_list:                #and checks for the occurrence of each amino acid letter (=item)
                  if item in aa_seq[s][p]:
                      count_dict[item]            #if that letter occurs at the respective position, make it a key in the dictionary
                      counter += 1                #and increase its counter (the value, as definded above) by one 
    print count_dict

      

It says KeyError: 'A' and points to the line count_dict[item]. So apparently an aa_list item cannot be added as a key this way? How should I do it? It also gave the error "'int' object is not iterable" for the counter. How else can the counter be incremented?



4 answers


You can use the Counter class:

>>> from collections import Counter

>>> l = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
>>> s = [Counter([l[j][i] for j in range(len(l))]).most_common()[0] for i in range(27)]
>>> s
[('A', 1),
 ('A', 1),
 ('Y', 1),
 ('I', 1),
 ('N', 1),
 ('Y', 1),
 ('P', 2),
 ('M', 4),
 ('S', 2),
 ('Q', 1),
 ('E', 2),
 ('Q', 1),
 ('I', 1),
 ('I', 1),
 ('A', 1),
 ('Q', 1),
 ('A', 1),
 ('I', 1),
 ('I', 1),
 ('Q', 1),
 ('E', 2),
 ('C', 1),
 ('Q', 1),
 ('A', 1),
 ('Q', 1),
 ('I', 1),
 ('I', 1)]

      



However, it can be inefficient if you have large datasets.
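As a possible variation on the one-liner above, here is a minimal sketch that transposes the sequences with zip instead of indexing l[j][i] for every pair of indices. It assumes all sequences have the same length (zip stops at the shortest one), and the names seqs and most_common_per_position are only illustrative.

```python
from collections import Counter

seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
        'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']

# zip(*seqs) yields one tuple per position: the letters found at that
# position in every sequence (assumes equal-length sequences).
most_common_per_position = [
    Counter(column).most_common(1)[0]  # (letter, count) of the top letter
    for column in zip(*seqs)
]

print(most_common_per_position)
```

Each sequence is read once, and most_common(1) asks only for the top entry rather than the full ranking.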



Modified code

Here's a modified working version of your code. It is inefficient, but it should output the correct result.

A few notes:

  • One counter is needed for each index. Therefore, you must initialize your dict inside the first loop.
  • range(0,26) has only 26 elements: 0 through 25 (inclusive), so the last of the 27 positions is skipped.
  • defaultdict(int) supplies 0 as the initial value for any key that is not in the dict yet.
  • You need to increment the counter with count_dict[item] += 1.
  • At the end of each cycle, you need to find the key (amino acid) with the highest value (occurrences).

from collections import defaultdict

aa_seq = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
          'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',  # one-letter code for amino acids
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

for p in range(27):                  # first round:looks at the first position in each sequence
    count_dict = defaultdict(int)    # initialize counter with 0 as default value
    for s in range(0, len(aa_seq)):  # goes through all sequences of the list
        # and checks for the occurrence of each amino acid letter (=item)
        for item in aa_list:
            if item in aa_seq[s][p]:
                # if that letter occurs at the respective position, make it a
                # key in the dictionary
                count_dict[item] += 1
    print(max(count_dict.items(), key=lambda x: x[1]))

      

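It outputs (when several letters tie for the highest count, which one is printed can vary between Python versions):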



('R', 1)
('S', 1)
('Y', 1)
('S', 1)
('E', 1)
('P', 1)
('P', 2)
('M', 4)
...
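The notes above can be checked in isolation. The following is only a small sketch, assuming Python 3; the names plain, counts, missing_key and positions_visited are illustrative and do not come from the question.

```python
from collections import defaultdict

# Reading a missing key from a plain dict raises KeyError,
# which is the first error reported in the question.
plain = {}
try:
    plain['A']
except KeyError as err:
    missing_key = err.args[0]

# With defaultdict(int), a missing key starts at int() == 0,
# so "+= 1" works on the very first occurrence.
counts = defaultdict(int)
counts['A'] += 1

# range(0, 26) stops before 26, so index 26 (the 27th position) is never visited.
positions_visited = len(range(0, 26))

print(missing_key, counts['A'], positions_visited)
```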

      

Alternative with counter

You don't need many loops; you just need to iterate once over each character of each sequence.

Also, there is no need to reinvent the wheel: Counter and most_common are better alternatives than defaultdict and max.

from collections import Counter

aa_seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']

counters = [Counter() for i in range(27)]

for aa_seq in aa_seqs:
    for (i, aa) in enumerate(aa_seq):
        counters[i][aa] += 1

most_commons = [counter.most_common()[0] for counter in counters]
print(most_commons)

      

It outputs:



[..., ('P', 2), ('M', 4), ('S', 2), ('Q', 1), ('E', 2), ('G', 1), ('H', 1), ('N', 1), ('L', 1), ('N', 1), ('N', 1), ('I', 1), ('G', 1), ('H', 1), ('E', 2), ('G', 1), ('N', 1), ('K', 1), ('Y', 1), ('K', 1), ('G', 1)]
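The question also asks at what frequency the most common amino acid occurs. One way to get it, sketched here under the assumption of Python 3 (true division) and equal-length sequences, is to divide each top count by the number of sequences. The name summary and the tuple layout are illustrative choices.

```python
from collections import Counter

aa_seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
           'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']

# One Counter per position, filled in a single pass over every character.
counters = [Counter() for _ in range(len(aa_seqs[0]))]
for aa_seq in aa_seqs:
    for i, aa in enumerate(aa_seq):
        counters[i][aa] += 1

# (1-based position, amino acid, relative frequency) for each position.
summary = []
for i, counter in enumerate(counters):
    aa, count = counter.most_common(1)[0]
    summary.append((i + 1, aa, count / len(aa_seqs)))

for position, aa, frequency in summary:
    print(position, aa, frequency)
```

A frequency of 1.0 means every sequence has the same letter at that position, as is the case for M at position 8 in the four example sequences.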


To add an element to the dictionary, you must first initialize it to a value:

if item not in count_dict:
    count_dict[item]=0

      

You can use the setdefault method to do this in one line:

count_dict.setdefault(item,0)
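Since setdefault returns the stored value, initializing and incrementing can be combined. A minimal sketch follows, where column is an illustrative name holding the letters at position 7 of the four example sequences used in the other answers:

```python
# Letters at position 7 (1-based) of the four example sequences.
column = ['P', 'G', 'L', 'P']

count_dict = {}
for item in column:
    # setdefault stores 0 for a new key and returns the current value,
    # so adding 1 covers both the first and the later occurrences.
    count_dict[item] = count_dict.setdefault(item, 0) + 1

print(count_dict)
```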

      



This is how you can quickly add items to the dictionary and count them; just adapt it to the code you already have:

count_dict = {} 

aa_list = ['A', 'C', 'D', 'E' ,'F' ,'G' ,'H' ,'I' ,'K' ,'L' ,
       'M' ,'N' ,'P' ,'Q' ,'R' ,'S' ,'T' ,'V' ,'W' ,'Y']

for element in aa_list:
    # get() returns 0 for a key that is not in the dict yet
    count_dict[element] = count_dict.get(element, 0) + 1

print(count_dict)

      
