Increment counter as dictionary value in loop
I have a list of several hundred amino acid sequences called aa_seq. It looks like this: ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPNIPTM']. Each sequence is 27 letters long. I have to determine the most commonly used amino acid at each position (1-27) and its frequency.
So far I have:
count_dict = {}
counter = count_dict.values()
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',  # one-letter code for amino acids
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
for p in range(0,26):  # first round: looks at the first position in each sequence
    for s in range(0,len(aa_seq)):  # goes through all sequences of the list
        for item in aa_list:  # and checks for the occurrence of each amino acid letter (=item)
            if item in aa_seq[s][p]:
                count_dict[item]  # if that letter occurs at the respective position, make it a key in the dictionary
                counter += 1  # and increase its counter (the value, as defined above) by one
print count_dict
It says KeyError: 'A' and points to the line count_dict[item]. So the aa_list item apparently cannot be added as a key this way? How should I do it? It also gave the error "'int' object is not iterable" for the counter. How else can the counter be incremented?
You can use the Counter class:
>>> from collections import Counter
>>> l = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
>>> s = [Counter([l[j][i] for j in range(len(l))]).most_common()[0] for i in range(27)]
>>> s
[('A', 1),
('A', 1),
('Y', 1),
('I', 1),
('N', 1),
('Y', 1),
('P', 2),
('M', 4),
('S', 2),
('Q', 1),
('E', 2),
('Q', 1),
('I', 1),
('I', 1),
('A', 1),
('Q', 1),
('A', 1),
('I', 1),
('I', 1),
('Q', 1),
('E', 2),
('C', 1),
('Q', 1),
('A', 1),
('Q', 1),
('I', 1),
('I', 1)]
However, it can be inefficient if you have large datasets.
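Since the question also asks for the frequency, the same idea can be extended to report a relative frequency per position. This is only a sketch, assuming Python 3 and sequences of equal length; the helper name most_common_per_position is made up for illustration, and zip(*seqs) is swapped in for the index-based list comprehension above:

```python
from collections import Counter

def most_common_per_position(seqs):
    """Return (residue, count, relative frequency) for each position.

    Assumes every sequence has the same length; zip() would silently
    truncate to the shortest one otherwise.
    """
    result = []
    for column in zip(*seqs):  # each column holds the residues at one position
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count, count / len(column)))
    return result

seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
        'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
summary = most_common_per_position(seqs)
print(summary[7])  # position 8: ('M', 4, 1.0)
```

When several residues tie at a position, which one is reported depends on how Counter orders equal counts, so only the non-tied positions are stable across Python versions.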
Modified code
Here's a modified, working version of your code. It is inefficient, but it should output the correct result.
A few notes:
- One counter is needed for each index. Therefore, you must initialize your dict inside the first loop.
- range(0,26) has only 26 elements: 0 through 25 (inclusive).
- defaultdict helps to define 0 as the initial value for each key.
- You need to increment the counter with count_dict[item] += 1
- At the end of each cycle, you need to find the key (amino acid) with the highest value (occurrences).
from collections import defaultdict
aa_seq = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
          'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',  # one-letter code for amino acids
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
for p in range(27):  # first round: looks at the first position in each sequence
    count_dict = defaultdict(int)  # initialize counter with 0 as default value
    for s in range(0, len(aa_seq)):  # goes through all sequences of the list
        # and checks for the occurrence of each amino acid letter (=item)
        for item in aa_list:
            if item in aa_seq[s][p]:
                # if that letter occurs at the respective position, make it a
                # key in the dictionary
                count_dict[item] += 1
    print(max(count_dict.items(), key=lambda x: x[1]))
It outputs:
('R', 1)
('S', 1)
('Y', 1)
('S', 1)
('E', 1)
('P', 1)
('P', 2)
('M', 4)
...
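Two of the notes above can be checked in isolation. The following sketch (Python 3 assumed, variable names chosen only for illustration) shows that a plain dict raises KeyError on a missing key while defaultdict(int) starts at 0, and that range(0, 26) stops short of the 27th position:

```python
from collections import defaultdict

plain = {}
try:
    plain['A'] += 1          # reading a missing key fails
except KeyError as err:
    error_key = err.args[0]  # the missing key, here 'A'

counts = defaultdict(int)    # missing keys start at int() == 0
counts['A'] += 1
counts['A'] += 1

positions_short = list(range(0, 26))  # 0..25, only 26 positions
positions_full = list(range(27))      # 0..26, all 27 positions

print(error_key, counts['A'], len(positions_short), len(positions_full))
```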
Alternative with counter
You don't need many loops; you only need to iterate once over each character of each sequence.
Also, there is no need to reinvent the wheel: Counter and most_common are better alternatives than defaultdict and max.
from collections import Counter
aa_seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
counters = [Counter() for i in range(27)]
for aa_seq in aa_seqs:
    for (i, aa) in enumerate(aa_seq):
        counters[i][aa] += 1
most_commons = [counter.most_common()[0] for counter in counters]
print(most_commons)
It outputs:
[..., ('P', 2), ('M', 4), ('S', 2), ('Q', 1), ('E', 2), ('G', 1), ('H', 1), ('N', 1), ('L', 1), ('N', 1), ('N', 1), ('I', 1), ('G', 1), ('H', 1), ('E', 2), ('G', 1), ('N', 1), ('K', 1), ('Y', 1), ('K', 1), ('G', 1)]
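The hard-coded 27 can be avoided if sequence lengths might vary. One possible variation (not from the original answer) keeps a Counter per position in a defaultdict, so positions are created on demand. The function name position_counters and the short toy sequences are illustrative only:

```python
from collections import Counter, defaultdict

def position_counters(seqs):
    """Map each 0-based position to a Counter of the residues seen there."""
    counters = defaultdict(Counter)
    for seq in seqs:
        for i, aa in enumerate(seq):
            counters[i][aa] += 1
    return counters

# Hypothetical toy data with unequal lengths.
toy_seqs = ['MKV', 'MKVA', 'MQ']
counters = position_counters(toy_seqs)
for i in sorted(counters):
    aa, n = counters[i].most_common(1)[0]
    total = sum(counters[i].values())  # how many sequences reach this position
    print(i + 1, aa, n, total)
```

Dividing n by total would give the frequency relative to the sequences that are long enough to have that position.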
To add an element to the dictionary, you must first initialize it to a value:
if item not in count_dict:
    count_dict[item] = 0
You can use the setdefault method to do this in one line:
count_dict.setdefault(item,0)
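Note that setdefault on its own only initializes the key; the increment is still needed. A small sketch of the full pattern, using one made-up column of residues:

```python
count_dict = {}
column = ['M', 'M', 'L', 'M']  # hypothetical residues at one position

for item in column:
    # setdefault returns the existing value, or stores and returns 0
    count_dict[item] = count_dict.setdefault(item, 0) + 1

print(count_dict)
```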
This is how you can quickly count items in a dictionary; just adapt it to whatever you have created:
count_dict = {}
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
for element in aa_list:
    # get() returns 0 for a missing key, so the first occurrence becomes 1
    count_dict[element] = count_dict.get(element, 0) + 1
print(count_dict)