Python: working with bits. Encoding nucleotides with ones and zeros
I would like to encode nucleotides "A", "G", "C" and "T" using bit encoding in Python. For example:
'A' = 00
'G' = 01
'C' = 10
'T' = 11
To build a huge dict containing k-mers like:
dic = { 'ATGACTGACT':231, 'AAATGACGGAC':500 ... }
I think this might reduce the amount of memory required for this dict, since "ATGC" stored as a string takes 4 bytes, while the same word takes only 8 bits (1 byte) with bit encoding.
I'm not sure whether this can be done, and if so, how to do it in Python.
Thanks in advance!
EDITED: I'm sorry that I didn't explain myself correctly.
I want to go through a sequence made by "ATGC" with a sliding window of size k and count how many times each k-mer appears in this section. For example:
'ATGAATGAA' # with a sliding window of 5 would be
dic = { 'ATGAA':2, 'TGAAT':1, 'GAATG':1, 'AATGA':1, }
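For reference, a plain-string version of this count (without any bit encoding, using the pre-built dict of all possible k-mers described below) might look like:

```python
import itertools

def count_kmers(seq, k):
    # Pre-build all 4**k possible k-mers with zero counts,
    # then slide a window of size k over the sequence.
    counts = {''.join(p): 0 for p in itertools.product('ATGC', repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

counts = count_kmers('ATGAATGAA', 5)
print(counts['ATGAA'], counts['TGAAT'], counts['CCCCC'])  # 2 1 0
```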
Since I would like to construct a dict with all possible combinations of "AGTC" of size k before starting to read the sequence, in order to access that dict with each k-mer as key and add 1 to its value, I was wondering whether the k-mers in this dict could be stored bit-encoded. More or less:
dic = {0b1011001010: 3, 0b0000110011: 666, ... etc }
I am currently building this dict with itertools:
import itertools
# k-mers of size 8
{''.join(x): 0 for x in itertools.product('ATGC', repeat=8)}
I guess another problem would be that each k-mer would need to be converted to this bit encoding in order to access the dict
You can convert your k-mers to integers, but as Ignacio points out, you then also need to know their length, so you may need to store that as well. Even so, for sufficiently long sequences this will save memory.
Here is some sample code that takes sequences, encodes them, and decodes them again:
encoding_map = {'A': 0, 'G': 1, 'C': 2, 'T': 3}
decoding_lst = ['A', 'G', 'C', 'T']

def encode(k):
    code = 0
    for ch in k:
        code *= 4
        code += encoding_map[ch]
    return code, len(k)

def decode(enc):
    code, length = enc
    ret = ''
    for _ in range(length):
        index = code & 3
        code >>= 2
        ret = decoding_lst[index] + ret
    return ret

kmers = ['ATGACTGACT', 'ATGC', 'AATGC']
kmerdict = {k: encode(k) for k in kmers}
print(kmerdict)
for key, enc in kmerdict.items():
    print(enc, decode(enc))
typical output:
{'AATGC': (54, 5), 'ATGC': (54, 4), 'ATGACTGACT': (215883, 10)}
(54, 5) AATGC
(54, 4) ATGC
(215883, 10) ATGACTGACT
By the way, no matter how long the sequence is, Python can handle the encoding and decoding, because its integers expand to as many bits as needed to hold the number.
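To illustrate that last point, here is a self-contained round-trip check (repeating the encode/decode functions above) on a random 100-mer; the sequence itself is generated here just for illustration:

```python
import random

encoding_map = {'A': 0, 'G': 1, 'C': 2, 'T': 3}
decoding_lst = ['A', 'G', 'C', 'T']

def encode(k):
    code = 0
    for ch in k:
        code = code * 4 + encoding_map[ch]
    return code, len(k)

def decode(enc):
    code, length = enc
    ret = ''
    for _ in range(length):
        ret = decoding_lst[code & 3] + ret
        code >>= 2
    return ret

# A single Python int comfortably holds a 100-mer (at most 200 bits).
seq = ''.join(random.choice('AGCT') for _ in range(100))
enc = encode(seq)
assert decode(enc) == seq
print(enc[0].bit_length())  # at most 200 bits
```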
This is exactly what you asked for:
In [11]: d={'A':'00','G':'01','C':'10','T':'11'}
In [12]: int('0B'+''.join([d[c] for c in 'ATGACTGACT']),2)
Out[12]: 215883
In [13]: int('0B'+''.join([d[c] for c in 'ATGACTGACT'[::-1]]),2)
Out[13]: 925212
But the objections raised in the comments by pmod and Ignacio Vazquez-Abrams are really important; I think you should seriously rethink your approach.
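For completeness, the reverse direction (a sketch, assuming you stored the k-mer length k alongside the integer) can use format to recover the bit string and decode it two bits at a time:

```python
d = {'A': '00', 'G': '01', 'C': '10', 'T': '11'}
rev = {v: k for k, v in d.items()}

def int_to_kmer(n, k):
    # Pad to 2*k bits so leading A's (00) are not lost.
    bits = format(n, 'b').zfill(2 * k)
    return ''.join(rev[bits[i:i + 2]] for i in range(0, 2 * k, 2))

print(int_to_kmer(215883, 10))  # ATGACTGACT
```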
As @gbofi's answer suggests, it is quite easy to convert a k-mer to an integer between 0 and 4**k - 1. An alternative, primarily mathematical way of doing the encoding would be:
def kmer_to_int(kmer):
return sum(4**i * "ATGC".index(x) for i, x in enumerate(kmer))
I have not tested if this is faster than creating a binary string and then converting it to int.
This code puts the first character in the least significant bit position, so "AT" becomes 0b0100, or 4, and "TA" becomes 0b0001, or 1. If you want the encoding to treat the first letters as most significant, use enumerate(reversed(kmer)) instead of enumerate(kmer) in the generator expression.
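A quick sketch comparing the two orderings (the second function name is hypothetical, just for the comparison):

```python
def kmer_to_int(kmer):
    # First character in the least significant position.
    return sum(4**i * "ATGC".index(x) for i, x in enumerate(kmer))

def kmer_to_int_msb(kmer):
    # First character in the most significant position instead.
    return sum(4**i * "ATGC".index(x) for i, x in enumerate(reversed(kmer)))

print(kmer_to_int('AT'), kmer_to_int('TA'))          # 4 1
print(kmer_to_int_msb('AT'), kmer_to_int_msb('TA'))  # 1 4
```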
As others have commented, these integers are only unique for a given length k. The same integer is the encoding for strings of different lengths if they differ only in the number of trailing A's (e.g. "ATG", "ATGA", "ATGAA", "ATGAAA", and so on all encode to 36).
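This collision is easy to demonstrate: trailing A's contribute 0 in the high positions, so they change nothing.

```python
def kmer_to_int(kmer):
    return sum(4**i * "ATGC".index(x) for i, x in enumerate(kmer))

# Trailing A's encode as 0 in the most significant positions:
print([kmer_to_int(s) for s in ("ATG", "ATGA", "ATGAA", "ATGAAA")])
# [36, 36, 36, 36]
```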
As for your broader goal of counting the occurrences of specific k-mers in a larger sequence, I'm not sure whether you will see a benefit from encoding the k-mers this way; it may depend on the details of your dataset.
The advantage of integer keys is that they allow you to use a list rather than a dictionary to store your counts. You can create the corresponding list with lst = [0] * 4**k, and then increment the count for a k-mer you've seen with lst[kmer_to_int(kmer)] += 1. Lists have less overhead than dictionaries with the same number of entries, but I'm not sure the difference would be big enough to matter.
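A minimal sketch of this list-based counting, assuming the kmer_to_int encoding above:

```python
def kmer_to_int(kmer):
    return sum(4**i * "ATGC".index(x) for i, x in enumerate(kmer))

def count_kmers_list(seq, k):
    # One slot per possible k-mer; the encoded integer is the index.
    lst = [0] * 4**k
    for i in range(len(seq) - k + 1):
        lst[kmer_to_int(seq[i:i + k])] += 1
    return lst

counts = count_kmers_list('ATGAATGAA', 5)
print(counts[kmer_to_int('ATGAA')])  # 2
```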
If your data is sparse (that is, many of the 4**k possible k-length sequences never appear in your input), using a list can still waste a lot of memory, since the list is always 4**k entries long. A better approach might be to use some other technique to optimize your dict for sparse data.
One option is to use the dict method get to avoid having to initialize all the values in your dict to 0 up front. If you change the increment code to d[key] = d.get(key, 0) + 1, it will work whether or not key was already in the dictionary.
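A minimal sketch of that pattern, with no pre-population at all:

```python
def count_kmers_get(seq, k):
    d = {}
    for i in range(len(seq) - k + 1):
        key = seq[i:i + k]
        # get() supplies 0 for keys not yet in the dict.
        d[key] = d.get(key, 0) + 1
    return d

print(count_kmers_get('ATGAATGAA', 5))
# {'ATGAA': 2, 'TGAAT': 1, 'GAATG': 1, 'AATGA': 1}
```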
Another option is to use collections.Counter rather than a regular dict. The Counter class is specifically designed for counting instances of elements in an input sequence, which seems to be exactly what you are doing. It assumes that any key it hasn't seen before has a count of 0.
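With Counter the whole count becomes a one-liner:

```python
from collections import Counter

def count_kmers_counter(seq, k):
    # Counter treats unseen keys as having count 0.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

c = count_kmers_counter('ATGAATGAA', 5)
print(c['ATGAA'], c['CCCCC'])  # 2 0
```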