Long-index arrays in python

I am trying to shrink the memory footprint of 10 billion consecutive integers by referencing them as indices in a boolean array. In other words, I need to create an array of 10,000,000,000 elements, but that's well into the "long" range. When I try to reference an array index greater than sys.maxint, the array blows up:

x = [False] * 10000000000
Traceback (most recent call last):
  File "", line 1, in 
    x = [0] * 10000000000
OverflowError: cannot fit 'long' into an index-sized integer

What can I do? I can't find anyone on the net having this problem ... Presumably the answer is "python can't handle arrays larger than 2B".

+2




6 answers


With a 32-bit address space, any language will struggle to address such an array. The problem then becomes how much real memory you have on your computer.

If you need 10B array elements, each representing either true or false, then use array.array('I', ...):

container = array.array('I', [0]) * ((10000000000 + 31) // 32)




You can then set and clear bits using the usual masking and shifting operations.
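A minimal sketch of what those operations could look like (the helper names are mine, it assumes 'I' is a 4-byte unsigned int as on most platforms, and building the full container needs roughly 1.25 GB):

import array

N = 10 ** 10
container = array.array('I', [0]) * ((N + 31) // 32)   # 32 bits per 'I' element

def set_bit(bits, i):
    bits[i >> 5] |= 1 << (i & 31)        # turn index i on

def clear_bit(bits, i):
    bits[i >> 5] &= ~(1 << (i & 31))     # turn index i off

def test_bit(bits, i):
    return bool(bits[i >> 5] & (1 << (i & 31)))

set_bit(container, 9999999999)
print(test_bit(container, 9999999999))   # True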

Alternative:

If only a small number of items are true, or only a small number of items are false, then a set or dict gives the best memory use and is more convenient to program with.
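For instance, if only a handful of indices will ever be true, a sketch like this (names are mine) is enough:

true_indices = set()                 # store only the indices that are True
true_indices.add(9999999999)         # "set" a bit
true_indices.discard(253)            # "clear" a bit; no error if it was absent
print(9999999999 in true_indices)    # membership test replaces indexing -> True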

+5




The bitarray package looks like it might be another useful option.
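For reference, a rough sketch (mine, not from the answer) of how bitarray, a third-party C extension with a list-like interface over packed bits, could be applied here:

from bitarray import bitarray    # third-party: pip install bitarray

b = bitarray(10 ** 10)           # about 1.25 GB; bits start uninitialised
b.setall(False)                  # clear everything
b[9999999999] = True             # ordinary indexing, no masking needed
print(b[9999999999])             # True
print(b.count())                 # number of bits currently set to 1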



+4




A dense bit vector is a reasonable choice, but it won't be optimal unless you know you won't have more than about 10**10 elements, all clustered near each other, with a fairly random distribution. If your distribution is different, a different structure will do better.

For example, if you know that only a few of the indices in the range [0, 10**10) are present, use a set(); or, if the opposite is true and almost every element is present except for a small fraction, store the complement and test element not in mySet.

If the items tend to cluster in small ranges, you can use a run-length encoding, something like [xrange(0,10), xrange(10,15), xrange(15,100)]: search it by bisecting until you find the matching range, and if the index of that range is even, the element is in the set. Insertions and deletions then amount to shuffling the ranges around.
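A minimal sketch of that bisection idea, using a flat list of run boundaries instead of xrange objects (the layout is my own assumption):

import bisect

# 'boundaries' holds the start of each run, alternating in-set / out-of-set,
# beginning with an in-set run. Here the members are [0, 10) and [15, 100).
boundaries = [0, 10, 15, 100]

def contains(boundaries, x):
    # Find which run x falls into; an even run index means "inside the set".
    i = bisect.bisect_right(boundaries, x) - 1
    return i >= 0 and i % 2 == 0

print(contains(boundaries, 5))     # True  (in [0, 10))
print(contains(boundaries, 12))    # False (in the gap [10, 15))
print(contains(boundaries, 99))    # True  (in [15, 100))
print(contains(boundaries, 100))   # False (past the last run)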

If your distribution really is dense, but you need a little more than fits in memory (which seems typical in practice), you can manage memory with mmap and wrap the mapped file in an adapter that uses a mechanism similar to the array('I') solution suggested above.
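A rough sketch of such an mmap-backed adapter (the file name and helper names are my own, and Python 3 mmap indexing semantics are assumed):

import mmap

N = 10 ** 10
nbytes = (N + 7) // 8                   # one bit per element, ~1.25 GB on disk

with open('bits.bin', 'wb') as f:
    f.truncate(nbytes)                  # pre-size a (typically sparse) backing file

f = open('bits.bin', 'r+b')
bits = mmap.mmap(f.fileno(), nbytes)    # the OS pages bytes in and out on demand

def set_bit(i):
    bits[i >> 3] |= 1 << (i & 7)

def test_bit(i):
    return bool(bits[i >> 3] & (1 << (i & 7)))

set_bit(9999999999)
print(test_bit(9999999999))             # True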

To get an idea of how compressible your data is, try writing a file with representative content and then running a general-purpose compression algorithm (like gzip) over it to see how much reduction you get. If the reduction is significant, you can probably use some kind of space optimization in your code as well.

+3




"10B booleans (1.25MB of memory, assuming Python is normal)"

I think your arithmetic is wrong: stored super-compactly, 10B booleans would be 1.25 GIGAbytes, not MEGAbytes.

A list takes at least 4 bytes per item, so you'd need 40 GB to do it the way you want.

An array (see the array module in the standard library) can store the data in much less memory, so it might fit.

+2




Another option for very large bit arrays is the bitstring module. It uses a bytearray (or array.array on older versions of Python) to store the data, but its interface is just an array of bits. For your case you could use:

>>> from bitstring import BitString
>>> s = BitString(10000000000)         # zero initialised
>>> s.set([9, 999999999, 253])         # set 3 bits to '1'
>>> s[66] = True                       # set another bit
>>> s.allset([9, 66])                  # check if bits are set to '1'
True


I think this is preferable to doing all the masking and shifting yourself!

+1




From some googling around, e.g. PEP 353 (assuming I understand it correctly) and this exchange, it looks like the real issue is probably platform/system dependent. Do you have enough memory to handle 10,000,000,000 entries?

0








