When numbers are larger than Python sys.maxint, do they require much more memory?
I am iterating over 80 million lines in a 2.5 GB file to create a list of offsets to the start of each line. Memory grows slowly, as expected, until I hit roughly line 40 million; it then expands rapidly by 1.5 GB over 3-5 seconds, just before the process exits because it runs out of memory.
After doing some research, I found that the bloat begins around the time the current offset (curr_offset) reaches about 2 billion, which is close to my sys.maxint (2**31 - 1).
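One quick way to check whether crossing sys.maxint changes per-number storage is sys.getsizeof(); the exact byte counts depend on the interpreter and platform, so the sketch below is only illustrative (CPython 2, where sys.maxint exists):

import sys

below = sys.maxint        # largest value of the plain int type
above = sys.maxint + 1    # automatically promoted to an arbitrary-precision long

print type(below), sys.getsizeof(below)
print type(above), sys.getsizeof(above)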
My questions:
- Do numbers larger than sys.maxint require significantly more storage space? If so, why? If not, why am I seeing this behavior?
- What factors (such as which Python implementation and which operating system) determine sys.maxint?
- On my 2010 MacBook Pro using 64-bit Python, sys.maxint is 2**63 - 1.
- On my Windows 7 laptop using 64-bit IronPython, sys.maxint is the smaller 2**31 - 1. The same is true with 32-bit Python. For various reasons, I cannot get 64-bit Python on my Windows machine right now.
- Is there a better way to create this list of offsets?
The code in question:
f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    offsets.append(curr_offset)
    curr_offset += len(line)
f.close()
Here's a solution that works even with 32-bit Python versions: keep the line lengths (they are small), convert them to a NumPy array of 64-bit integers, and only then compute the offsets:
import numpy
with open('some_file', 'rb') as input_file:
    # Line lengths are small, so they fit comfortably in plain ints.
    # list() makes this also work on Python 3, where map() is lazy.
    lengths = list(map(len, input_file))
offsets = numpy.array(lengths, dtype=numpy.uint64).cumsum()
where cumsum() computes the cumulative sum of the line lengths. 80 million lines yield a manageable offsets array of 8 bytes × 80M = 640 MB.
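Note that these cumulative sums are the offsets of the end of each line (equivalently, the start of the following line). If you need the offset of the start of each line, as in the original loop, you can shift the result by one position; a minimal sketch (the start_offsets/end_offsets names are mine):

import numpy

with open('some_file', 'rb') as input_file:
    lengths = list(map(len, input_file))

end_offsets = numpy.array(lengths, dtype=numpy.uint64).cumsum()

# Start-of-line offsets: 0 for the first line, then the end of each previous line.
start_offsets = numpy.empty_like(end_offsets)
start_offsets[0] = 0
start_offsets[1:] = end_offsets[:-1]

With start_offsets in hand, seeking to int(start_offsets[k]) in a reopened file positions it at the beginning of line k.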
Building the lengths list can even be avoided entirely by constructing the array of lengths directly with numpy.fromiter():
import numpy
with open('some_file', 'rb') as input_file:
    # Stream the line lengths straight into a uint64 array, then accumulate:
    offsets = numpy.fromiter((len(line) for line in input_file), dtype=numpy.uint64).cumsum()
It is probably hard to find a much faster method: using a single numeric type (64-bit integers) makes NumPy arrays both more compact and faster to process than Python lists.
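If the number of lines happens to be known in advance, numpy.fromiter() can also preallocate the array through its count argument, which avoids intermediate reallocations while the array is being filled; a small sketch with a hypothetical known num_lines:

import numpy

num_lines = 80 * 10**6  # hypothetical: assumed to be known beforehand

with open('some_file', 'rb') as input_file:
    offsets = numpy.fromiter((len(line) for line in input_file),
                             dtype=numpy.uint64, count=num_lines).cumsum()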
An offset into a 2.5 GB file never needs more than eight bytes: a signed 64-bit integer can hold values up to 9223372036854775807, far more than 2.5G.
If you have 80 million lines, you need at most 8 bytes × 80M = 640 MB to store the array of 80 million offsets.
I would investigate whether something is wrong with your code or with Python, perhaps by using a different container (such as an explicit NumPy array of 64-bit integers, as sketched below), a preinitialized list, or even another language to store and process your offsets, for example off_t in C, with appropriate compilation flags.
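A minimal sketch of the "preinitialized 64-bit container" idea (the two-pass approach and variable names below are my own illustration, at the cost of reading the file twice):

import numpy

# First pass: count the lines so the offsets array can be allocated up front.
with open('some_file', 'rb') as f:
    num_lines = sum(1 for _ in f)

offsets = numpy.empty(num_lines, dtype=numpy.uint64)

# Second pass: record the start offset of each line, with no list reallocation.
curr_offset = 0
with open('some_file', 'rb') as f:
    for i, line in enumerate(f):
        offsets[i] = curr_offset
        curr_offset += len(line)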
(If you're curious and want to see demo code, I wrote a C program called sample that stores 64-bit offsets to the newlines in its input file, so that lines can be reservoir-sampled at a scale beyond what GNU sort handles.)
If you really cannot use 64-bit Python, you can hold the calculated offsets in a NumPy array of numpy.uint64 numbers (maximum value 2**64 - 1). This is a little awkward, because you have to dynamically expand the array when it reaches capacity, but it will solve your problem and will work on any version of Python.
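A minimal sketch of that dynamic-growth idea (the doubling strategy and names are mine, not the author's code):

import numpy

capacity = 1024
offsets = numpy.empty(capacity, dtype=numpy.uint64)

curr_offset = 0
n = 0
with open('some_file', 'rb') as f:
    for line in f:
        if n == capacity:
            # Double the capacity whenever the array fills up (amortized O(1) appends).
            capacity *= 2
            offsets = numpy.resize(offsets, capacity)
        offsets[n] = curr_offset
        n += 1
        curr_offset += len(line)

offsets = offsets[:n]  # trim the unused tail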
PS: my other answer gives a nicer NumPy-based solution that does not require dynamically resizing the offsets array.
Appending to a list reallocates the list's buffer as soon as it exceeds the current buffer's capacity. I don't know exactly how Python does it, but a common approach is to allocate 1.5x to 2x the previous buffer size. This is geometric growth, so it is normal to see memory requirements increase rapidly towards the end. It may simply be that the list is too large; a quick test would be to replace curr_offset += len(line) with curr_offset += 1 and see whether you get the same behavior.
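A sketch of that quick test (the same loop as in the question, with only the increment changed):

f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    offsets.append(curr_offset)
    curr_offset += 1  # constant increment instead of len(line)
f.close()

If memory still balloons around line 40 million, the list itself is the culprit; if it does not, the extra memory is coming from the size of the integer values.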