numpy.array_split() odd behavior
I am trying to split a large data frame with loop data into smaller data frames equal to or close to the loop length. array_split worked fine until my data could no longer be split evenly (it worked with 500,000 cycles, but not with 1,190,508). I want the sections to be in 1,000-cycle increments (except for the last frame, which would be shorter).
Here's the script:
import math
import numpy as np
import pandas as pd
from numpy.random import random

d = {
    'a': pd.Series(random(1190508)),
    'b': pd.Series(random(1190508)),
    'c': pd.Series(random(1190508)),
}
frame = pd.DataFrame(d)

cycles = 1000
sections = math.ceil(len(frame) / cycles)
split_frames = np.array_split(frame, sections)
The docs show that array_split basically makes equal-sized groups when it can, and a smaller group at the end when the data can't be split evenly. That is what I want, but currently, looking at the length of each frame in the new split_frames list:
split_len = pd.DataFrame([len(a) for a in split_frames])
split_len.to_csv('lengths.csv')
the first 698 frames are 1000 elements long, and then the rest (frames 699 to 1190) are 999 elements long.
This seemingly arbitrary break in lengths happens no matter what number I pass for sections (rounded, even, or anything else).
I'm trying to figure out why it doesn't produce equal frame lengths for all but the last frame, as in the docs:
>>> x = np.arange(8.0)
>>> np.array_split(x, 3)
[array([ 0., 1., 2.]), array([ 3., 4., 5.]), array([ 6., 7.])]
Any help is appreciated, thanks!
array_split does not make some number of equal sections plus one section holding the remainder. If you split an array of length l into n sections, it makes l % n sections of size l//n + 1 and the rest of size l//n. See the source for details. (This really ought to be explained in the docs.)
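To make that arithmetic concrete, here is a quick check with the question's numbers (a sketch; np.empty is just a stand-in for the real data, since only the chunk lengths matter):
import numpy as np

l, n = 1190508, 1191                    # array length and number of requested sections
parts = np.array_split(np.empty(l), n)  # dummy data of the same length
sizes = [len(p) for p in parts]
print(sizes.count(l // n + 1))          # l % n   -> 699 sections of 1000 elements
print(sizes.count(l // n))              # the rest -> 492 sections of 999 elements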
As @user2357112 writes, array_split doesn't do what you think it does, and looking at the docs, it's hard to figure out what it does do. In fact, I would say its behavior is effectively undefined: we expect it to return something, but we don't know what properties that something will have.
To get what you want, I would use numpy.split, which lets you provide custom split indices. For example:
def greedy_split(arr, n, axis=0):
    """Greedily splits an array into n blocks.

    Splits array arr along axis into n blocks such that:
        - blocks 1 through n-1 are all the same size
        - the sum of all block sizes is equal to arr.shape[axis]
        - the last block is nonempty, and not bigger than the other blocks

    Intuitively, this "greedily" splits the array along the axis by making
    the first blocks as big as possible, then putting the leftovers in the
    last block.
    """
    length = arr.shape[axis]
    # compute the size of each of the first n-1 blocks
    block_size = int(np.ceil(length / float(n)))
    # the indices at which the splits will occur
    ix = np.arange(block_size, length, block_size)
    return np.split(arr, ix, axis)
Some examples:
>>> x = np.arange(10)
>>> greedy_split(x, 2)
[array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])]
>>> greedy_split(x, 3)
[array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([8, 9])]
>>> greedy_split(x, 4)
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([9])]