How can I bitwise or truncate along the axis of a boolean array in Julia?
I'm trying to find the best way to do bitwise or reduce a 3D boolean array of masks to 2D in Julia.
I can always write a for loop, of course:
x = randbool(3,3,3)
out = copy(x[:,:,1])
for i = 1:3
for j = 1:3
for k = 2:3
out[i,j] |= x[i,j,k]
end
end
end
But I'm wondering if there is a better way to do the reduction.
source to share
The simple answer would be
out = x[:,:,1] | x[:,:,2] | x[:,:,3]
but i did some benchmarking:
function simple(n,x)
out = x[:,:,1] | x[:,:,2]
for k = 3:n
@inbounds out |= x[:,:,k]
end
return out
end
function forloops(n,x)
out = copy(x[:,:,1])
for i = 1:n
for j = 1:n
for k = 2:n
@inbounds out[i,j] |= x[i,j,k]
end
end
end
return out
end
function forloopscolfirst(n,x)
out = copy(x[:,:,1])
for j = 1:n
for i = 1:n
for k = 2:n
@inbounds out[i,j] |= x[i,j,k]
end
end
end
return out
end
shorty(n,x) = |([x[:,:,i] for i in 1:n]...)
timholy(n,x) = any(x,3)
function runtest(n)
x = randbool(n,n,n)
@time out1 = simple(n,x)
@time out2 = forloops(n,x)
@time out3 = forloopscolfirst(n,x)
@time out4 = shorty(n,x)
@time out5 = timholy(n,x)
println(all(out1 .== out2))
println(all(out1 .== out3))
println(all(out1 .== out4))
println(all(out1 .== out5))
end
runtest(3)
runtest(500)
which gave the following results
# For 500
simple: 0.039403016 seconds (39716840 bytes allocated)
forloops: 6.259421683 seconds (77504 bytes allocated)
forloopscolfirst 1.809124505 seconds (77504 bytes allocated)
shorty: elapsed time: 0.050384062 seconds (39464608 bytes allocated)
timholy: 2.396887396 seconds (31784 bytes allocated)
So, I would go with simple
orshorty
source to share
There are various standard optimization tricks and hints that can be applied, but the critical point to make here is that Julia organizes the array in a column-to-column rather than a row-large order. For small-sized arrays, it is not easy to see, but as the arrays grow, it says. There is a method to reduce , provided it is optimized to execute a function on the collection (in this case, OR ), but this comes with a cost. If the number of step combinations is relatively small, it is best to simply loop. In all cases, minimizing the amount of memory access is better. Below are various optimization attempts with these two things in mind.
Various attempts and observations
Initial function
Here's a function that takes your example and summarizes it.
function boolReduce1(x)
out = copy(x[:,:,1])
for i = 1:size(x,1)
for j = 1:size(x,2)
for k = 2:size(x,3)
out[i,j] |= x[i,j,k]
end
end
end
out
end
By creating a fairly large array, we can execute it for a while
julia> @time boolReduce1(b);
elapsed time: 42.372058096 seconds (1056704 bytes allocated)
Applying optimizations
Here's another similar version, but with standard types, uses @inbounds and inverts loops.
function boolReduce2(b::BitArray{3})
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds a[i,j] = b[i,j,1]
for k = 2:size(b,3)
@inbounds a[i,j] |= b[i,j,k]
end
end
end
a
end
And take the time
julia> @time boolReduce2(b);
elapsed time: 12.892392891 seconds (500520 bytes allocated)
Insight
The second function is much faster and also less memory allocated because the temporary array was not created. But what if we just take the first function and invert the array indexing?
function boolReduce3(x)
out = copy(x[:,:,1])
for j = 1:size(x,2)
for i = 1:size(x,1)
for k = 2:size(x,3)
out[i,j] |= x[i,j,k]
end
end
end
out
end
and take the time
julia> @time boolReduce3(b);
elapsed time: 12.451501749 seconds (1056704 bytes allocated)
It is as fast as the second function.
Using abbreviation
There is a reduce function that we can use to eliminate the 3rd cycle. Its function is to reapply an operation to all elements with the result of the previous operation. This is exactly what we want.
function boolReduce4(b)
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds a[i,j] = reduce(|,b[i,j,:])
end
end
a
end
Now take your time
julia> @time boolReduce4(b);
elapsed time: 15.828273008 seconds (1503092520 bytes allocated, 4.07% gc time)
This is fine, but not as fast as the simple optimized original. The reason is to look at all the extra memory that has been allocated. This is because data has to be copied from all over in order to create input data to shrink.
Combining things
But what if we are as exhaustive as possible. Instead of decreasing the last index, the first one is?
function boolReduceX(b)
a = BitArray{2}(size(b)[2:3]...)
for j = 1:size(b,3)
for i = 1:size(b,2)
@inbounds a[i,j] = reduce(|,b[:,i,j])
end
end
a
end
And now create a similar array and time it.
julia> c = randbool(200,2000,2000);
julia> @time boolReduceX(c);
elapsed time: 1.877547669 seconds (927092520 bytes allocated, 21.66% gc time)
The result is 20 times faster than the original version for large arrays. Pretty good.
But what if the average size?
If the size is very large then the above function looks best, but if the size of the dataset is smaller the use of the reduction is not paid back enough and the next is faster. Including a temporary variable makes things faster since version 2. Another version of boolReduceX, using a loop instead of a shortcut (not shown here), was even faster.
function boolReduce5(b)
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds t = b[i,j,1]
for k = 2:size(b,3)
@inbounds t |= b[i,j,k]
end
@inbounds a[i,j] = t
end
end
a
end
julia> b = randbool(2000,2000,20);
julia> c = randbool(20,2000,2000);
julia> @time boolReduceX(c);
elapsed time: 1.535334322 seconds (799092520 bytes allocated, 23.79% gc time)
julia> @time boolReduce5(b);
elapsed time: 0.491410981 seconds (500520 bytes allocated)
source to share
Degenerate faster. It's just a matter of what kind of work you want to put in. The naive detected approach is slow because BitArray: extracting contiguous regions and bitwise OR can be done 64-bit at the same time, but the naive detected approach works an element at a time. In addition, BitArrays indexing is slow because there is a sequence of bit operations and because it cannot be embedded now due to bounds checking. Here's a strategy, deterministic, but using a BitArray structure. Most of the code is copied from copy_chunks! in bitarray.jl and I haven't tried to remove it (sorry!).
function devec(n::Int, x::BitArray)
src = x.chunks
out = falses(n, n)
dest = out.chunks
numbits = n*n
kd0 = 1
ld0 = 0
for j = 1:n
pos_s = (n*n)*(j-1)+1
kd1, ld1 = Base.get_chunks_id(numbits - 1)
ks0, ls0 = Base.get_chunks_id(pos_s)
ks1, ls1 = Base.get_chunks_id(pos_s + numbits - 1)
delta_kd = kd1 - kd0
delta_ks = ks1 - ks0
u = Base._msk64
if delta_kd == 0
msk_d0 = ~(u << ld0) | (u << (ld1+1))
else
msk_d0 = ~(u << ld0)
msk_d1 = (u << (ld1+1))
end
if delta_ks == 0
msk_s0 = (u << ls0) & ~(u << (ls1+1))
else
msk_s0 = (u << ls0)
end
chunk_s0 = Base.glue_src_bitchunks(src, ks0, ks1, msk_s0, ls0)
dest[kd0] |= (dest[kd0] & msk_d0) | ((chunk_s0 << ld0) & ~msk_d0)
delta_kd == 0 && continue
for i = 1 : kd1 - kd0
chunk_s1 = Base.glue_src_bitchunks(src, ks0 + i, ks1, msk_s0, ls0)
chunk_s = (chunk_s0 >>> (64 - ld0)) | (chunk_s1 << ld0)
dest[kd0 + i] |= chunk_s
chunk_s0 = chunk_s1
end
end
out
end
With Iain tests, this gives me:
simple: 0.051321131 seconds (46356000 bytes allocated, 30.03% gc time)
forloops: 6.226652258 seconds (92976 bytes allocated)
forloopscolfirst: 2.099381939 seconds (89472 bytes allocated)
shorty: 0.060194226 seconds (46387760 bytes allocated, 36.27% gc time)
timholy: 2.464298752 seconds (31784 bytes allocated)
devec: 0.008734413 seconds (31472 bytes allocated)
source to share