How can I bitwise or truncate along the axis of a boolean array in Julia?

I'm trying to find the best way to do bitwise or reduce a 3D boolean array of masks to 2D in Julia.

I can always write a for loop, of course:

x = randbool(3,3,3)
out = copy(x[:,:,1])
for i = 1:3
    for j = 1:3
        for k = 2:3
            out[i,j] |= x[i,j,k]
        end
    end
end

      

But I'm wondering if there is a better way to do the reduction.

+3


source to share


4 answers


The simple answer would be

out = x[:,:,1] | x[:,:,2] | x[:,:,3]

      

but i did some benchmarking:

function simple(n,x)
    out = x[:,:,1] | x[:,:,2]
    for k = 3:n
        @inbounds out |= x[:,:,k]
    end
    return out
end

function forloops(n,x)
    out = copy(x[:,:,1])
    for i = 1:n
        for j = 1:n
            for k = 2:n
                @inbounds out[i,j] |= x[i,j,k]
            end
        end
    end
    return out
end

function forloopscolfirst(n,x)
    out = copy(x[:,:,1])
    for j = 1:n
        for i = 1:n
            for k = 2:n
                @inbounds out[i,j] |= x[i,j,k]
            end
        end
    end
    return out
end

shorty(n,x) = |([x[:,:,i] for i in 1:n]...)

timholy(n,x) = any(x,3)

function runtest(n)
    x = randbool(n,n,n)

    @time out1 = simple(n,x)
    @time out2 = forloops(n,x)
    @time out3 = forloopscolfirst(n,x)
    @time out4 = shorty(n,x)
    @time out5 = timholy(n,x)

    println(all(out1 .== out2))
    println(all(out1 .== out3))
    println(all(out1 .== out4))
    println(all(out1 .== out5))
end

runtest(3)
runtest(500)

      



which gave the following results

# For 500
simple: 0.039403016 seconds (39716840 bytes allocated)
forloops: 6.259421683 seconds (77504 bytes allocated)
forloopscolfirst 1.809124505 seconds (77504 bytes allocated)
shorty: elapsed time: 0.050384062 seconds (39464608 bytes allocated)
timholy: 2.396887396 seconds (31784 bytes allocated)

      

So, I would go with simple

orshorty

+2


source


There are various standard optimization tricks and hints that can be applied, but the critical point to make here is that Julia organizes the array in a column-to-column rather than a row-large order. For small-sized arrays, it is not easy to see, but as the arrays grow, it says. There is a method to reduce , provided it is optimized to execute a function on the collection (in this case, OR ), but this comes with a cost. If the number of step combinations is relatively small, it is best to simply loop. In all cases, minimizing the amount of memory access is better. Below are various optimization attempts with these two things in mind.

Various attempts and observations

Initial function

Here's a function that takes your example and summarizes it.

function boolReduce1(x)
   out = copy(x[:,:,1])
   for i = 1:size(x,1)
       for j = 1:size(x,2)
           for k = 2:size(x,3)
               out[i,j] |= x[i,j,k]
           end
       end
   end
   out
end

      

By creating a fairly large array, we can execute it for a while

julia> @time boolReduce1(b); 
elapsed time: 42.372058096 seconds (1056704 bytes allocated)

      

Applying optimizations

Here's another similar version, but with standard types, uses @inbounds and inverts loops.

function boolReduce2(b::BitArray{3})
   a = BitArray{2}(size(b)[1:2]...)
   for j = 1:size(b,2)
      for i = 1:size(b,1)
         @inbounds a[i,j] = b[i,j,1]
         for k = 2:size(b,3)
            @inbounds a[i,j] |= b[i,j,k]
         end
      end
   end
   a
end

      

And take the time

julia> @time boolReduce2(b);
elapsed time: 12.892392891 seconds (500520 bytes allocated)

      

Insight

The second function is much faster and also less memory allocated because the temporary array was not created. But what if we just take the first function and invert the array indexing?

function boolReduce3(x)
   out = copy(x[:,:,1])
   for j = 1:size(x,2)
       for i = 1:size(x,1)
           for k = 2:size(x,3)
               out[i,j] |= x[i,j,k]
           end
       end
   end
   out
end

      

and take the time

julia> @time boolReduce3(b);
elapsed time: 12.451501749 seconds (1056704 bytes allocated)

      



It is as fast as the second function.

Using abbreviation

There is a reduce function that we can use to eliminate the 3rd cycle. Its function is to reapply an operation to all elements with the result of the previous operation. This is exactly what we want.

function boolReduce4(b)
   a = BitArray{2}(size(b)[1:2]...)
   for j = 1:size(b,2)
      for i = 1:size(b,1)
         @inbounds a[i,j] = reduce(|,b[i,j,:])
     end
   end
   a
end

      

Now take your time

julia> @time boolReduce4(b);
elapsed time: 15.828273008 seconds (1503092520 bytes allocated, 4.07% gc time)

      

This is fine, but not as fast as the simple optimized original. The reason is to look at all the extra memory that has been allocated. This is because data has to be copied from all over in order to create input data to shrink.

Combining things

But what if we are as exhaustive as possible. Instead of decreasing the last index, the first one is?

function boolReduceX(b)
   a = BitArray{2}(size(b)[2:3]...)
   for j = 1:size(b,3)
      for i = 1:size(b,2)
         @inbounds a[i,j] = reduce(|,b[:,i,j])
      end
   end
   a
end

      

And now create a similar array and time it.

julia> c = randbool(200,2000,2000);

julia> @time boolReduceX(c);
elapsed time: 1.877547669 seconds (927092520 bytes allocated, 21.66% gc time)

      

The result is 20 times faster than the original version for large arrays. Pretty good.

But what if the average size?

If the size is very large then the above function looks best, but if the size of the dataset is smaller the use of the reduction is not paid back enough and the next is faster. Including a temporary variable makes things faster since version 2. Another version of boolReduceX, using a loop instead of a shortcut (not shown here), was even faster.

function boolReduce5(b)
   a = BitArray{2}(size(b)[1:2]...)
   for j = 1:size(b,2)
      for i = 1:size(b,1)
         @inbounds t = b[i,j,1]
         for k = 2:size(b,3)
            @inbounds t |= b[i,j,k]
         end
         @inbounds a[i,j] = t
      end
   end
   a
end

julia> b = randbool(2000,2000,20);
julia> c = randbool(20,2000,2000);

julia> @time  boolReduceX(c);
elapsed time: 1.535334322 seconds (799092520 bytes allocated, 23.79% gc time)

julia> @time  boolReduce5(b);
elapsed time: 0.491410981 seconds (500520 bytes allocated)

      

+2


source


Try it any(x, 3)

. Just typing a little more so StackOverflow doesn't override this answer.

+2


source


Degenerate faster. It's just a matter of what kind of work you want to put in. The naive detected approach is slow because BitArray: extracting contiguous regions and bitwise OR can be done 64-bit at the same time, but the naive detected approach works an element at a time. In addition, BitArrays indexing is slow because there is a sequence of bit operations and because it cannot be embedded now due to bounds checking. Here's a strategy, deterministic, but using a BitArray structure. Most of the code is copied from copy_chunks! in bitarray.jl and I haven't tried to remove it (sorry!).

function devec(n::Int, x::BitArray)
    src = x.chunks
    out = falses(n, n)
    dest = out.chunks
    numbits = n*n

    kd0 = 1
    ld0 = 0
    for j = 1:n
        pos_s = (n*n)*(j-1)+1
        kd1, ld1 = Base.get_chunks_id(numbits - 1)
        ks0, ls0 = Base.get_chunks_id(pos_s)
        ks1, ls1 = Base.get_chunks_id(pos_s + numbits - 1)

        delta_kd = kd1 - kd0
        delta_ks = ks1 - ks0

        u = Base._msk64
        if delta_kd == 0
            msk_d0 = ~(u << ld0) | (u << (ld1+1))
        else
            msk_d0 = ~(u << ld0)
            msk_d1 = (u << (ld1+1))
        end
        if delta_ks == 0
            msk_s0 = (u << ls0) & ~(u << (ls1+1))
        else
            msk_s0 = (u << ls0)
        end

        chunk_s0 = Base.glue_src_bitchunks(src, ks0, ks1, msk_s0, ls0)

        dest[kd0] |= (dest[kd0] & msk_d0) | ((chunk_s0 << ld0) & ~msk_d0)

        delta_kd == 0 && continue

        for i = 1 : kd1 - kd0
            chunk_s1 = Base.glue_src_bitchunks(src, ks0 + i, ks1, msk_s0, ls0)

            chunk_s = (chunk_s0 >>> (64 - ld0)) | (chunk_s1 << ld0)

            dest[kd0 + i] |= chunk_s

            chunk_s0 = chunk_s1
        end
    end

    out
end

      

With Iain tests, this gives me:

simple: 0.051321131 seconds (46356000 bytes allocated, 30.03% gc time)
forloops: 6.226652258 seconds (92976 bytes allocated)
forloopscolfirst: 2.099381939 seconds (89472 bytes allocated)
shorty: 0.060194226 seconds (46387760 bytes allocated, 36.27% gc time)
timholy: 2.464298752 seconds (31784 bytes allocated)
devec: 0.008734413 seconds (31472 bytes allocated)

      

0


source







All Articles