In-place computation with dask

Short version

I have a dask array whose graph is ultimately backed by a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and write the results back into those same backing numpy arrays, so that everything happens in place?

If your reaction is "you are using dask incorrectly", take a look at the long version below for why I feel the need to do this.

Long version

I am using dask for an application where the raw data consists of in-memory numpy arrays holding data gathered from a scientific instrument. The goal is to fill most of the RAM (say 75%+) with raw data, which means there is not enough left over to make a copy in memory. That makes it semantically a bit like an out-of-core problem: any derived value can only be materialized in memory chunk by chunk, never all at once.

Dask works well for this, except for one wrinkle. I am oversimplifying a lot, but on most of the data (call it X) we need to apply an elementwise operation f, compute some summary statistic s(f(X)), and use that to compute another result from the data, say t(s(f(X)), f(X)). While all of these functions are dask-friendly (they can be applied chunk by chunk), naively running this dask graph keeps all of f(X) in memory at once, because the chunks are needed again for the second pass.

An alternative is to compute s on its own before requesting t (as suggested in https://github.com/dask/dask/issues/874) and thus pay for computing f(X) twice, but f is somewhat expensive, so I would like to avoid that.
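To make the two-pass workaround concrete, here is a minimal sketch with hypothetical stand-ins for the operations (f squares elementwise, s takes the mean, t sums deviations from that mean; none of these specific functions are from the original question):

```python
import numpy as np
import dask.array as da

# Hypothetical stand-ins: f squares elementwise, s takes the mean,
# t sums the deviations of f(X) from s(f(X)).
X = da.from_array(np.arange(1_000_000, dtype=float).reshape(1000, 1000),
                  chunks=(100, 1000))
fX = X ** 2            # f(X): elementwise, chunk-friendly
s = fX.mean()          # s(f(X)): summary statistic

# Two-pass workaround: materialize s first, then build t on top of the
# concrete value; the chunks of f(X) are recomputed during the second pass
# because dask does not keep intermediate chunks around by default.
s_value = s.compute()
t = (fX - s_value).sum()    # t(s(f(X)), f(X))
t_value = t.compute()
```

Each compute() call only ever holds a few chunks of f(X) in memory, at the cost of evaluating f twice.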

However, once f has been applied, the raw data is no longer needed. So I would like to run da.store(f(X)) and write the result back into the source numpy arrays. Technically I think I know how to set this up, and as long as every piece of data is completely consumed before it gets overwritten there is no race condition. But I am worried that mutating the data underneath dask violates the API contract and might go wrong in ways I cannot see. Is there a way to ensure this is safe?

One way I can already see this going wrong: if multiple input arrays have the same contents, they get the same name in the dask graph and are therefore merged into a single node. I use name=False in da.from_array, so that shouldn't be a problem.
