`uniq` for 2D Theano tensor
I have this code:
    import numpy as np

    def uniq(seq):
        """
        Like Unix tool uniq. Removes repeated entries.
        :param seq: numpy.array. (time,) -> label
        :return: seq
        """
        diffs = np.ones_like(seq)
        diffs[1:] = seq[1:] - seq[:-1]
        idx = diffs.nonzero()
        return seq[idx]
Now I want to expand this to support 2D arrays and use Theano. It should be fast on the GPU.
The input is an array holding several sequences as a batch, in the format (time, batch), together with a time_mask that indirectly indicates the length of each sequence.
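To make the role of the mask concrete, here is a minimal NumPy sketch. It assumes the mask is 1 for the leading valid frames of each sequence and 0 for the padding after them; the lengths and `max_time` are made-up values for illustration only.

```python
import numpy as np

# Hypothetical example: 3 sequences of lengths 6, 7 and 5, padded to 7 frames.
seq_lens = np.array([6, 7, 5])
max_time = 7

# time_mask[t, b] is 1 while frame t lies inside sequence b, else 0.
time_mask = (np.arange(max_time)[:, None] < seq_lens[None, :]).astype("int8")

# Summing over the time axis recovers the length of each sequence.
recovered_lens = time_mask.sum(axis=0)
```

Under that assumption the mask and the lengths carry the same information, which is why the mask only "indirectly" gives the lengths.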
My current attempt:
    import theano
    import theano.tensor as T

    def uniq_with_lengths(seq, time_mask):
        # seq is (time,batch) -> label
        # time_mask is (time,batch) -> 0 or 1
        num_batches = seq.shape[1]
        diffs = T.ones_like(seq)
        diffs = T.set_subtensor(diffs[1:], seq[1:] - seq[:-1])
        time_range = T.arange(seq.shape[0]).dimshuffle([0] + ['x'] * (seq.ndim - 1))
        idx = T.switch(T.neq(diffs, 0) * time_mask, time_range, -1)
        seq_lens = T.sum(T.ge(idx, 0), axis=0)  # (batch,) -> len
        max_seq_len = T.max(seq_lens)

        # I don't know any better way without scan.
        def step(batch_idx, out_seq_b1):
            out_seq = seq[T.ge(idx[:, batch_idx], 0).nonzero(), batch_idx][0]
            return T.concatenate((out_seq, T.zeros((max_seq_len - out_seq.shape[0],), dtype=seq.dtype)))

        out_seqs, _ = theano.scan(
            step,
            sequences=[T.arange(num_batches)],
            outputs_info=[T.zeros((max_seq_len,), dtype=seq.dtype)]
        )
        # out_seqs is (batch,max_seq_len)
        return out_seqs.T, seq_lens
How can I build `out_seqs` directly? I would do something like `out_seqs = seq[idx]`, but I'm not really sure how to express it.
Here's a quick answer that only addresses part of your problem:
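    import theano
    import theano.tensor as tt

    def compile_theano_uniq(x):
        diffs = x[1:] - x[:-1]
        diffs = tt.concatenate([tt.ones_like([x[0]], dtype=diffs.dtype), diffs])
        # Index x at the positions where the value changes.
        y = x[diffs.nonzero()]
        return theano.function(inputs=[x], outputs=y)

    theano_uniq = compile_theano_uniq(tt.vector(dtype='int32'))
The key is `nonzero()`: it gives the positions where the value changes, and indexing `x` with them keeps the first label of each run. (`diffs.nonzero_values()` alone would return the nonzero differences, not the labels.)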
Update: I can't see how to do this without using `theano.scan`. To be clear, and using 0 as padding, I am assuming that given the input
1 1 2 3 3 4 0
1 2 2 2 3 3 4
1 2 3 4 5 0 0
you want the output to be
1 2 3 4 0 0 0
1 2 3 4 0 0 0
1 2 3 4 5 0 0
or even
1 2 3 4 0
1 2 3 4 0
1 2 3 4 5
You can identify the indices of the items you want to keep without using a scan. Then either the new tensor has to be built from scratch, or the values you want to keep have to be moved to make the sequences contiguous. As far as I can tell, neither approach is possible without `theano.scan`.
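As a reference for checking a Theano version against, the two steps above can be sketched in plain NumPy. This is only a sketch under stated assumptions: the function name `uniq_padded_ref` is made up, the layout is (batch, time) as in the example rows above, and 0 is taken to be the padding value. The Python loop stands in for the per-sequence step.

```python
import numpy as np

def uniq_padded_ref(x, mask):
    """Hypothetical NumPy reference; x and mask are (batch, time), pad value 0."""
    x = np.asarray(x)
    mask = np.asarray(mask).astype(bool)
    # Step 1 (no loop): mark the first element of each run inside the mask.
    keep = np.ones(x.shape, dtype=bool)
    keep[:, 1:] = x[:, 1:] != x[:, :-1]
    keep &= mask
    lens = keep.sum(axis=1)
    # Step 2: build the output from scratch, one row at a time.
    out = np.zeros((x.shape[0], int(lens.max())), dtype=x.dtype)
    for b in range(x.shape[0]):
        out[b, :lens[b]] = x[b, keep[b]]
    return out, lens

x = np.array([[1, 1, 2, 3, 3, 4, 0],
              [1, 2, 2, 2, 3, 3, 4],
              [1, 2, 3, 4, 5, 0, 0]])
# Assumption for this example only: every nonzero entry is a valid label.
out, lens = uniq_padded_ref(x, x != 0)
```

For the example input this gives the shorter of the two outputs shown above, padded to the longest deduplicated sequence.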