`uniq` for 2D Theano tensor

Question

I have this code:

import numpy as np

def uniq(seq):
  """
  Like Unix tool uniq. Removes repeated entries.
  :param seq: numpy.array. (time,) -> label
  :return: seq
  """
  diffs = np.ones_like(seq)
  diffs[1:] = seq[1:] - seq[:-1]
  idx = diffs.nonzero()
  return seq[idx]

Now I want to expand this to support 2D arrays and use Theano. It should be fast on the GPU.

I will get an array holding multiple sequences as a batch, in the format (time, batch), together with a time_mask that indirectly indicates the length of each sequence.
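To make the input format concrete, here is a small NumPy sketch. It assumes that time_mask is 1 for valid frames and 0 for trailing padding, so that the length of each sequence is simply the column sum; the array values are made up for illustration.

```python
import numpy as np

# Hypothetical toy batch: 5 time steps, 2 sequences, layout (time, batch).
seq = np.array([[1, 7],
                [1, 7],
                [2, 8],
                [2, 0],
                [3, 0]], dtype="int32")

# Assumption: 1 marks a valid frame, 0 marks trailing padding.
time_mask = np.array([[1, 1],
                      [1, 1],
                      [1, 1],
                      [1, 0],
                      [1, 0]], dtype="int8")

# (batch,) -> length of each sequence before removing repeats
seq_lens = time_mask.sum(axis=0)
```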

My current attempt:

import theano
import theano.tensor as T

def uniq_with_lengths(seq, time_mask):
  # seq is (time,batch) -> label
  # time_mask is (time,batch) -> 0 or 1
  num_batches = seq.shape[1]
  diffs = T.ones_like(seq)
  diffs = T.set_subtensor(diffs[1:], seq[1:] - seq[:-1])
  time_range = T.arange(seq.shape[0]).dimshuffle([0] + ['x'] * (seq.ndim - 1))
  idx = T.switch(T.neq(diffs, 0) * time_mask, time_range, -1)
  seq_lens = T.sum(T.ge(idx, 0), axis=0)  # (batch,) -> len
  max_seq_len = T.max(seq_lens)

  # I don't know any better way without scan.
  def step(batch_idx, out_seq_b1):
    out_seq = seq[T.ge(idx[:, batch_idx], 0).nonzero(), batch_idx][0]
    return T.concatenate((out_seq, T.zeros((max_seq_len - out_seq.shape[0],), dtype=seq.dtype)))

  out_seqs, _ = theano.scan(
    step,
    sequences=[T.arange(num_batches)],
    outputs_info=[T.zeros((max_seq_len,), dtype=seq.dtype)]
  )
  # out_seqs is (batch,max_seq_len)
  return out_seqs.T, seq_lens

How can I build out_seqs directly?

I would do something like out_seqs = seq[idx], but I'm not really sure how to express it.

python numpy theano

Albert, Jul 13 '15 at 9:39

1 answer

Daniel Renshaw · Accepted Answer · 2015-07-13T13:01:13+0000

Here's a quick answer that only addresses part of your problem:

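import theano
import theano.tensor as tt

def compile_theano_uniq(x):
    # Difference between neighbours; zero wherever an entry repeats.
    diffs = x[1:] - x[:-1]
    # Prepend a one so that the first element is always kept.
    diffs = tt.concatenate([tt.ones_like([x[0]], dtype=diffs.dtype), diffs])
    # Keep the entries of x at the positions where the value changes.
    y = x[diffs.nonzero()]
    return theano.function(inputs=[x], outputs=y)

theano_uniq = compile_theano_uniq(tt.vector(dtype='int32'))

The key is nonzero(), which gives the positions where the value changes. Calling nonzero_values() on diffs would return the non-zero differences themselves rather than the entries of x.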

Update: I can't see how to do this without using theano.scan. To be clear, and using 0 as padding, I am assuming that given the input

1 1 2 3 3 4 0
1 2 2 2 3 3 4
1 2 3 4 5 0 0

you want the output to be

1 2 3 4 0 0 0
1 2 3 4 0 0 0
1 2 3 4 5 0 0

or even

1 2 3 4 0
1 2 3 4 0
1 2 3 4 5
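The intended behaviour can be pinned down with a plain NumPy reference. This is only a loop-based sketch of the specification, not a Theano solution. It assumes one sequence per row (the layout of the example above, not the (time, batch) layout of the question) and that the original lengths are known; the function name uniq_rows_reference is made up for illustration.

```python
import numpy as np

def uniq_rows_reference(batch, lengths, pad=0):
    """Loop-based reference: collapse repeats in each row, then right-pad."""
    collapsed = []
    for row, n in zip(batch, lengths):
        row = row[:n]                       # drop the padding
        keep = np.ones(n, dtype=bool)       # always keep the first element
        keep[1:] = row[1:] != row[:-1]      # keep where the value changes
        collapsed.append(row[keep])
    new_lengths = np.array([len(r) for r in collapsed])
    # Allocate the padded output and copy each collapsed row into it.
    out = np.full((len(collapsed), new_lengths.max()), pad, dtype=batch.dtype)
    for i, r in enumerate(collapsed):
        out[i, :len(r)] = r
    return out, new_lengths

batch = np.array([[1, 1, 2, 3, 3, 4, 0],
                  [1, 2, 2, 2, 3, 3, 4],
                  [1, 2, 3, 4, 5, 0, 0]])
lengths = np.array([6, 7, 5])
out, new_lengths = uniq_rows_reference(batch, lengths)
```

Here out has the compact second form shown above, padded only up to the longest collapsed sequence, and new_lengths holds the length of each collapsed sequence.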

You can identify the indices of the items you want to keep without using a scan. After that, either the new tensor has to be built from scratch, or the values you want to keep have to be moved so that each sequence is contiguous. As far as I can tell, neither approach is possible without theano.scan.
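The first step, identifying which entries to keep, might look like this in NumPy with no loop over sequences. The sketch covers only that step, not the compaction; it again assumes one sequence per row and known lengths, and the variable names are illustrative.

```python
import numpy as np

batch = np.array([[1, 1, 2, 3, 3, 4, 0],
                  [1, 2, 2, 2, 3, 3, 4],
                  [1, 2, 3, 4, 5, 0, 0]])
lengths = np.array([6, 7, 5])

# valid[b, t] is True for real entries and False for padding.
valid = np.arange(batch.shape[1])[None, :] < lengths[:, None]

# changed[b, t] is True where an entry differs from its left neighbour;
# the first column is always True.
changed = np.ones(batch.shape, dtype=bool)
changed[:, 1:] = batch[:, 1:] != batch[:, :-1]

keep = changed & valid            # entries that survive uniq
new_lengths = keep.sum(axis=1)    # length of each collapsed sequence
```

The same elementwise operations exist in Theano (comparison, arange with broadcasting, sum), so this part translates directly; what remains is packing the kept entries to the front of each row.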
