Difference between tf.nn.conv2d and tf.nn.depthwise_conv2d

What is the difference between tf.nn.conv2d and tf.nn.depthwise_conv2d in TensorFlow?



3 answers


I'm not an expert on this, but as far as I understand, the difference is this:

Let's say you have a color input image 100 pixels high and 100 pixels wide, so its dimensions are 100x100x3. For both examples we use a filter of width and height 5, and suppose we want the next layer to have a depth of 8.



In tf.nn.conv2d, you define the kernel shape as [width, height, in_channels, out_channels]. In our case this means the kernel has the shape [5,5,3,out_channels]. A 5x5x3 weight kernel slides over the image, and at each position it produces 8 values, one per output channel, yielding 8 different feature maps over the entire image.

In tf.nn.depthwise_conv2d, you define the kernel shape as [width, height, in_channels, channel_multiplier]. Now the output is produced differently. Separate 5x5x1 filters slide over each input channel, one filter per channel, each producing one feature map for that channel. So a kernel of shape [5,5,3,1] generates an output with a depth of 3. The channel multiplier tells you how many different filters you want to apply to each channel. So the originally desired output depth of 8 is not possible with 3 input channels; only multiples of 3 are possible.
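To make the shape bookkeeping concrete, here is a minimal sketch in plain Python, using the hypothetical sizes from this example (3 input channels, 5x5 filters, desired depth 8):

```python
# Shape bookkeeping for the example above: a 100x100x3 input and 5x5 filters.
in_channels = 3

# tf.nn.conv2d kernel: [5, 5, in_channels, out_channels].
# Each output channel sums over all input channels, so any output depth
# (here 8) is reachable.
conv_kernel_shape = (5, 5, in_channels, 8)
conv_out_depth = conv_kernel_shape[3]

# tf.nn.depthwise_conv2d kernel: [5, 5, in_channels, channel_multiplier].
# Each input channel is filtered on its own, so the output depth is always
# in_channels * channel_multiplier -- a multiple of 3 here, never 8.
dw_kernel_shape = (5, 5, in_channels, 1)
dw_out_depth = dw_kernel_shape[2] * dw_kernel_shape[3]

print(conv_out_depth, dw_out_depth)  # 8 3
```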



Check out the sample code in the TensorFlow API (r1.7)

For depthwise_conv2d,

output[b, i, j, k * channel_multiplier + q] =
    sum_{di, dj} input[b, strides[1] * i + rate[0] * di,
                          strides[2] * j + rate[1] * dj, k] *
                 filter[di, dj, k, q]

with filter of shape [filter_height, filter_width, in_channels, channel_multiplier].

For conv2d,

output[b, i, j, k] =
    sum_{di, dj, q} input[b, strides[1] * i + di,
                             strides[2] * j + dj, q] *
                    filter[di, dj, q, k]

with filter of shape [filter_height, filter_width, in_channels, out_channels].

By focusing on k and q, we can see the difference described above.

The default format is NHWC, where b is the batch size and (i, j) are the coordinates in the feature map.

(Note that k and q refer to different things in the two functions.)

  • For depthwise_conv2d, k refers to the input channel, and q, with 0 <= q < channel_multiplier, indexes the filters applied to that channel. Each input channel k expands into channel_multiplier output channels, starting at index k * channel_multiplier, each produced by a different [filter_height, filter_width] filter. It performs no cross-channel operations; in some literature this is called channel-wise spatial convolution. The process above can be summarized as applying each filter kernel to each channel separately and concatenating the outputs.
  • For conv2d, k refers to the output channel and q to the input channel. The sum runs over all input channels, which means each output channel k is connected to all input channels q through a [filter_height, filter_width, in_channels] filter.
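Both definitions can be sketched naively in numpy, assuming stride 1, rate 1, and VALID padding. The loops below are written only to mirror the two index formulas above, not for speed:

```python
import numpy as np

def conv2d_naive(x, f):
    # x: [H, W, C_in]; f: [kh, kw, C_in, C_out]; stride 1, VALID padding.
    kh, kw, c_in, c_out = f.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((H, W, c_out))
    for i in range(H):
        for j in range(W):
            for k in range(c_out):
                # sum over di, dj AND the input channel q
                out[i, j, k] = np.sum(x[i:i+kh, j:j+kw, :] * f[:, :, :, k])
    return out

def depthwise_conv2d_naive(x, f):
    # x: [H, W, C_in]; f: [kh, kw, C_in, mult]; stride 1, VALID padding.
    kh, kw, c_in, mult = f.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((H, W, c_in * mult))
    for i in range(H):
        for j in range(W):
            for k in range(c_in):        # k: input channel
                for q in range(mult):    # q: per-channel filter index
                    # sum over di, dj only; channel k is never mixed
                    out[i, j, k * mult + q] = np.sum(
                        x[i:i+kh, j:j+kw, k] * f[:, :, k, q])
    return out

x = np.random.randn(6, 6, 3)
print(conv2d_naive(x, np.random.randn(3, 3, 3, 8)).shape)            # (4, 4, 8)
print(depthwise_conv2d_naive(x, np.random.randn(3, 3, 3, 2)).shape)  # (4, 4, 6)
```

Note that with a single shared filter, a conv2d output channel is exactly the sum of the depthwise outputs over the input channels, which is the "reduction" step mentioned below.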

For example,

input_size: (_, 14, 14, 32)
filter of conv2d: (3, 3, 32, 64)
params of conv2d filter: 3x3x32x64
filter of depthwise_conv2d: (3, 3, 32, 64)
params of depthwise_conv2d filter: 3x3x32x64

Assuming stride = 1 and SAME padding, then

output of conv2d: (_, 14, 14, 64)
output of depthwise_conv2d: (_, 14, 14, 32*64)

Additional information:

  • The standard convolution operation can be split into two steps: depthwise convolution and a reduction (sum) across channels.
  • Depthwise convolution is equivalent to grouped convolution with the number of groups set equal to the number of channels.
  • Usually depthwise_conv2d is followed by pointwise_conv2d (a 1x1 convolution, used to mix channels and change their number), together forming separable_conv2d. For details, see Xception and MobileNet.
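As a rough sketch of that depthwise-then-pointwise pipeline (numpy, shapes only, using the 14x14x32 sizes from the earlier example, with a hypothetical channel_multiplier of 1 and 64 output channels):

```python
import numpy as np

H, W, c_in, mult, c_out = 14, 14, 32, 1, 64

# Output of the depthwise step (spatial size kept, as with SAME padding):
# one feature map per input channel times the multiplier.
x_dw = np.random.randn(H, W, c_in * mult)

# Pointwise step: a 1x1 convolution is just a per-pixel matrix multiply
# that mixes the channels the depthwise step left separate.
pw_filter = np.random.randn(c_in * mult, c_out)
y = (x_dw.reshape(-1, c_in * mult) @ pw_filter).reshape(H, W, c_out)
print(y.shape)  # (14, 14, 64)

# Parameter count vs. a plain 3x3 conv2d with the same in/out channels:
standard = 3 * 3 * c_in * c_out                        # 18432
separable = 3 * 3 * c_in * mult + c_in * mult * c_out  # 288 + 2048 = 2336
print(standard, separable)  # 18432 2336
```

The parameter saving is the usual motivation given for separable convolutions in MobileNet-style architectures.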


tf.nn.depthwise_conv2d means applying channel_multiplier different filters to each input channel, so the output has in_channels * channel_multiplier output channels. Run the code below and you will see.

import tensorflow as tf
import numpy as np

# Input image of shape 10x10 with 3 channels;
# one 10x10 filter per input channel, with channel_multiplier = 8.
N_in_channel = 3
N_out_channel_mul = 8
x = tf.random_normal([1, 10, 10, N_in_channel])
f = tf.random_normal([10, 10, N_in_channel, N_out_channel_mul])
# A 10x10 filter on a 10x10 input with VALID padding collapses the spatial
# dimensions, leaving N_in_channel * N_out_channel_mul = 24 output channels.
y = tf.nn.depthwise_conv2d(x, f, strides=[1, 1, 1, 1], padding="VALID", data_format="NHWC")

sess = tf.Session()
x_data, f_data, y_conv = sess.run([x, f, y])

# Output channel i * N_out_channel_mul + j depends only on input channel i
# and filter slice f[:, :, i, j].
y_s = np.squeeze(y_conv)
for i in range(N_in_channel):
    for j in range(N_out_channel_mul):
        print("np: %f, tf: %f" % (np.sum(x_data[0, :, :, i] * f_data[:, :, i, j]),
                                  y_s[i * N_out_channel_mul + j]))


