Difference between tf.nn.conv2d and tf.nn.depthwise_conv2d
I'm not an expert on this, but as far as I understand, the difference is this:
Let's say you have a color input image with a height of 100 and a width of 100, so the dimensions are 100x100x3. For both examples we use a filter of width and height 5, and suppose we want the next layer to have a depth of 8.
In tf.nn.conv2d, you define the kernel shape as [width, height, in_channels, out_channels]. In our case this means the kernel has the shape [5,5,3,out_channels]. The 5x5x3 weight kernel slides over the image, and it produces 8 different feature maps by scanning the entire image 8 times.
In tf.nn.depthwise_conv2d, you define the kernel shape as [width, height, in_channels, channel_multiplier]. Now the output is produced differently: separate 5x5x1 filters slide over each input channel, one filter per channel, each producing one feature map per channel. So a kernel of shape [5,5,3,1] generates an output with a depth of 3. The channel_multiplier tells you how many different filters you want to apply per channel. So the desired output depth of 8 is not possible with 3 input channels; only multiples of 3 are possible.
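To make the shape bookkeeping concrete, here is a minimal NumPy sketch of both operations (plain loops, "VALID" padding, stride 1, and a 10x10 image instead of 100x100 to keep the loops fast); tf.nn.conv2d and tf.nn.depthwise_conv2d are of course the real, optimized implementations:

```python
import numpy as np

H = W = 10          # small image for illustration
K = 5               # filter width/height
in_channels = 3

def conv2d_valid(image, kernel):
    # kernel shape (K, K, in_channels, out_channels):
    # every output channel sums over all input channels.
    out_h, out_w = H - K + 1, W - K + 1
    out = np.zeros((out_h, out_w, kernel.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            # contract the (K, K, in_channels) patch against the kernel
            out[i, j, :] = np.tensordot(image[i:i+K, j:j+K, :], kernel, axes=3)
    return out

def depthwise_conv2d_valid(image, kernel):
    # kernel shape (K, K, in_channels, channel_multiplier):
    # each input channel is filtered on its own; nothing is summed across channels.
    out_h, out_w = H - K + 1, W - K + 1
    mul = kernel.shape[3]
    out = np.zeros((out_h, out_w, in_channels * mul))
    for c in range(in_channels):
        for q in range(mul):
            for i in range(out_h):
                for j in range(out_w):
                    out[i, j, c * mul + q] = np.sum(
                        image[i:i+K, j:j+K, c] * kernel[:, :, c, q])
    return out

img = np.random.rand(H, W, in_channels)
print(conv2d_valid(img, np.random.rand(K, K, in_channels, 8)).shape)            # (6, 6, 8)
print(depthwise_conv2d_valid(img, np.random.rand(K, K, in_channels, 1)).shape)  # (6, 6, 3)
```

Note how the depthwise output depth is always in_channels * channel_multiplier: a depth of 8 is simply unreachable from 3 input channels.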
Check out the sample code in the TensorFlow API (r1.7).

For depthwise_conv2d:

output[b, i, j, k * channel_multiplier + q] = sum_{di, dj} input[b, strides[1] * i + rate[0] * di, strides[2] * j + rate[1] * dj, k] * filter[di, dj, k, q]

with filter of shape [filter_height, filter_width, in_channels, channel_multiplier].

For conv2d:

output[b, i, j, k] = sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] * filter[di, dj, q, k]

with filter of shape [filter_height, filter_width, in_channels, out_channels].

By focusing on k and q, we can see the difference shown above. The default data format is NHWC, where b is the batch size and (i, j) are the coordinates in the feature map. (Note that k and q refer to different things in the two functions.)
- For depthwise_conv2d, k refers to the input channel and q, 0 <= q < channel_multiplier, refers to the output channel. Each input channel k expands into output channels k * channel_multiplier + q using different filters of shape [filter_height, filter_width, channel_multiplier]. It performs no cross-channel combination; in some literature it is called channel-wise spatial convolution. The process can be summarized as applying each filter's kernels separately to each channel and concatenating the outputs.
- For conv2d, k refers to the output channel and q refers to the input channel. The sum runs over all input channels, meaning each output channel k is connected to all input channels q through a filter of shape [filter_height, filter_width, in_channels].
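The easiest way to see this channel bookkeeping is with 1x1 filters, where conv2d collapses to a matrix multiply over the channel axis while depthwise_conv2d is a per-channel elementwise scale. A small sketch (the array names here are illustrative, not TF API names):

```python
import numpy as np

rng = np.random.default_rng(0)
pixel = rng.random(3)            # one spatial position, 3 input channels
w_conv = rng.random((3, 8))      # filter[0, 0, q, k]: in_channels x out_channels
w_dw = rng.random((3, 2))        # filter[0, 0, k, q]: in_channels x channel_multiplier

# conv2d: output[k] = sum_q pixel[q] * w_conv[q, k]  -> mixes all input channels
conv_out = pixel @ w_conv

# depthwise: output[k * 2 + q] = pixel[k] * w_dw[k, q]  -> no mixing across channels
dw_out = (pixel[:, None] * w_dw).ravel()

print(conv_out.shape, dw_out.shape)  # (8,) (6,)
```

Every conv2d output depends on all three input channels; every depthwise output depends on exactly one.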
For example,
input_size: (_, 14, 14, 32)
filter of conv2d: (3, 3, 32, 64)
params of conv2d filter: 3x3x32x64
filter of depthwise_conv2d: (3, 3, 32, 64)
params of depthwise_conv2d filter: 3x3x32x64
suppose stride = 1 with "SAME" padding, then
output of conv2d: (_, 14, 14, 64)
output of depthwise_conv2d: (_, 14, 14, 32*64)
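The arithmetic behind this example: both filters hold the same number of weights because they have the same shape, but the last dimension means out_channels for conv2d and channel_multiplier for depthwise_conv2d, so the output depths differ widely:

```python
# Same filter shape (3, 3, 32, 64) for both ops, interpreted differently.
k_h, k_w, in_c, last = 3, 3, 32, 64

conv_params = k_h * k_w * in_c * last   # last = out_channels
dw_params = k_h * k_w * in_c * last     # last = channel_multiplier

conv_out_channels = last                # 64
dw_out_channels = in_c * last           # 32 * 64 = 2048

print(conv_params, dw_params)              # 18432 18432
print(conv_out_channels, dw_out_channels)  # 64 2048
```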
Additional Information:
- The standard convolution operation can be split into two steps: a depthwise convolution and a reduction (sum) across channels.
- A depthwise convolution is equivalent to a grouped convolution with the number of groups set equal to the number of channels.
- Usually depthwise_conv2d is followed by pointwise_conv2d (a 1x1 convolution that mixes the channels), together forming separable_conv2d. For details, see Xception and MobileNet.
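This factorization is exactly where the parameter savings of MobileNet-style separable convolutions come from. A quick count for a 3x3 kernel going from 32 to 64 channels (depthwise with channel_multiplier = 1, then a 1x1 pointwise convolution):

```python
# Parameter cost: standard convolution vs depthwise + pointwise factorization.
k, in_c, out_c = 3, 32, 64

standard = k * k * in_c * out_c     # 3*3*32*64 = 18432
depthwise = k * k * in_c * 1        # channel_multiplier = 1: 288
pointwise = 1 * 1 * in_c * out_c    # 1x1 mixing across channels: 2048
separable = depthwise + pointwise   # 2336

print(standard, separable)          # 18432 2336  (roughly 8x fewer parameters)
```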
tf.nn.depthwise_conv2d applies channel_multiplier different filters to each input channel. The output will have in_channels * channel_multiplier output channels. Run the code and you will see.
import tensorflow as tf
import numpy as np

# input image with 10x10 shape for 3 channels
# one 10x10 filter for each input channel
N_in_channel = 3
N_out_channel_mul = 8
x = tf.random_normal([1, 10, 10, N_in_channel])
f = tf.random_normal([10, 10, N_in_channel, N_out_channel_mul])
y = tf.nn.depthwise_conv2d(x, f, strides=[1, 1, 1, 1], padding="VALID", data_format="NHWC")

sess = tf.Session()
sess.run(tf.global_variables_initializer())
x_data, f_data, y_conv = sess.run([x, f, y])

# VALID padding with a filter the size of the image leaves a single spatial
# position, so squeezing gives a vector of N_in_channel * N_out_channel_mul values
y_s = np.squeeze(y_conv)
for i in range(N_in_channel):
    for j in range(N_out_channel_mul):
        print("np: %f, tf: %f" % (np.sum(x_data[0, :, :, i] * f_data[:, :, i, j]),
                                  y_s[i * N_out_channel_mul + j]))