Difference between tf.nn.conv2d and tf.nn.depthwise_conv2d
I'm not an expert on this, but as far as I understand, the difference is this:
Let's say you have a color input image with a height of 100 and a width of 100, so the dimensions are 100x100x3. For both examples we use a filter of width and height 5, and suppose we want the next layer to have a depth of 8.
In tf.nn.conv2d, you define the kernel shape as [width, height, in_channels, out_channels]. In our case this means the kernel has the shape [5,5,3,out_channels]. The 5x5x3 weight kernel slides over the image, and it produces 8 different feature maps by scanning the entire image 8 times.
In tf.nn.depthwise_conv2d, you define the kernel shape as [width, height, in_channels, channel_multiplier]. Now the output is produced differently: separate 5x5x1 filters slide over each input channel, one filter per channel, each producing one feature map per channel. So a kernel of shape [5,5,3,1] generates an output with a depth of 3. The channel_multiplier tells you how many different filters you want to apply per channel. So the desired output depth of 8 is not possible with 3 input channels; only multiples of 3 are possible.
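To make the shape bookkeeping concrete, here is a minimal NumPy sketch of both operations (plain loops, "VALID" padding, stride 1, and a 10x10 image instead of 100x100 to keep the loops fast); tf.nn.conv2d and tf.nn.depthwise_conv2d are of course the real, optimized implementations:

```python
import numpy as np

H = W = 10          # small image for illustration
K = 5               # filter width/height
in_channels = 3

def conv2d_valid(image, kernel):
    # kernel shape (K, K, in_channels, out_channels):
    # every output channel sums over all input channels.
    out_h, out_w = H - K + 1, W - K + 1
    out = np.zeros((out_h, out_w, kernel.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            # contract the (K, K, in_channels) patch against the kernel
            out[i, j, :] = np.tensordot(image[i:i+K, j:j+K, :], kernel, axes=3)
    return out

def depthwise_conv2d_valid(image, kernel):
    # kernel shape (K, K, in_channels, channel_multiplier):
    # each input channel is filtered on its own; nothing is summed across channels.
    out_h, out_w = H - K + 1, W - K + 1
    mul = kernel.shape[3]
    out = np.zeros((out_h, out_w, in_channels * mul))
    for c in range(in_channels):
        for q in range(mul):
            for i in range(out_h):
                for j in range(out_w):
                    out[i, j, c * mul + q] = np.sum(
                        image[i:i+K, j:j+K, c] * kernel[:, :, c, q])
    return out

img = np.random.rand(H, W, in_channels)
print(conv2d_valid(img, np.random.rand(K, K, in_channels, 8)).shape)            # (6, 6, 8)
print(depthwise_conv2d_valid(img, np.random.rand(K, K, in_channels, 1)).shape)  # (6, 6, 3)
```

Note how the depthwise output depth is always in_channels * channel_multiplier: a depth of 8 is simply unreachable from 3 input channels.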
Check out the sample code in the TensorFlow API (r1.7).

For depthwise_conv2d:

output[b, i, j, k * channel_multiplier + q] = sum_{di, dj} input[b, strides[1] * i + rate[0] * di, strides[2] * j + rate[1] * dj, k] * filter[di, dj, k, q]

with filter of shape [filter_height, filter_width, in_channels, channel_multiplier].

For conv2d:

output[b, i, j, k] = sum_{di, dj, q} input[b, strides[1] * i + di, strides[2] * j + dj, q] * filter[di, dj, q, k]

with filter of shape [filter_height, filter_width, in_channels, out_channels].

By focusing on k and q, we can see the difference shown above. The default data format is NHWC, where b is the batch size and (i, j) are the coordinates in the feature map. (Note that k and q refer to different things in the two functions.)
- For depthwise_conv2d, k refers to the input channel and q, 0 <= q < channel_multiplier, refers to the output channel. Each input channel k expands into output channels k * channel_multiplier + q using different filters of shape [filter_height, filter_width, channel_multiplier]. It performs no cross-channel combination; in some literature it is called channel-wise spatial convolution. The process can be summarized as applying each filter's kernels separately to each channel and concatenating the outputs.
- For conv2d, k refers to the output channel and q refers to the input channel. The sum runs over all input channels, meaning each output channel k is connected to all input channels q through a filter of shape [filter_height, filter_width, in_channels].
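The easiest way to see this channel bookkeeping is with 1x1 filters, where conv2d collapses to a matrix multiply over the channel axis while depthwise_conv2d is a per-channel elementwise scale. A small sketch (the array names here are illustrative, not TF API names):

```python
import numpy as np

rng = np.random.default_rng(0)
pixel = rng.random(3)            # one spatial position, 3 input channels
w_conv = rng.random((3, 8))      # filter[0, 0, q, k]: in_channels x out_channels
w_dw = rng.random((3, 2))        # filter[0, 0, k, q]: in_channels x channel_multiplier

# conv2d: output[k] = sum_q pixel[q] * w_conv[q, k]  -> mixes all input channels
conv_out = pixel @ w_conv

# depthwise: output[k * 2 + q] = pixel[k] * w_dw[k, q]  -> no mixing across channels
dw_out = (pixel[:, None] * w_dw).ravel()

print(conv_out.shape, dw_out.shape)  # (8,) (6,)
```

Every conv2d output depends on all three input channels; every depthwise output depends on exactly one.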
For example,
input_size: (_, 14, 14, 32)
filter of conv2d: (3, 3, 32, 64)
params of conv2d filter: 3x3x32x64
filter of depthwise_conv2d: (3, 3, 32, 64)
params of depthwise_conv2d filter: 3x3x32x64
suppose stride = 1 with "SAME" padding, then
output of conv2d: (_, 14, 14, 64)
output of depthwise_conv2d: (_, 14, 14, 32*64)
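The arithmetic behind this example: both filters hold the same number of weights because they have the same shape, but the last dimension means out_channels for conv2d and channel_multiplier for depthwise_conv2d, so the output depths differ widely:

```python
# Same filter shape (3, 3, 32, 64) for both ops, interpreted differently.
k_h, k_w, in_c, last = 3, 3, 32, 64

conv_params = k_h * k_w * in_c * last   # last = out_channels
dw_params = k_h * k_w * in_c * last     # last = channel_multiplier

conv_out_channels = last                # 64
dw_out_channels = in_c * last           # 32 * 64 = 2048

print(conv_params, dw_params)              # 18432 18432
print(conv_out_channels, dw_out_channels)  # 64 2048
```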
Additional Information:
- The standard convolution operation can be split into two steps: a depthwise convolution and a reduction (sum) across channels.
- A depthwise convolution is equivalent to a grouped convolution with the number of groups set equal to the number of channels.
- Usually depthwise_conv2d is followed by pointwise_conv2d (a 1x1 convolution that mixes the channels), together forming separable_conv2d. For details, see Xception and MobileNet.
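This factorization is exactly where the parameter savings of MobileNet-style separable convolutions come from. A quick count for a 3x3 kernel going from 32 to 64 channels (depthwise with channel_multiplier = 1, then a 1x1 pointwise convolution):

```python
# Parameter cost: standard convolution vs depthwise + pointwise factorization.
k, in_c, out_c = 3, 32, 64

standard = k * k * in_c * out_c     # 3*3*32*64 = 18432
depthwise = k * k * in_c * 1        # channel_multiplier = 1: 288
pointwise = 1 * 1 * in_c * out_c    # 1x1 mixing across channels: 2048
separable = depthwise + pointwise   # 2336

print(standard, separable)          # 18432 2336  (roughly 8x fewer parameters)
```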
tf.nn.depthwise_conv2d applies channel_multiplier different filters to each input channel. The output will have in_channels * channel_multiplier output channels. Run the code and you will see.
import tensorflow as tf
import numpy as np

# input image with 10x10 shape for 3 channels
# one 10x10 filter for each input channel
N_in_channel = 3
N_out_channel_mul = 8
x = tf.random_normal([1, 10, 10, N_in_channel])
f = tf.random_normal([10, 10, N_in_channel, N_out_channel_mul])
y = tf.nn.depthwise_conv2d(x, f, strides=[1, 1, 1, 1], padding="VALID", data_format="NHWC")

sess = tf.Session()
sess.run(tf.global_variables_initializer())
x_data, f_data, y_conv = sess.run([x, f, y])

# VALID padding with a filter the size of the image leaves a single spatial
# position, so squeezing gives a vector of N_in_channel * N_out_channel_mul values
y_s = np.squeeze(y_conv)
for i in range(N_in_channel):
    for j in range(N_out_channel_mul):
        print("np: %f, tf: %f" % (np.sum(x_data[0, :, :, i] * f_data[:, :, i, j]),
                                  y_s[i * N_out_channel_mul + j]))