Reducing filter size in convolutional neural network

I'm reading the paper by Szegedy et al. (https://arxiv.org/abs/1512.00567), and I'm having trouble understanding how they reduce computation by replacing one 5x5 filter with two layers of 3x3 filters (section 3.1).


Specifically, this passage:

If we would naively slide a network without reusing the computation between neighboring grid tiles, we would increase the computational cost. The sliding of this network can be represented by two 3x3 convolutional layers which reuses the activations between adjacent tiles.

I don't understand how we can reuse these activations.





1 answer


So, first of all, the author states the following:

This way, we end up with a net (9 + 9)/25 × reduction of computation, resulting in a relative gain of 28% by this factorization.

And he's right: a single 5x5 filter uses 25 (5*5) individual weights, while two 3x3 filters use 9 + 9 = 18 (3*3 + 3*3) individual weights. So two 3x3 filters require fewer parameters. However, you are correct that this alone does not mean less computation: at first glance, applying two 3x3 filters takes more operations.
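To make the counting concrete, here's a tiny Python sketch of the parameter arithmetic (my own illustration, not code from the paper, assuming a single input/output channel and no bias terms):

    # Weight counts for the two alternatives (one channel, no bias).
    params_5x5 = 5 * 5                  # 25 weights
    params_two_3x3 = 3 * 3 + 3 * 3      # 18 weights

    print(params_5x5, params_two_3x3)   # 25 18
    print(params_two_3x3 / params_5x5)  # 0.72, i.e. the (9 + 9)/25 factor -> 28% fewer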

Let's compare the number of operations for the two setups on a given n x n input, step by step:

  • Calculate the output size of the 5x5 filter on the given input ((n - filter_size + 1)^2 positions) and the corresponding number of operations
  • Calculate the output size of the first 3x3 filter (same formula as above) and its corresponding operations
  • Calculate the output size of the second 3x3 filter, applied to the output of the first, as well as its corresponding operations (these three steps are implemented in the sketch right after this list)
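Here is a small Python sketch of those three steps (again my own illustration, assuming "valid" convolutions with stride 1 and a single channel):

    def conv_output_side(n, k):
        # Side length of the output of a valid k x k convolution on an n x n input.
        return n - k + 1

    def ops_5x5(n):
        # One multiplication per weight per output position: 25 ops per position.
        out = conv_output_side(n, 5)
        return out * out * 25

    def ops_two_3x3(n):
        # First 3x3 layer on the n x n input, second 3x3 layer on its output.
        out1 = conv_output_side(n, 3)
        out2 = conv_output_side(out1, 3)
        return out1 * out1 * 9 + out2 * out2 * 9

    for n in (5, 6, 8):
        print(n, ops_5x5(n), ops_two_3x3(n))  # matches the walkthroughs below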

Let's start with a 5x5 input:

1. (5 - 5 + 1)^2 = 1x1. So 1*1*25 operations = 25 operations
2. (5 - 3 + 1)^2 = 3x3. So 3*3*9  operations = 81 operations
3. (3 - 3 + 1)^2 = 1x1. So 1*1*9  operations = 9  operations
So 25 vs. 90 operations. A single 5x5 filter is cheaper for a 5x5 input.

Next, a 6x6 input:

1. (6 - 5 + 1)^2 = 2x2. So 2*2*25 operations = 100 operations
2. (6 - 3 + 1)^2 = 4x4. So 4*4*9  operations = 144 operations
3. (4 - 3 + 1)^2 = 2x2. So 2*2*9  operations = 36  operations
So 100 vs. 180 operations. A single 5x5 filter is still cheaper for a 6x6 input.

Jumping ahead to an 8x8 input:

1. (8 - 5 + 1)^2 = 4x4. So 4*4*25 operations = 400 operations
2. (8 - 3 + 1)^2 = 6x6. So 6*6*9  operations = 324 operations
3. (6 - 3 + 1)^2 = 4x4. So 4*4*9  operations = 144 operations
So 400 vs. 468 operations. A single 5x5 filter is still cheaper for an 8x8 input.



Notice the pattern? Given an n x n input, the number of operations for the 5x5 filter follows this formula:

(n - 4)*(n - 4) * 25

And for the two 3x3 filters:

(n - 2)*(n - 2) * 9 + (n - 4)*(n - 4) * 9

Plotting these two formulas against n:

[graph: number of operations vs. input size n for one 5x5 filter and for two stacked 3x3 filters]

They intersect! As the graph shows, the two curves meet at n = 10 (both cost 900 operations there: 25 * 36 = 900 and 9 * 64 + 9 * 36 = 900), and from there on the two 3x3 filters are the cheaper option.
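If you'd rather not read it off the graph, a quick evaluation of the two formulas (same assumptions as the sketches above) shows the crossover directly:

    for n in range(5, 14):
        single = (n - 4) ** 2 * 25                    # one 5x5 filter
        double = (n - 2) ** 2 * 9 + (n - 4) ** 2 * 9  # two stacked 3x3 filters
        print(n, single, double)  # equal at n = 10 (900 vs 900); double wins after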

Conclusion: using two 3x3 filters is computationally effective from n = 10 onward. And regardless of n, two 3x3 filters always have fewer parameters to fit (18) than a single 5x5 filter (25).


On a final note, the paper is a little strange in that it treats the superiority of two 3x3 filters over a 5x5 filter as more or less "obvious":

This setup clearly reduces the parameter count by sharing the weights between adjacent tiles.

it seems natural to exploit translation invariance again and replace the fully connected component by a two layer convolutional architecture

If we would naively slide a network without reusing the computation between neighboring grid tiles, we would increase the computational cost.









