Is it OK to use batch normalization in an RNN / LSTM RNN?

I'm just starting out in deep learning. I know that ordinary feed-forward networks apply batch normalization before the activation, and that this reduces the reliance on good weight initialization. I wonder whether it does the same for RNNs / LSTM RNNs if I use it there. Does anyone have experience with this? Thanks.

+8




5 answers


No, you cannot use batch normalization on a recurrent neural network as-is: the statistics are computed per batch, and this does not take the recurrent part of the network into account. The weights are shared across time in an RNN, and the activations at each "recurrent step" can have completely different statistical properties.



Other normalization methods have been developed that take these constraints into account, such as layer normalization. There are also reparameterizations of the LSTM layer that do allow batch normalization, for example as described in Recurrent Batch Normalization by Cooijmans et al., 2016.
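
For example, here is a minimal Keras sketch (assuming TensorFlow 2.x; the layer sizes are arbitrary) of using layer normalization between stacked LSTM layers instead of batch normalization:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, 64)),                   # (time, features)
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.LayerNormalization(),               # per-sample, per-step; no batch statistics
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

Note this only normalizes the sequence outputs between the layers; the layer normalization paper applies it inside the recurrent cell to the summed inputs at every step, which requires a custom cell.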

+5




Batch normalization applied to RNNs is similar to batch normalization applied to CNNs: you compute the statistics in such a way that the recurrent / convolutional properties of the layer are still preserved after BN is applied.

For CNNs, this means computing the relevant statistics not only over the mini-batch but also over the two spatial dimensions; in other words, normalization is applied over the channel dimension.



For RNNs, this means computing the relevant statistics over the mini-batch and the time/step dimension, so normalization is applied only over the vector depth. This also means that you only normalize the transformed input (i.e. the vertical connections, for example BN(W_x * x)), since the horizontal (through-time) connections are time-dependent and shouldn't simply be averaged.
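
For concreteness, here is a minimal NumPy sketch of that idea for a vanilla RNN cell (the function and weight names are just illustrative, not from any library):

    import numpy as np

    def rnn_with_input_bn(x, W_x, W_h, gamma, beta, eps=1e-5):
        # x: (batch, time, in_dim); W_x: (in_dim, hid); W_h: (hid, hid)
        # gamma, beta: (hid,) learnable BN scale / shift.
        z = x @ W_x                                    # transformed input, (batch, time, hid)

        # Statistics over the mini-batch AND time dimensions, i.e. per feature.
        mean = z.mean(axis=(0, 1), keepdims=True)
        var = z.var(axis=(0, 1), keepdims=True)
        z_hat = gamma * (z - mean) / np.sqrt(var + eps) + beta

        # The recurrent ("horizontal") connections are left un-normalized.
        batch, time, _ = x.shape
        h = np.zeros((batch, W_h.shape[0]))
        outputs = []
        for t in range(time):
            h = np.tanh(z_hat[:, t] + h @ W_h)
            outputs.append(h)
        return np.stack(outputs, axis=1)               # (batch, time, hid)

At inference time you would use running averages for the mean and variance rather than the batch statistics; the sketch only shows the training-time normalization.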

+5




It is not commonly used, although I found a 2017 paper that shows a way to use batch normalization on the input-to-hidden and hidden-to-hidden transformations to train faster and improve results on some problems.

Also, check out Cross Validated for more machine-learning-oriented Q&A.

+2




In any non-recurrent network (fully connected or not), when you apply BN, each layer gets to adjust the incoming scale and mean, so the input distribution to each layer does not keep shifting (which is what the authors of the BN paper claim is the advantage of BN).

The problem with doing this for the recurrent outputs of an RNN is that the parameters of the incoming distribution are now shared across all time steps (which are effectively layers in back-propagation through time, or BPTT). So the distribution ends up being fixed across the temporal layers of BPTT. This may not be appropriate, because there may be structure in the data that varies (non-randomly) across the time series. For example, if the time series is a sentence, certain words are more likely to appear at the beginning or at the end. So a fixed distribution can reduce the effectiveness of BN.
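
As a toy illustration of that point (synthetic data, just to show the effect): if a feature's mean drifts over the sequence, the single pooled mean that time-shared BN statistics would use looks nothing like the per-step means.

    import numpy as np

    # Synthetic feature whose mean grows with the time step.
    rng = np.random.default_rng(0)
    batch, time = 256, 10
    x = rng.normal(loc=np.arange(time), scale=1.0, size=(batch, time))

    per_step_mean = x.mean(axis=0)   # roughly [0, 1, ..., 9]: real structure over time
    pooled_mean = x.mean()           # roughly 4.5: what time-shared statistics would use

    print(per_step_mean.round(1))
    print(round(float(pooled_mean), 1))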

+2




The answer is yes and no.

Why yes: the layer normalization paper has a section that explicitly discusses applying BN to RNNs.

Why no: the output distribution at each time step has to be stored and computed for BN. Imagine padding the sequence inputs so that all examples have the same length; if a test case is longer than all of the training sequences, then at some time step you will have no mean / standard deviation of the output distribution accumulated from the SGD training procedure.

Meanwhile, at least in Keras, I believe the BN layer only handles normalization in the vertical direction, i.e. the sequence output. The horizontal direction, i.e. hidden_state and cell_state, is not normalized. Correct me if I'm wrong here.
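
For instance, a minimal Keras sketch (assuming TF 2.x; sizes arbitrary) of that "vertical only" normalization:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, 32)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        # Normalizes the (batch, time, 64) sequence output per feature;
        # the LSTM's internal hidden_state / cell_state are untouched.
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1),
    ])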

In multi-layer RNNs, you can consider using layer normalization techniques instead.

0








