Significance of Auxiliary Output in a Multi-Input / Multi-Output Model Using a Deep Network

I am referring to the Keras documentation for building a model that accepts multiple inputs and produces multiple outputs (the multi-input / multi-output functional API example), along with some other important functionality. But I don't understand the exact effect of the auxiliary loss once we have already defined the main loss.

The guide adds an auxiliary loss so that the LSTM and Embedding layers can be trained smoothly, even though the main loss is computed much higher up in the model.

As mentioned in the docs, I assume this helps train the Embedding (or any other layer defined earlier in the network) smoothly. My question is how to determine the scale (weight) of this additional loss.

We compile the model and assign a weight of 0.2 to the auxiliary loss. To specify different loss_weights or loss values for each output, you can use a list or a dictionary.
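For concreteness, here is a minimal sketch of the kind of model I mean, modeled on the multi-input / multi-output example in the Keras functional API guide. The layer sizes and the names main_input, aux_input, main_output, aux_output are illustrative placeholders, not my real model:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Main branch: a sequence of word indices goes through an Embedding and an LSTM.
main_input = keras.Input(shape=(100,), dtype="int32", name="main_input")
x = layers.Embedding(input_dim=10000, output_dim=512)(main_input)
lstm_out = layers.LSTM(32)(x)

# Auxiliary output attached directly to the LSTM output, so its gradient
# reaches the Embedding/LSTM weights without passing through the deeper layers.
aux_output = layers.Dense(1, activation="sigmoid", name="aux_output")(lstm_out)

# A second input is concatenated and processed by a few more Dense layers.
aux_input = keras.Input(shape=(5,), name="aux_input")
x = layers.concatenate([lstm_out, aux_input])
x = layers.Dense(64, activation="relu")(x)
main_output = layers.Dense(1, activation="sigmoid", name="main_output")(x)

model = keras.Model(inputs=[main_input, aux_input],
                    outputs=[main_output, aux_output])

# The main loss gets full weight, the auxiliary loss is scaled down to 0.2,
# so it helps early training without dominating the total objective.
model.compile(
    optimizer="rmsprop",
    loss={"main_output": "binary_crossentropy",
          "aux_output": "binary_crossentropy"},
    loss_weights={"main_output": 1.0, "aux_output": 0.2},
)
```

The auxiliary output branches off right after the LSTM, and its loss is scaled by 0.2 in compile(); that weight is exactly what I am asking about.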

I would be very grateful if someone could explain how to choose the loss weights and how a higher or lower auxiliary loss weight affects model training and prediction.

1 answer


This is a really interesting question. The idea of auxiliary classifiers is not as unusual as one might think; it is used, for example, in the Inception architecture. In this answer, I'll try to give you some intuition as to why this setup might help your training:

  • It helps the gradient flow down to the lower layers: the loss defined for the auxiliary classifier is conceptually similar to the main loss, because both measure how good the model is. We can therefore expect the gradient with respect to the lower layers' weights to behave similarly for both losses. The vanishing-gradient phenomenon still occurs even with techniques like Batch Normalization, so every additional source of gradient can improve your training (see the sketch after this list).

  • It makes the lower-level features more meaningful: while we train the network, the information about how good the low-level features are and how they should change has to pass through all the other layers. This can lead not only to vanishing gradients but also, because the transformations a deep network computes can be really complex, to that feedback becoming irrelevant by the time it reaches the lower layers. This matters especially in the early stages of training, when most features are fairly random (due to random initialization) and the direction the weights are pushed in can be semantically bizarre. An auxiliary output helps overcome this, because it forces the lower-level features to become meaningful from a very early stage of training.

  • It can be seen as sensible regularization: you put a significant constraint on your model that can prevent overfitting, especially on small datasets.
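To make the idea concrete, here is a hedged sketch of attaching an Inception-style auxiliary classifier to an intermediate layer. This is my own toy CNN, not the architecture from the question or from the Inception paper; the layer sizes and the 0.3 weight are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical small CNN; the auxiliary head taps an intermediate feature map
# so that classification gradients reach the early conv layers more directly.
inputs = keras.Input(shape=(32, 32, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.Conv2D(64, 3, activation="relu")(x)
mid = layers.MaxPooling2D()(x)          # intermediate features

# Auxiliary classifier: a small head on the intermediate features.
a = layers.GlobalAveragePooling2D()(mid)
aux_out = layers.Dense(10, activation="softmax", name="aux")(a)

# Main branch continues deeper before the main classifier.
x = layers.Conv2D(128, 3, activation="relu")(mid)
x = layers.GlobalAveragePooling2D()(x)
main_out = layers.Dense(10, activation="softmax", name="main")(x)

model = keras.Model(inputs, [main_out, aux_out])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    loss_weights={"main": 1.0, "aux": 0.3},  # auxiliary weighted below the main loss
)
```

During training, both heads push the early convolutional features toward being discriminative; at inference time you would normally use only the main output.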



From what I wrote above, a few hints on how to set the auxiliary loss weight can be derived:

  • It helps if it is larger at the beginning of training (see the sketch after this list).
  • It should help propagate information through your network, but it shouldn't interfere with the main learning process either. So the rule of thumb that the deeper the auxiliary output is placed, the greater its loss weight should be, is fairly sensible.
  • If your dataset is not huge or the training time is not too long, you can try to actually tune it with some kind of hyperparameter optimization.
  • You must remember that your main loss is the most important one; even though auxiliary outputs can help, their loss weights should be relatively smaller than the main loss weight.
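As an illustration of the first hint, here is one hedged way to let the auxiliary weight start larger and then shrink: train in two phases and recompile with a smaller loss_weights value for the second phase. The arrays X_seq, X_extra and y are placeholders, the model is the two-output model sketched in the question, and the weights 0.5 and 0.1 are arbitrary examples; note that recompiling keeps the layer weights but creates a fresh optimizer (its state is reset):

```python
# Phase 1: larger auxiliary weight so the Embedding/LSTM get a strong, direct signal.
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              loss_weights={"main_output": 1.0, "aux_output": 0.5})
model.fit({"main_input": X_seq, "aux_input": X_extra},
          {"main_output": y, "aux_output": y},
          epochs=5, batch_size=64)

# Phase 2: let the main loss dominate once the lower layers are warmed up.
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              loss_weights={"main_output": 1.0, "aux_output": 0.1})
model.fit({"main_input": X_seq, "aux_input": X_extra},
          {"main_output": y, "aux_output": y},
          epochs=10, batch_size=64)
```

Whether such a schedule is worth the extra complexity over a single fixed weight (like the 0.2 in the docs) is something you would have to check empirically on your own data.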