Activation, Optimizer and Loss Function

  1. What are the problems with ReLU?

  • Exploding gradients (solved by gradient clipping)

  • Dying ReLU: no learning if the activation is 0 (solved by Leaky ReLU or Parametric ReLU; see the sketch after this list)

  • The mean and variance of the activations are not 0 and 1 (partially solved by subtracting around 0.5 from the activation)
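
Below is a minimal NumPy sketch of the dying-ReLU issue and how a leaky/parametric slope avoids it (the function names and the default `alpha` value are illustrative): ReLU's gradient is exactly zero for negative pre-activations, so a neuron stuck in that regime receives no learning signal, whereas Leaky/Parametric ReLU always passes some gradient.

```python
import numpy as np

def relu(x):
    # Standard ReLU: zero for all negative inputs.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is exactly 0 for negative pre-activations, so a neuron
    # that only ever sees negative inputs gets no gradient ("dying ReLU").
    return (x > 0).astype(x.dtype)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU keeps a small slope `alpha` for negative inputs;
    # Parametric ReLU (PReLU) is the same idea with `alpha` learned.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # The gradient never vanishes completely, so the neuron can recover.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu_grad(x))        # [0. 0. 0. 1. 1.]  -> no signal for negatives
print(leaky_relu_grad(x))  # [0.01 0.01 0.01 1. 1.]
```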

2. What are the limitations of the Adam optimiser?

“While training with Adam helps in getting fast convergence, the resulting model will often have worse generalization performance than when training with SGD with momentum. Another issue is that even though Adam has adaptive learning rates its performance improves when using a good learning rate schedule. Especially early in the training, it is beneficial to use a lower learning rate to avoid divergence. This is because in the beginning, the model weights are random, and thus the resulting gradients are not very reliable. A learning rate that is too large might result in the model taking too large steps and not settling in on any decent weights. When the model overcomes these initial stability issues the learning rate can be increased to speed up convergence. This process is called learning rate warm-up, and one version of it is described in the paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.” — from iprally
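
As a rough illustration of the warm-up idea described above, here is a minimal sketch of a linear learning-rate warm-up, assuming PyTorch; the toy model, the `warmup_steps` value and the `warmup_factor` helper are illustrative choices, not taken from the quoted paper.

```python
import torch

# Toy model, data and optimizer; the warm-up idea is independent of these.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 1000  # illustrative value

def warmup_factor(step):
    # Scale the base learning rate linearly from ~0 up to 1.0 over
    # `warmup_steps`, then keep it constant (a real schedule would
    # usually decay it afterwards).
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for step in range(2000):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the warm-up after each optimizer step
```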

3. How is AdamW different from Adam?

AdamW is Adam with decoupled weight decay. In plain Adam, L2 regularisation is added to the gradient and therefore gets rescaled by the adaptive per-parameter learning rates; AdamW instead applies the decay directly to the weights, outside the adaptive update. The motivation is the same in both cases: models with smaller weights tend to generalise better (see the sketch below).
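
A minimal single-tensor sketch contrasting the two update rules, following the decoupled weight decay idea from the AdamW paper; the hyperparameter defaults and function names here are illustrative, and the schedule multiplier from the paper is omitted.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2):
    # "Adam + L2": the decay term is folded into the gradient, so it
    # gets rescaled by the adaptive denominator like everything else.
    g = grad + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    # AdamW: weight decay is decoupled and applied directly to the
    # weights, outside the adaptive update.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

# One illustrative step on a small weight vector.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
w, m, v = adamw_step(w, grad=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)
```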

4.
