Title: Stochastic Gradient Descent with Multiplicative Noise
Stochastic gradient descent (SGD) is the main optimization algorithm behind the success of deep learning. Recently, it is shown that the stochastic noise in SGD is multiplicative, i.e., the strength of the noise crucially depends on the model parameter. In this talk, we show that the dynamics of SGD can be very surprising and unintuitive when the noise is multiplicative. For example, we show that (1) SGD may converge to a local maximum; (2) SGD may escape a saddle point arbitrarily slowly; (3) SGD may prefer sharp minima over the flat ones; and (4) AMSGrad may converge to a local maximum. If time allows, we also present some recent results that shed light on how SGD works under the multiplicative noise. This presentation is mainly based on the following three works of the speaker.
Liu Ziyin. http://cat.phys.s.u-tokyo.ac.jp/~zliu/