Abstract
Date and Time: November 22nd, 17:00 – 18:00 (JST)
Venue: Hybrid at
Open Space, Nihonbashi office, RIKEN-AIP
Nihonbashi 1-chome Mitsui Building, 15th floor, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
OR
Zoom: After registering, you will receive a Zoom link.
Language: English
Speaker: Prof. Jason Lee
(Associate professor, Electrical Engineering and Computer Science (secondary), Princeton University)
Title: Feature Learning in Deep Learning
Abstract
We focus on the task of learning a single index model sigma(w* · x) with respect to the isotropic Gaussian distribution in d dimensions, including the special case when sigma is the kth-order Hermite polynomial, which corresponds to the Gaussian analog of parity learning. Prior work has shown that the sample complexity of learning w* is governed by the information exponent k* of the link function sigma, which is defined as the index of the first nonzero Hermite coefficient of sigma. Prior upper bounds have shown that n > d^{k*-1} samples suffice for learning w* and that this is tight for online SGD (Ben Arous et al., 2020). However, the CSQ lower bound for gradient-based methods only shows that n > d^{k*/2} samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns w* with n > d^{k*/2} samples.
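To make the definition of the information exponent concrete, here is a minimal numerical sketch, not taken from the talk. It estimates the normalized Hermite coefficients E[sigma(z) He_k(z)] / sqrt(k!) for z ~ N(0, 1) by Gauss-Hermite quadrature and returns the first index with a nonzero coefficient. The function names, the tolerance, and the choice to start the search at k = 1 (skipping the constant term) are assumptions made for illustration.

```python
import math

import numpy as np
from numpy.polynomial import hermite_e as He  # probabilists' Hermite polynomials


def hermite_coefficients(sigma, max_degree=6, quad_points=64):
    """Estimate E[sigma(z) He_k(z)] / sqrt(k!) for z ~ N(0, 1), k = 0..max_degree."""
    nodes, weights = He.hermegauss(quad_points)
    # hermegauss integrates against exp(-z^2 / 2); rescale to the Gaussian density.
    weights = weights / math.sqrt(2.0 * math.pi)
    values = sigma(nodes)
    coeffs = []
    for k in range(max_degree + 1):
        basis = He.hermeval(nodes, [0.0] * k + [1.0])  # He_k at the quadrature nodes
        integral = float(np.sum(weights * values * basis))
        coeffs.append(integral / math.sqrt(math.factorial(k)))
    return np.array(coeffs)


def information_exponent(sigma, tol=1e-8, **kwargs):
    """Smallest k >= 1 whose Hermite coefficient exceeds tol (assumed convention)."""
    coeffs = hermite_coefficients(sigma, **kwargs)
    for k in range(1, len(coeffs)):
        if abs(coeffs[k]) > tol:
            return k
    return None  # no nonzero coefficient found up to max_degree


print(information_exponent(lambda z: z**3 - 3 * z))  # He_3, so the exponent is 3
print(information_exponent(np.tanh))                 # odd with a linear component: 1
```

For the third Hermite polynomial the exponent is 3, so the two rates quoted above are d^2 (for d^{k*-1}) and d^{1.5} (for d^{k*/2}).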
Next, we turn to the problem of learning multi index models f(x) = g(Ux), where U encodes a latent representation of low dimension. Significant prior work has established that neural networks trained by gradient descent behave like kernel methods, despite significantly worse empirical performance of kernel methods. However, in this work we demonstrate that for this large class of functions there is a large gap between kernel methods and gradient descent on a two-layer neural network, by showing that gradient descent learns representations relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form f*(x) = g(Ux) where U is an r-by-d matrix, so that f* depends on only r directions of the d-dimensional input. When the degree of f* is p, it is known that n ≍ d^p samples are necessary to learn f* in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to f*. This results in an improved sample complexity of n ≍ d^2 r + d r^p. Furthermore, in a transfer learning setup where the data distributions in the source and target domains share the same representation U but have different polynomial heads, we show that a popular heuristic for transfer learning has a target sample complexity independent of d.
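As a rough illustration of the gap between the two rates, the sketch below evaluates them for one concrete setting, reading the rates as d^p for the kernel regime and d^2 r + d r^p for gradient descent. Constants and logarithmic factors are ignored, and the function names and the values of d, r, and p are assumptions chosen only for the example.

```python
def kernel_rate(d, p):
    """Order of samples needed in the kernel regime: d^p (constants ignored)."""
    return d ** p


def gd_rate(d, r, p):
    """Order of samples quoted for gradient descent: d^2 r + d r^p (constants ignored)."""
    return d ** 2 * r + d * r ** p


d, r, p = 100, 3, 4  # hypothetical: 100 input dimensions, 3 relevant directions, degree 4
print(kernel_rate(d, p))  # 100000000
print(gd_rate(d, r, p))   # 38100
```

With these values the kernel-regime rate is several thousand times larger, and the gap widens as d grows with r and p fixed.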
Bio:
Jason Lee is an associate professor in Electrical Engineering and Computer Science (secondary) at Princeton University. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University, advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, representation learning, and reinforcement learning. He has received an NSF CAREER Award, an ONR Young Investigator Award in Mathematical Data Science, a Sloan Research Fellowship, and a NeurIPS Best Student Paper Award, and was a finalist for the Best Paper Prize for Young Researchers in Continuous Optimization.
Hosted by Deep Learning Theory Team
More Information
Date | November 22, 2023 (Wed) 17:00 - 18:00 |
URL | https://c5dc59ed978213830355fc8978.doorkeeper.jp/events/165846 |