## SignSGD: A Success Story in Bridging the Gap between Theory and Practice of Deep Learning

**Speaker: **Animashree Anandkumar (CalTech)

**Abstract: ** These are exciting times for deep learning where it continues to be deployed in the real world. However, theoretical understanding of deep learning is lagging behind. One way around this is to establish theoretical guarantees under certain assumptions and then to rigorously verify these assumptions in a number of empirical settings. We do this for signSGD: a distributed training algorithm, where workers only communicates the sign of the stochastic gradients to a parameter server and the server transmits back the majority vote. Thus, this algorithm uses 32 times less communication per iteration than distributed SGD. We show that signSGD is theoretically well-suited to the geometry of neural network error landscapes in the large and mini batch settings. We also prove that convergence for signSGD is robust when up to 50% of workers behave adversarially. On the practical side, we built our distributed training system in Pytorch. Comparing to the state-of-art collective communications library (NCCL), our framework with only a single parameter server reduced the time for training Resnet on Imagenet over 7 AWS P3.2x machines by 30% with almost no accuracy loss. In addition, we find that the assumptions made for our theoretical guarantees to hold are true in a wide range of empirical settings.

## Bounds on expressibility and training of neural networks

**Speaker: **Amitabh Basu (John Hopkins)

**Abstract: ** We will discuss depth v/s size trade-offs in the expressibility of neural networks with the ReLU activation gate. We show that depth can be enormously beneficial in the efficiency of expressing certain function using neural networks, often leading exponential (even superexponential) blow up in size. We will do this in the context of both real inputs and Boolean inputs. The latter has implications for the Boolean circuit complexity of linear threshold circuits, a challenging topic in the circuit complexity literature. We then discuss the computational complexity of training neural networks, with careful attention to the dependence on the size of the network and the dimension of the input. We end with an application of neural networks to the sparse coding problem. In particular, we prove that the loss function of an appropriate autoencoder has a critical point around the solution of the sparse coding problem. This provides some mathematical foundation for the effectiveness of using autoencoder training to solve sparse coding.

## Adversarial examples from computational constraints

**Speaker: **Sebastien Bubeck (MicroSoft Research)

**Abstract: **Why are classifiers in high dimension vulnerable to “adversarial” perturbations? We show that it is likely not due to information theoretic limitations, but rather it could be due to computational constraints. We construct a binary classification task in high dimensional space which is (i) information theoretically easy to learn robustly for large perturbations, (ii) efficiently learnable (non-robustly) by a simple linear separator, (iii) yet is not efficiently robustly learnable, even for small perturbations, by any algorithm in the statistical query (SQ) model. This example gives an exponential separation between classical learning and robust learning in the statistical query model. It suggests that adversarial examples may be an unavoidable byproduct of computational limitations of learning algorithms. I will also mention some further developments to go beyond the SQ model.

Joint work with Eric Price and Ilya Razenshteyn.

## On Adversarial Learning

**Speaker: **Lawrence Carin (Duke University)

**Abstract: **Generative adversarial networks (GANs) are considered from a fundamental statistical perspective, from which their utility and relationship to prior work is revealed. In addition to providing theoretical foundations, applications are considered in multiple settings.

## Gradient Descent with Random Initialization

**Speaker: **Jianqing Fan (Princeton)

**Abstract: **This talk contributes to the theoretical understanding of deep learning by considering global convergence for nonconvex phase retrieval. Specifically, we consider the problem of solving systems of quadratic equations, namely, recovering an object of interest x from m quadratic equations or samples y_i = (a^T_i x)^2, 1 ≤ i ≤ m. This problem, also dubbed as phase retrieval, spans multiple domains including physical sciences and machine learning. We investigate the efficiency of gradient descent (or Wirtinger flow) designed for the nonconvex least squares problem. We prove that under Gaussian designs, gradient descent — when randomly initialized — yields an ε-accurate solution in O (log n + log(1/ε) )iterations given nearly minimal samples, thus achieving near-optimal computational and sample complexities at once. This provides the first global convergence guarantee concerning vanilla gradient descent for phase retrieval, without the need of (i) carefully-designed initialization, (ii) sample splitting, or (iii) sophisticated saddle-point escaping schemes. All of these are achieved by exploiting the statistical models in analyzing optimization algorithms, via a leave-one-out approach that enables the decoupling of certain statistical dependency between the gradient descent iterates and the data.

(Joint work with Yuxin Chen, Yuejie Chi, and Cong Ma)

## Efficient Algorithms for Learning One Convolutional Layer

**Speaker: **Adam Klivans (UT Austin)

**Abstract: **We give the first provably efficient algorithm for learning a general class of one-layer convolutional networks. We prove that our framework captures commonly used schemes from computer vision, including one-dimensional and two-dimensional “patch and stride” convolutions. Prior work required strong assumptions on the patch structure of the network (typically non-overlapping patches).

Our algorithm– Convotron — is inspired by recent work applying isotonic regression to learning neural networks. Convotron uses a simple, iterative update rule that is stochastic in nature and tolerant to noise. In contrast to gradient descent, Convotron requires no special initialization or learning-rate tuning to converge to the global optimum.

## Accuracy of High-Dimensional Deep Learning Networks

**Speaker: **Jason Klusowski (Rutgers)

**Abstract: ** It has been experimentally observed in recent years that multi-layer artificial neural networks have a surprising ability to generalize, even when trained with far more parameters than observations. Is there a theoretical basis for this? The best available bounds on their metric entropy and associated complexity measures are essentially linear in the number of parameters, which is inadequate to explain this phenomenon. Here we examine the statistical risk (mean squared predictive error) of multi-layer networks with $\ell^1$-type controls on their parameters and with ramp activation functions (also called lower-rectified linear units). In this setting, the risk is shown to be upper bounded by $[(L^3 \log d)/n]^{1/2}$, where $d$ is the input dimension to each layer, $L$ is the number of layers, and $n$ is the sample size. In this way, the input dimension can be much larger than the sample size and the estimator can still be accurate, provided the target function has such $\ell^1$ controls and that the sample size is at least moderately large compared to $L^3\log d$. The heart of the analysis is the development of a sampling strategy that demonstrates the accuracy of a sparse covering of deep ramp networks. Lower bounds show that the identified risk is close to being optimal. This is joint work with Andrew R. Barron.

## Accelerated Stochastic Algorithms for Nonconvex Finite-sum and Multi-block Optimization

**Speaker: **George Lan (Georgia Tech)

**Abstract: **We present new stochastic methods for solving two important classes of nonconvex optimization problems. We first introduce a randomized accelerated proximal gradient (RapGrad) method for solving a class of nonconvex optimization problems whose objective function consists of the summation of m components, and show that it can significantly reduce the number of gradient computations especially when the condition number (i.e., the ratio between the Lipschitz constant and negative curvature) is large. Inspired by RapGrad, we also develop a new randomized accelerated proximal dual (RapDual) method for solving a class of multi-block nonconvex optimization problems coupled with linear constraints. We demonstrate that RapDual can also save significantly the number of block updates than its deterministic counterpart. To the best of our knowledge, all the complexity results associated with RapGrad and RapDual seem to be new in the literature. We also illustrate potential advantages of these algorithms through our preliminary numerical experiments.

## Towards Theoretical Understanding of Overparametrization in Deep Learning

**Speaker: **Jason Lee (USC)

**Abstract: **We provide new theoretical insights on why over-parametrization is effective in learning neural networks. For a k hidden node shallow network with quadratic activation and n training data points, we show as long as k>=sqrt(2n), over-parametrization enables local search algorithms to find a \emph{globally} optimal solution for general smooth and convex loss functions. Further, despite that the number of parameters may exceed the sample size, we show with weight decay, the solution also generalizes well.

Next, we analyze the implicit regularization effects of various optimization algorithms. In particular we prove that for least squares with mirror descent, the algorithm converges to the closest solution in terms of the bregman divergence. For linearly separable classification problems, we prove that the steepest descent with respect to a norm solves SVM with respect to the same norm. For over-parametrized non-convex problems such as matrix sensing or neural net with quadratic activation, we prove that gradient descent converges to the minimum nuclear norm solution, which allows for both meaningful optimization and generalization guarantees.

This is joint work with Suriya Gunasekar, Mor Shpigel, Daniel Soudry, Nati Srebro, and Simon Du.

## Prediction with Confidence – a General Framework for Predictive Inference and Uncertainty Quantification

**Speaker: **Regina Y. Liu (Rutgers)

**Abstract: **We propose a general framework for prediction in which a prediction is in the form of a distribution function, called ‘predictive distribution function’. This predictive distribution function is well suited to prescribing the notion of confidence under the frequentist interpretation, and it can provide meaningful answers for prediction-related questions. Its very form of a distribution function also lends itself as a useful tool for quantifying uncertainty in prediction. A general approach under this framework is formulated and illustrated using the so-called confidence distributions (CDs). This CD-based prediction approach inherits many desirable properties of CD, including its capacity to serve as a common platform for directly connecting the existing procedures of predictive inference in Bayesian, fiducial and frequentist paradigms. We discuss the theory underlying the CD-based predictive distribution and related efficiency and optimality issues. We also propose a simple yet broadly applicable Monte-Carlo algorithm for implementing the proposed approach. This concrete algorithm together with the proposed definition and associate theoretical development provide a comprehensive statistical inference framework for prediction. Finally, the approach is demonstrated by simulation studies and a real project on predicting the volume of application submissions to a government agency. The latter shows the applicability of the proposed approach to dependent data settings.

This is joint work with Jieli Shen, Goldman Sachs, and Minge Xie, Rutgers University.

## Two results on generalization error

**Speaker: **Po-Ling Loh (Wisconsin)

**Abstract: **We present two new results bounding the generalization error of empirical risk minimization algorithms. In the first work, we analyze the behavior of iterative algorithms in which noise is added on each iteration, such as stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC). We provide generalization error bounds on the final output of the algorithm, based on an inequality relating the mutual information between inputs and outputs to the overall generalization error of the algorithm.

In the second work, we present results on bounding the adversarial risk of an algorithm, defined as the expected value of the loss after each input is contaminated by a worst-case perturbation in a small ball around its true value. Our approach proceeds by upper-bounding the adversarial risk with the usual empirical risk of a transformed loss function. We illustrate this method and derive explicit bounds in the case of linear classifiers and neural network classifiers.

This is joint work with Ankit Pensia (UW-Madison), Varun Jog (UW-Madison), and Justin Khim (UPenn).

## On the Generalization and Approximability in Generative Adversarial Networks (GANs)

**Speaker: **Tengyu Ma (Stanford)

**Abstract: **Generative Adversarial Networks (GANs) have become one of the dominant methods for fitting generative models to complicated real-life data. In this talk, we will first introduce the basics of GANs and then discuss the fundamental statistical question about GANs — assuming the training can succeed with polynomial samples, can we have any statistical guarantees for the estimated distributions? In the work with Arora, Ge, Liang, and Zhang, we suggested a dilemma: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode collapse. Such a conundrum may be solved or alleviated by designing discriminator class with strong distinguishing power against the particular generator class (instead of against all possible generators.)

## Machine Learning and Security: The Good, the Bad, and the Hopeful

**Speaker: **Aleksander Madry (MIT)

**Abstract: ** Machine learning has made a tremendous progress over the last decade. In fact, many believe now that ML techniques are a “silver bullet”, capable of making progress on any real-world problem they are applied to.

But is that really so?

In this talk, I will discuss a major difficulty in the real-world deployment of ML: making our ML solutions be robust and secure. After briefly surveying some of the key challenges in this context, I will focus on one of the most pressing issues: the widespread vulnerability of state-of-the-art deep learning models to adversarial misclassification (aka adversarial examples). I will describe a framework that enables us to reason about this vulnerability in a principled manner as well as develop methods for alleviating the problem it poses.

## A Mean Field View of the Landscape of Two-Layer Neural Networks

**Speaker: **Andrea Montanari (Stanford)

**Abstract: **Multi-layer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires to optimize a non-convex high-dimensional objective (risk function), a problem which is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case, does this happen because local minima are absent, or because SGD somehow avoids them? In the second, why do local minima reached by SGD have good generalization properties?

We consider a simple case, namely two-layers neural networks, and prove that –in a suitable scaling limit– SGD dynamics is captured by a certain non-linear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples, and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows to ‘average-out’ some of the complexities of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.

## Simple classification schemes with binary data

**Speaker: **Deanna Needell (UCLA)

**Abstract: ** Binary, or one-bit, representations of data arise naturally in many applications, and are appealing in both hardware implementations and algorithm design. In this talk, we provide a brief background to sparsity and 1-bit measurements, and then present new results on the problem of data classification from binary data that proposes a framework with low computation and resource costs. We illustrate the utility of the proposed approach through stylized and realistic numerical experiments, provide a theoretical analysis for a simple case, several variations of the basic approach, and discuss future directions.

## Statistical theory for deep ReLU networks

**Speaker: **Johannes Schmidt-Hieber (Leiden University)

**Abstract: **We derive risk bounds for fitting deep neural networks to data generated from the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to logarithmic factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential parameters being much bigger than the sample size. We also discuss some theoretical results that compare the performance to other methods such as wavelets and spline-type methods. This is joint work with K. Eckle (Leiden).

## The Singular Values of Convolutional Layers

**Speaker: **Hanie Sedghi (Google)

**Abstract: **We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2\% to 5.3\%.

## Optimization Landscape of Neural Networks: Where Do the Local Minima Hide?

**Speaker: **Ohad Shamir (Weizmann Institute of Science)

**Abstract: **Training neural networks is a highly non-convex optimization problem, which is often successfully solved in practice, but the reasons for this are poorly understood. Much recent work has focused on showing that these non-convex problems do not suffer from poor local minima. However, this has only been provably shown under strong assumptions or in highly restrictive settings. In this talk, I’ll describe some recent results on this topic, both positive and negative. On the negative side, I’ll show how local minima can be ubiquitous even when optimizing simple, one-hidden-layer networks under favorable data distributions. On the flip side, I’ll discuss how looking at other architectures (such as residual units), or modifying the question, can lead to positive results under mild assumptions.

## Diverse Neural Network Learns True Target Functions

**Speaker: **Le Song (Georgia Tech)

**Abstract: **Neural networks are a powerful class of functions that can be trained with simple gradient descent to achieve state-of-the-art performance on a variety of applications. Despite their practical success, there is a paucity of results that provide theoretical guarantees on why they are so effective. Lying in the center of the problem is the difficulty of analyzing the non-convex loss function with potentially numerous local minima and saddle points. Can neural networks corresponding to the stationary points of the loss function learn the true target function? If yes, what are the key factors contributing to such nice optimization properties?

We answer these questions by analyzing one-hidden-layer neural networks with ReLU activation, and show that despite the non-convexity, neural networks with diverse units have no spurious local minima. We bypass the non-convexity issue by directly analyzing the first order optimality condition, and show that the loss can be made arbitrarily small if the minimum singular value of the “extended feature matrix” is large enough. We make novel use of techniques from kernel methods and geometric discrepancy, and identify a new relation linking the smallest singular value to the spectrum of a kernel function associated with the activation function and to the diversity of the units. Our results also suggest a novel regularization function to promote unit diversity for potentially better generalization.

## On the Complexity of Training a Neural Network

**Speaker: **Santosh S Vempala (Georgia Tech)

**Abstract: ** The empirical successes of deep learning currently lack rigorous explanation and the effectiveness of gradient descent in particular is still mostly a mystery. We focus on training neural networks with a single hidden layer of unbiased activation units and find a (surprisingly?) general polynomial-time analysis of gradient descent, exploiting new connections with tools from spherical harmonics. On the other hand, when the target function is generated by a network using arbitrary biases, the problem is intractable. We construct a family of simple neural networks with a single hidden layer, a smooth activation function and benign input distributions that is hard to learn in the sense that any statistical query algorithm (including all known variants of stochastic gradient descent with any loss function, for any model architecture) needs an exponential number of queries. We will discuss a few open problems.

Joint work with John Wilmes (both parts), Le Song and Bo Xie (the lower bound).

## Neumann Networks for Inverse Problems in Imaging

**Speaker: **Rebecca Willett (Chicago)

**Abstract: **In many imaging reconstruction problems, we wish to estimate an image from linear projections, resulting in an ill-posed or underdetermined problem. Examples include deblurring, inpainting, compressed sensing reconstruction, and MRI reconstruction. Traditionally, images were estimated in an optimization framework with a pre-specified regularizer such as Tikhonov regularization or terms that promote sparsity in a known basis. More recent efforts focus on leveraging training images to learn a regularizer using deep neural networks. In this talk, I will illustrate how the sample complexity of many approaches in this vein is dramatically higher than necessary. I will then describe a new approach to learning to solve inverse problems with much lower sample complexity. Our method is inspired by the Neumann series expansion and has a novel network architecture that sidesteps bottlenecks associated with “unrolled” proximal gradient optimization frameworks and yields compelling empirical results in a variety of settings. This is joint work with Davis Gilton and Greg Ongie.

## Approximation Theory of Deep Convolutional Neural Networks

**Speaker: **Dingxuan Zhou (City U. of Hong Kong)

**Abstract: ** Deep learning has been widely applied and brought breakthroughs in speech recognition, computer vision, and many other domains. The involved deep neural network architectures and computational issues have been well studied in machine learning. But there lacks a theoretical foundation for understanding the approximation or generalization ability of deep learning methods with network architectures such as deep convolutional neural networks (CNNs) with convolutional structures. The convolutional architecture gives essential approximation theory differences between the deep CNNs and fully-connected deep neural networks, and the classical approximation theory of fully-connected networks developed around 30 years ago does not apply. This talk describes an approximation theory of deep CNNs. In particular, we show the universality of a deep CNN, meaning that it can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with large dimensional data. Some related distributed learning algorithms will also be discussed.