Description

**Speaker:** David Donoho

Department of Statistics, Stanford University

**Title:** Prevalence of Neural Collapse during the terminal phase of deep learning training

**Abstract:** Modern practice for training classification deepnets involves a Terminal Phase of Training (TPT), which begins at the epoch where training error first vanishes. During TPT, the training error stays effectively zero while training loss is pushed towards zero.
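As a rough illustration of the TPT definition above, the sketch below finds the first epoch at which a recorded training error reaches zero. The function name `tpt_start_epoch`, the `tol` parameter, and the example error values are hypothetical and are not taken from the talk or the paper.

```python
from typing import Optional, Sequence


def tpt_start_epoch(train_errors: Sequence[float], tol: float = 0.0) -> Optional[int]:
    """Return the index of the first epoch whose training error is <= tol.

    Returns None if the training error never reaches the tolerance,
    i.e. the run never enters the terminal phase under this definition.
    """
    for epoch, err in enumerate(train_errors):
        if err <= tol:
            return epoch
    return None


# Hypothetical per-epoch training error (fraction of misclassified examples).
errors = [0.31, 0.12, 0.04, 0.0, 0.0, 0.0]
start = tpt_start_epoch(errors)
print("TPT starts at epoch index:", start)
```

A small positive `tol` may be a reasonable choice when the error is only "effectively" zero, as the abstract puts it; the right threshold is an assumption left to the reader.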

Direct measurements of TPT, for three prototypical deepnet architectures and across seven canonical classification datasets, expose a pervasive inductive bias we call Neural Collapse, involving four deeply interconnected phenomena: (NC1) Cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class-means; (NC2) The class-means collapse to the vertices of a Simplex Equiangular Tight Frame (ETF); (NC3) Up to rescaling, the last-layer classifiers collapse to the class-means, or in other words to the Simplex ETF, i.e. to a self-dual configuration; (NC4) For a given activation, the classifier’s decision collapses to simply choosing whichever class has the closest train class-mean, i.e. the Nearest Class-Center (NCC) decision rule.
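The four phenomena can be probed numerically. The following NumPy sketch is a simplified illustration, not the measurement code used in the paper: the function names are hypothetical, the NC1 quantity is a plain within-class to between-class ratio rather than the paper's exact statistic, and the "activations" are synthetic points placed exactly on a Simplex ETF.

```python
import numpy as np


def simplex_etf(num_classes: int) -> np.ndarray:
    """Rows are C unit-norm Simplex ETF vertices, with pairwise cosine -1/(C-1)."""
    c = num_classes
    return np.sqrt(c / (c - 1)) * (np.eye(c) - np.ones((c, c)) / c)


def class_means(features: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Mean last-layer activation of each class, one row per class."""
    return np.stack([features[labels == k].mean(axis=0) for k in range(num_classes)])


def nc1_ratio(features, labels, num_classes):
    """NC1 proxy: within-class scatter divided by between-class scatter."""
    means = class_means(features, labels, num_classes)
    global_mean = features.mean(axis=0)
    within = np.mean(np.sum((features - means[labels]) ** 2, axis=1))
    between = np.mean(np.sum((means - global_mean) ** 2, axis=1))
    return within / between


def nc2_etf_deviation(features, labels, num_classes):
    """NC2 proxy: largest gap between the class-mean cosines and the ETF cosines."""
    c = num_classes
    centred = class_means(features, labels, c) - features.mean(axis=0)
    unit = centred / np.linalg.norm(centred, axis=1, keepdims=True)
    target = (c * np.eye(c) - np.ones((c, c))) / (c - 1)
    return np.max(np.abs(unit @ unit.T - target))


def nc3_self_duality_gap(weights, features, labels, num_classes):
    """NC3 proxy: distance between classifier rows and centred class-means,
    each rescaled to unit Frobenius norm."""
    centred = class_means(features, labels, num_classes) - features.mean(axis=0)
    return np.linalg.norm(weights / np.linalg.norm(weights)
                          - centred / np.linalg.norm(centred))


def ncc_predict(queries, means):
    """NC4: Nearest Class-Center rule, picking the class with the closest mean."""
    dists = np.linalg.norm(queries[:, None, :] - means[None, :, :], axis=2)
    return dists.argmin(axis=1)


# Synthetic, perfectly collapsed example: every activation equals its class vertex.
C = 4
labels = np.repeat(np.arange(C), 5)
collapsed = simplex_etf(C)[labels]
noisy = collapsed + 0.1 * np.random.default_rng(0).normal(size=collapsed.shape)

print("NC1 collapsed:", nc1_ratio(collapsed, labels, C))
print("NC1 noisy:    ", nc1_ratio(noisy, labels, C))
print("NC2 collapsed:", nc2_etf_deviation(collapsed, labels, C))
```

On the collapsed data all three proxies should sit at numerical zero and the NCC rule should reproduce the labels, while added noise pushes the NC1 ratio above zero. Tracking such quantities across epochs of a real training run is the kind of measurement the abstract describes.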

The symmetric and very simple geometry induced by the TPT confers important benefits, including better generalization performance, better robustness, and better interpretability.

This is joint work with Vardan Papyan (University of Toronto) and X.Y. Han (Cornell).

**Reference papers:**

Prevalence of neural collapse during the terminal phase of deep learning training

Papyan, V., Han, X.Y. and Donoho, D.L., 2020.

Proceedings of the National Academy of Sciences, 117(40), pp.24652-24663

arXiv:2008.08186

The optimised internal representation of multilayer classifier networks performs nonlinear discriminant analysis

AR Webb, D Lowe – Neural Networks, 1990 – Elsevier

This paper illustrates why a nonlinear adaptive feed-forward layered network with linear output units can perform well as a pattern classification device. The central result is that minimizing the error at the output of the network is equivalent to maximizing a particular …

The implicit bias of gradient descent on separable data

D Soudry, E Hoffer, MS Nacson, S Gunasekar… – arXiv:1710.10345

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also …

Risk and parameter convergence of logistic regression

Z Ji, M Telgarsky – arXiv preprint arXiv:1803.07300, 2018 – arxiv.org

Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal {O}(\ln\ln t/\ln t) $. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal {O}((\ln

Theory III: Dynamics and Generalization in Deep Networks–a simple solution

A Banburski, Q Liao, B Miranda, L Rosasco … – arXiv preprint 1903.04991, 2019 – arxiv.org

Classical generalization bounds for classification in the setting of separable data can be optimized by maximizing the margin of a deep network under the constraint of unit p-norm of the weight matrix at each layer. A possible approach for solving numerically this problem …

Gradient descent maximizes the margin of homogeneous neural networks

K Lyu, J Li – arXiv preprint arXiv:1906.05890, 2019 – arxiv.org

In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study the gradient descent or gradient flow (i.e., gradient descent with infinitesimal step size) optimizing the logistic loss or cross entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed …

Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra

Authors: Vardan Papyan

Abstract: Numerous researchers recently applied empirical spectral analysis to the study of modern deep learning classifiers. We identify and discuss an important formal class/cross-class structure and show how it lies at the origin of the many visually striking features observed in deepnet spectra, some of which were reported in recent articles, others are unveiled here for the first time….
