AAAI-22 Workshop: Learning Network Architecture During Training

at the 36th AAAI Conference on Artificial Intelligence (AAAI-22)



Workshop Date: February 28, 2022

AAAI-22 Workshops page



Workshop Description

A fundamental problem in the use of artificial neural networks is that the first step is to guess the network architecture. Fine-tuning a network architecture by hand is very time-consuming, and the result is often far from optimal. Hyperparameters such as the number of layers, the number of nodes in each layer, the pattern of connectivity, and the presence and placement of elements such as memory cells, recurrent connections, and convolutional elements are all manually selected. If the architecture turns out to be inappropriate for the task, the user must repeatedly adjust the architecture and retrain the network until an acceptable architecture has been obtained.

There is now a great deal of interest in finding better alternatives to this scheme. Options include pruning a trained network or training many networks automatically. In this workshop we focus on a contrasting approach: to learn the architecture during training. This topic encompasses forms of Neural Architecture Search (NAS) in which the performance properties of each architecture, after some training, are used to guide the selection of the next architecture to be tried. This topic also encompasses techniques that augment or alter a network as the network is trained. An example of the latter is the Cascade Correlation algorithm, as well as others that incrementally build or modify a neural network during training, as needed for the problem at hand.
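
To make the idea of growing a network during training concrete, here is a minimal, illustrative sketch in PyTorch. It is not the Cascade Correlation algorithm itself (the model below is retrained from scratch at each stage, with no weight freezing or correlation-trained candidate units); the data, thresholds, and growth rule are assumptions made purely for this example.

```python
# Minimal, illustrative sketch of growing a network during training (not the
# Cascade Correlation algorithm itself: here the model is retrained from scratch
# at each stage, and no weights are frozen). Data, thresholds, and the growth
# rule are assumptions made for this example.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 8)
y = (X[:, :1] * X[:, 1:2]).tanh()           # toy nonlinear regression target

def make_model(n_hidden):
    # 0 hidden units: a plain linear model; otherwise a one-hidden-layer MLP.
    if n_hidden == 0:
        return nn.Linear(8, 1)
    return nn.Sequential(nn.Linear(8, n_hidden), nn.Tanh(), nn.Linear(n_hidden, 1))

n_hidden, best = 0, float("inf")
for stage in range(6):                      # at most six growth stages
    model = make_model(n_hidden)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(300):                    # train the current architecture
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    print(f"hidden units = {n_hidden}, mse = {loss.item():.4f}")
    if best - loss.item() < 1e-3:           # no meaningful improvement: stop growing
        break
    best = loss.item()
    n_hidden += 1                           # otherwise, add a hidden unit and continue
```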

Main Objectives of the Workshop

Our goal is to build a stronger community of researchers exploring these methods, and to find synergies among these related approaches and alternatives. Eliminating the need to guess the right topology in advance of training is a prominent benefit of learning network architecture during training. Additional advantages are possible, including decreased computational resources to solve a problem, reduced time for the trained network to make predictions, reduced requirements for training set size, and avoiding “catastrophic forgetting.” We would especially like to highlight approaches that are qualitatively different from current popular, but computationally intensive, NAS methods.

As deep learning problems become increasingly complex, network sizes must increase and other architectural decisions become critical to success. Suboptimal architectural decisions often impose serious time and hardware costs on practitioners. The growing popularity of NAS methods demonstrates the community’s hunger for better ways of choosing or evolving network architectures that are well-matched to the problem at hand.

Topics

Topics include methods for learning network architecture during training, incremental construction of neural networks during training, and new performance benchmarks for such methods. Novel approaches, preliminary results, and works in progress are encouraged.

Schedule


Time (ET)    Session (Speaker)

10:00 AM     Welcome and Introduction (Scott Fahlman)
10:15 AM     Invited Talk: Training Neural Networks with Local Error Signals (Lars Eidnes)
11:00 AM     Invited Talk: Gradient Boosting Neural Networks: GrowNet (Sarkhan Badirli)
11:45 AM     Invited Talk: Two training sins: greedy and lazy? (Edouard Oyallon)
12:30 PM     Break
01:00 PM     An Approach for Efficient Neural Architecture Search Space Definition (Léo Pouy)
01:30 PM     Enhanced Exploration in Neural Feature Selection for Deep Click-Through Rate Prediction Models via Ensemble of Gating Layers (Lin Guan, pre-recorded)
02:00 PM     Approximate Bayesian Optimisation for Neural Networks (Nadhir Hassen, pre-recorded)
02:30 PM     AdaSearch: Many-to-One Unified Neural Architecture Search via A Smooth Curriculum (Chunhui Zhang)
03:00 PM     Proxyless Neural Architecture Adaptation for Supervised Learning and Self-Supervised Learning (Do-Guk Kim and Heung-Chang Lee, pre-recorded)
03:30 PM     Concluding Remarks



All times are US Eastern Time (ET, UTC-5).

Videos will be made available after the workshop.



Invited Talks



Training Neural Networks with Local Error Signals

Neural networks are usually trained by global backpropagation, where a gradient is backpropagated through the network. In this work, we present an alternative approach that instead trains networks layer-wise, without propagating the gradient between layers. We use single-layer sub-networks and two different supervised loss functions to generate local error signals for the hidden layers. Networks trained this way approach the state-of-the-art on a variety of image datasets. Using local errors could be a step towards more biologically plausible deep learning because the global error does not have to be transported back to hidden layers.

Lars Eidnes
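
As a rough illustration of the layer-wise idea, the snippet below gives each hidden layer its own auxiliary classifier and uses detach() so that no gradient crosses layer boundaries. It is a simplified, assumption-based sketch: the two supervised local losses described in the abstract are replaced here by a single cross-entropy loss per layer, and all sizes are made up for the example.

```python
# Simplified sketch of layer-wise training with local error signals: each hidden
# layer has its own linear classifier and local loss, and detach() prevents
# gradients from propagating between layers. Sizes and data are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(128, 20)
y = torch.randint(0, 4, (128,))              # toy 4-class problem

layers = nn.ModuleList([nn.Sequential(nn.Linear(20, 64), nn.ReLU()),
                        nn.Sequential(nn.Linear(64, 64), nn.ReLU())])
heads = nn.ModuleList([nn.Linear(64, 4), nn.Linear(64, 4)])   # local classifiers
opt = torch.optim.Adam(list(layers.parameters()) + list(heads.parameters()), lr=1e-3)

for epoch in range(50):
    opt.zero_grad()
    h = X
    for layer, head in zip(layers, heads):
        h = layer(h.detach())                # detach(): no gradient crosses layer boundaries
        local_loss = nn.functional.cross_entropy(head(h), y)
        local_loss.backward()                # each layer is updated from its own loss only
    opt.step()
```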





Gradient Boosting Neural Networks: GrowNet

A novel gradient boosting framework is proposed in which shallow neural networks are employed as “weak learners”. General loss functions are considered under this unified framework, with specific examples presented for classification, regression, and learning to rank. A fully corrective step is incorporated to remedy the pitfall of greedy function approximation in classic gradient boosting decision trees. The proposed model outperforms state-of-the-art boosting methods in all three tasks on multiple datasets. An ablation study is performed to shed light on the effect of each model component and of the model hyperparameters.

Sarkhan Badirli
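
The following is a minimal sketch of the general idea of gradient boosting with shallow neural networks as weak learners, restricted to squared-error regression. GrowNet's corrective step, ranking losses, and propagation of hidden features between learners are omitted, and all sizes and learning rates are assumptions made for illustration.

```python
# Rough sketch of gradient boosting with shallow neural networks as weak learners,
# restricted to squared-error regression. Each weak learner fits the current
# residual (the negative gradient of the squared error); shrinkage damps each step.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 10)
y = torch.sin(X[:, :1]) + 0.1 * torch.randn(512, 1)   # toy regression target

def fit_weak_learner(X, target, epochs=200):
    # A shallow MLP fitted to the current residual.
    net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(net(X), target).backward()
        opt.step()
    return net

ensemble, shrinkage = [], 0.3
pred = torch.zeros_like(y)
for stage in range(5):
    residual = y - pred                      # negative gradient of the squared error
    weak = fit_weak_learner(X, residual)
    ensemble.append(weak)
    with torch.no_grad():
        pred = pred + shrinkage * weak(X)    # shrinkage damps each boosting step
    print(f"stage {stage}: mse = {nn.functional.mse_loss(pred, y).item():.4f}")
```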





Two training sins: greedy and lazy?

The optimization of a typical deep neural network is difficult to analyze because the objective loss is non-convex and the objective of the intermediary layers is not specified at training time. I will present two ideas which challenge common intuitions about deep training procedures. The first is lazy training [1], a setting in which a deep network behaves similarly to its linearization at initialization (work with Lénaïc Chizat and Francis Bach). Secondly, I will discuss greedy training [2], a procedure in which the objective of each intermediary layer is explicitly specified (work with Eugene Belilovsky and Michael Eickenberg). Applications that remove the backward, update, or forward locks [3] at training time will be discussed, through the scope of a decoupled learning procedure [4].

Edouard Oyallon
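
For readers new to the lazy-training setting, the linearization referred to in the abstract above can be written as follows (our notation, added here only as an illustration; see [1] for the precise statement):

```latex
% Lazy training (illustrative notation): around its initialization \theta_0, the
% network f is approximated by its first-order Taylor expansion in the parameters,
% and in the lazy regime gradient descent on f stays close to gradient descent on
% this linear-in-\theta model.
\bar{f}(x;\theta) \;=\; f(x;\theta_0) \;+\; \nabla_\theta f(x;\theta_0)^{\top}\,(\theta - \theta_0)
```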



Papers Submitted and Accepted for the Workshop


An Approach for Efficient Neural Architecture Search Space Definition

As we advance in the fast-growing era of Machine Learning, various new and more complex neural architectures are arising to tackle problems more efficiently. On the one hand, their efficient usage requires advanced knowledge and expertise, which is most of the time difficult to find on the labor market. On the other hand, searching for an optimized neural architecture is a time-consuming task when it is performed manually using a trial-and-error approach. Hence, methods and tool support are needed to assist users of neural architectures, fueling the field of Automatic Machine Learning (AutoML). When it comes to Deep Learning, an important part of AutoML is Neural Architecture Search (NAS). In this paper, we propose a novel cell-based hierarchical search space that is easy to comprehend and manipulate. The objectives of the proposed approach are to optimize the search time and to be general enough to handle most state-of-the-art Convolutional Neural Network (CNN) architectures.

Presented by: Léo Pouy
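
As a purely illustrative sketch (not the paper's actual encoding), a cell-based hierarchical search space can be written as nested data that a search procedure samples from. The operation names and value ranges below are assumptions made for this example.

```python
# Purely illustrative sketch: a cell-based, hierarchical CNN search space written
# as nested Python data, plus a function that samples one concrete architecture
# from it. Operation names and value ranges are assumptions for this example.
import random

search_space = {
    "num_cells": [4, 8, 12],                  # how many cells are stacked
    "cell": {                                 # what a single cell may contain
        "num_blocks": [2, 3, 4],
        "block": {
            "op": ["conv3x3", "conv5x5", "depthwise_conv3x3", "max_pool3x3", "identity"],
            "channels": [16, 32, 64],
            "skip_connection": [True, False],
        },
    },
}

def sample_architecture(space):
    # Draw one architecture: a list of cells, each a list of concrete block choices.
    cells = []
    for _ in range(random.choice(space["num_cells"])):
        n_blocks = random.choice(space["cell"]["num_blocks"])
        cells.append([{k: random.choice(v) for k, v in space["cell"]["block"].items()}
                      for _ in range(n_blocks)])
    return cells

print(sample_architecture(search_space))
```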





Enhanced Exploration in Neural Feature Selection for Deep Click-Through Rate Prediction Models via Ensemble of Gating Layers

Feature selection has been an essential step in developing industry-scale deep Click-Through Rate (CTR) prediction systems. The goal of neural feature selection (NFS) is to choose a relatively small subset of features with the best explanatory power as a means to remove redundant features and reduce computational cost. Inspired by gradient-based neural architecture search (NAS) and network pruning methods, people have tackled the NFS problem with a gating approach that inserts a set of differentiable binary gates to drop less informative features. The binary gates are optimized along with the network parameters in an efficient end-to-end manner. In this paper, we analyze the gradient-based solution from an exploration-exploitation perspective and use empirical results to show that the gating approach might suffer from insufficient exploration. To improve the exploration capacity of the gradient-based solution, we propose a simple but effective ensemble learning approach, named Ensemble Gating. We choose two public datasets, namely Avazu and Criteo, to evaluate this approach. Our experiments show that, without adding any computational overhead or introducing any hyper-parameter (except the size of the ensemble), our method is able to consistently improve the gating approach and find a better subset of features on the two datasets with three different underlying deep CTR prediction models.

Presented by: Lin Guan
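
Below is a minimal sketch of the underlying gating idea, assuming a sigmoid relaxation of the binary gates and a sparsity penalty (both assumptions for this example); the ensemble of gating layers proposed in the paper is omitted.

```python
# Minimal sketch of a differentiable gating layer for neural feature selection:
# one learnable gate per input feature, relaxed with a sigmoid and pushed toward
# sparsity by a penalty. The paper's ensemble of gating layers is omitted.
import torch
import torch.nn as nn

class FeatureGate(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_features))   # one gate per feature

    def forward(self, x):
        return x * torch.sigmoid(self.logits)    # soft relaxation of binary gates

torch.manual_seed(0)
X = torch.randn(256, 32)
y = (X[:, 0] + X[:, 1] > 0).long()               # only the first two features matter

gate, clf = FeatureGate(32), nn.Linear(32, 2)
opt = torch.optim.Adam(list(gate.parameters()) + list(clf.parameters()), lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(clf(gate(X)), y)
    loss = loss + 1e-2 * torch.sigmoid(gate.logits).sum()   # push unneeded gates toward 0
    loss.backward()
    opt.step()

print(torch.sigmoid(gate.logits).topk(4).indices)   # indices of the most-open gates
```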





Approximate Bayesian Optimisation for Neural Networks

A body of work has been done to automate machine learning algorithms and to highlight the importance of model choice. Automating the process of choosing the best forecasting model and its corresponding parameters can improve a wide range of real-world applications. Bayesian optimisation (BO) uses a black-box optimisation method to propose solutions according to an exploration-exploitation trade-off criterion through an acquisition function. The BO framework imposes two key ingredients: a probabilistic surrogate model that encodes prior belief about the unknown dynamics of the model, and an objective function that describes how optimal the model fit is. Choosing the best model and its associated hyperparameters can be very expensive, and the surrogate is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimisation requires many evaluations. In addition, most real-world datasets are non-stationary, which makes the idealistic assumptions of standard surrogate models unrealistic. Addressing analytical tractability and computational feasibility in a stochastic fashion ensures the efficiency and applicability of Bayesian optimisation. In this paper we explore the use of approximate inference with Bayesian Neural Networks as an alternative to GPs to model distributions over functions. Our contribution is to provide a link between density-ratio estimation and class probability estimation based on approximate inference; this reformulation provides algorithmic efficiency and tractability.

Presented by: Nadhir Hassen
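
To illustrate the general shape of Bayesian optimisation with a neural-network surrogate in place of a GP (only a sketch of the loop, not the paper's approximate-inference method), the snippet below uses a small ensemble of networks to supply mean and uncertainty estimates for a lower-confidence-bound acquisition. All sizes and constants are assumptions.

```python
# Sketch of Bayesian optimisation with a neural-network surrogate instead of a GP:
# an ensemble of small networks provides mean/uncertainty estimates that feed a
# lower-confidence-bound acquisition over a 1-D candidate grid.
import torch
import torch.nn as nn

torch.manual_seed(0)

def objective(x):                                   # 1-D function to minimise
    return torch.sin(3 * x) + 0.1 * x ** 2

def fit_ensemble(X, y, n_models=5, epochs=300):
    models = []
    for _ in range(n_models):
        net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)
        for _ in range(epochs):
            opt.zero_grad()
            nn.functional.mse_loss(net(X), y).backward()
            opt.step()
        models.append(net)
    return models

X = torch.rand(5, 1) * 6 - 3                        # initial random design on [-3, 3]
y = objective(X)
for step in range(10):
    models = fit_ensemble(X, y)
    cand = torch.linspace(-3, 3, 200).unsqueeze(1)  # candidate points
    with torch.no_grad():
        preds = torch.stack([m(cand) for m in models])
    mean, std = preds.mean(0), preds.std(0)
    acq = mean - 2.0 * std                          # lower confidence bound
    x_next = cand[acq.argmin()].unsqueeze(0)        # most promising candidate
    X, y = torch.cat([X, x_next]), torch.cat([y, objective(x_next)])

print("best x found:", X[y.argmin()].item())
```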





AdaSearch: Many-to-One Unified Neural Architecture Search via A Smooth Curriculum

The cost of Neural Architecture Search (NAS) has been largely reduced thanks to one-shot SuperNet methods that exploit the weight-sharing strategy as a proxy. However, challenges remain, especially when we target simultaneously deriving many high-performance yet diverse models that can meet various resource constraints by training just one SuperNet. If we treat the SuperNet as a large ensemble of candidate neural networks with highly varied complexity levels, a fundamental question arises: how might we train all of those neural networks (close) to their optimal performance, with their weights entangled due to sharing, under one unified optimization procedure? To tackle this question, we propose AdaSearch, whose idea is inspired by curriculum learning. Towards our end goal of training all neural networks well, we start by focusing SuperNet training on a “simple” subset of them, i.e., training a shallow SuperNet that consists of only low-complexity neural networks. We then iteratively grow and train the SuperNet, so that higher-complexity neural networks are gradually included and taken care of. For smoothly transitioning the SuperNet curriculum, we also develop a key enabling technique called SuperNet2SuperNet: using distillation to initialize the deeper SuperNet by inheriting knowledge from the shallower one at each step. AdaSearch demonstrates state-of-the-art accuracy-efficiency trade-offs on ImageNet, while significantly trimming down search GPU hours and CO2 emissions by reducing N searches to a single procedure.

Presented by: Chunhui Zhang
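
A crude sketch of the shallow-to-deep curriculum idea follows. It is not AdaSearch itself: sub-network sampling and distillation are omitted, plain weight inheritance via load_state_dict stands in for SuperNet2SuperNet, and the data, sizes, and growth schedule are assumptions made for this example.

```python
# Crude sketch of a shallow-to-deep SuperNet curriculum: train a shallow stack,
# then repeatedly build a deeper stack, inherit the already-trained shallow
# blocks, and continue training.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 16)
y = torch.randint(0, 3, (256,))                 # toy 3-class problem

def make_supernet(depth):
    blocks = [nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(depth)]
    return nn.Sequential(*blocks, nn.Linear(16, 3))

def train(net, steps=300):
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(net(X), y).backward()
        opt.step()
    return net

supernet = train(make_supernet(depth=1))        # start with the shallow curriculum stage
for depth in (2, 3):                            # gradually include deeper networks
    deeper = make_supernet(depth)
    # Inherit the already-trained shallow blocks before continuing training.
    deeper.load_state_dict(supernet.state_dict(), strict=False)
    supernet = train(deeper)
```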





Proxyless Neural Architecture Adaptation for Supervised Learning and Self-Supervised Learning

Recently, Neural Architecture Search (NAS) methods have been introduced and have shown impressive performance on many benchmarks. Among those NAS studies, the Neural Architecture Transformer (NAT) aims to adapt a given neural architecture to improve performance while maintaining computational cost. However, NAT lacks reproducibility and requires an additional architecture adaptation process before network weight training. In this paper, we propose a proxyless neural architecture adaptation method that is reproducible and efficient. Our method can be applied to both supervised learning and self-supervised learning. The proposed method shows stable performance on various architectures. Extensive reproducibility experiments on two datasets, i.e., CIFAR-10 and Tiny ImageNet, show that the proposed method clearly outperforms NAT and is applicable to other models and datasets.

Presented by: Do-Guk Kim



Workshop Format

Our single-day workshop, featuring invited speakers, panels, poster sessions, and presentations, will be held online and is open to all. At least one author of each accepted submission must register and present the paper online at the workshop.



Organizing Committee

Scott E. Fahlman
School of Computer Science, Carnegie Mellon University





Edouard Oyallon
Sorbonne Université – LIP6





Dean Alderucci
School of Computer Science, Carnegie Mellon University
dalderuc@cs.cmu.edu