Carnegie Mellon University

Webinar - Technical Overview of the Cerebras CS-1, the AI Compute Engine for Neocortex

Presented on Wednesday, August 19, 2020, 2:00 - 3:00 pm (ET), by Natalia Vassilieva, Ph.D. (Cerebras Systems Inc.)

In this webinar, we offer a technical deep dive into the Cerebras CS-1 system, the Wafer Scale Engine (WSE) chip, and the Cerebras Graph Compiler. The Cerebras CS-1 is the groundbreaking specialized AI hardware technology to be featured on Neocortex, PSC’s upcoming NSF-funded AI supercomputer. Neocortex, planned for deployment in late 2020, will enable on-chip model and data parallelism and will significantly accelerate the training of ambitious deep learning models.

For more information about Neocortex, please visit https://www.cmu.edu/psc/aibd/neocortex/.

View slides

Table of Contents

01:58 - Introduction
02:53 - Neocortex: System Overview
08:00 - Cerebras CS-1: Technical Overview
44:32 - Q&A

Webinar Q&A

We are posting every question that was asked during the webinar. Some of them are very similar and may have the same answer.

Access

Please see the following page for complete information about the Neocortex Early User Program.

Cerebras CS-1 Hardware

The Neocortex system architecture directly connects the HPE Superdome Flex (SD Flex) to the CS-1 systems via twelve 100GbE links. We anticipate very low-latency, high-bandwidth performance of the system, with 100GbE latency as a lower bound.

Additionally, the CS-1 I/O system was designed to transfer inbound TCP/IP traffic from its 100GbE links to our on-wafer protocol at line rate. We don’t have measured end-to-end SD Flex-to-CS-1 latency figures yet, but we can share those once we connect and deploy the systems.
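
As a rough back-of-the-envelope check (theoretical peak only, not a measured figure), the aggregate bandwidth of twelve 100GbE links works out as follows; real throughput will be lower due to Ethernet and TCP/IP protocol overhead.

    # Back-of-the-envelope aggregate bandwidth of twelve 100GbE links.
    # Theoretical line rate only; Ethernet/TCP-IP overhead reduces real throughput.
    num_links = 12
    link_gbits_per_s = 100                           # 100GbE line rate, gigabits per second

    aggregate_gbits = num_links * link_gbits_per_s   # 1200 Gb/s = 1.2 Tb/s
    aggregate_gbytes = aggregate_gbits / 8           # ~150 GB/s

    print(f"Aggregate line rate: {aggregate_gbits} Gb/s (~{aggregate_gbytes:.0f} GB/s)")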

Design for our first-generation WSE began more than two years ago. At that time we were working with 16nm as the cutting-edge process node with proven tooling and methods. It's a good foundation for a new wafer-scale integration project that still yields vastly higher performance for AI than contemporary processors.

That said, we are actively investigating and working with emergent process nodes such as 7nm for our subsequent generation WSE processors. See https://www.zdnet.com/article/cerebras-teases-second-generation-wafer-scale-ai-chip/ for a write-up of our recent announcement at Hot Chips about our 2nd generation device. More coming soon!

Yes, the memory is shared across the full wafer-scale processor.
Each WSE is designed with multiple layers of hardware and software redundancy (such as extra, redundant cores and lines of communication) that allow us to protect against and route around expected manufacturing defects in the chip, and effectively yield every wafer we get back from the fab into a production system.

The WSE has been designed for datacenter-grade uptime and reliability. Once powered up, qualified, and tested for production delivery, we expect years of successful operation. The redundancy we have built in to accommodate manufacturing defects can also detect and mitigate any issues that occur during use at the sub-wafer level.
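
As a toy illustration of the spare-core idea described above (this is not Cerebras’s actual redundancy scheme, and the core counts are made up), routing around known-defective cores can be thought of as remapping logical cores onto the healthy subset of physical cores:

    # Toy illustration (not Cerebras's actual scheme): map a logical row of cores
    # onto a physical row that contains spare cores, skipping known-defective ones.
    def build_core_map(num_logical, physical_defects, num_physical):
        """Return a logical->physical index map, or None if spares are exhausted."""
        mapping = []
        phys = 0
        while len(mapping) < num_logical and phys < num_physical:
            if phys not in physical_defects:
                mapping.append(phys)   # healthy core takes the next logical slot
            phys += 1                  # defective cores are simply skipped
        return mapping if len(mapping) == num_logical else None

    # 10 logical cores, 12 physical cores, cores 3 and 7 flagged defective at test time.
    print(build_core_map(10, {3, 7}, 12))   # -> [0, 1, 2, 4, 5, 6, 8, 9, 10, 11]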

We have also provisioned redundant, hot-swappable subsystem components where failure is more likely to occur, e.g., power supplies and fans with moving parts. These are easy to replace in the field to keep the CS-1 up and running.

It's actually much denser than a motherboard. And yes, the Cerebras engineers did a great job with cooling innovations!
This information is not currently public. Please feel free to contact Cerebras Systems directly to learn more about this topic.
We are actively investigating and working with emergent process nodes such as 7nm for our subsequent generation WSE processors. See https://www.zdnet.com/article/cerebras-teases-second-generation-wafer-scale-ai-chip/ for a write-up of our recent announcement at Hot Chips about our 2nd generation device. More coming soon!

Potential new approaches to computing such as DNA-based are interesting and exciting, but they're outside the scope of this project.

Cerebras CS-1 Software

Yes. The compilation time (as described in the slides) does vary from model to model. We have example models of various architectures for the user to try out.
Answered live: [51:50]
Answered in combination with the above question: [51:50]
The Cerebras compiler can operate on any given subsection of the full fabric: it will place a given model within a given budget of cores. When it is desirable to run multiple models on the CS-1 at the same time, the WSE will be partitioned into sections, and for each model the compiler will use only a portion of the full fabric.
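
As a toy sketch of this partitioning idea (not the compiler’s actual placement algorithm; the model names and core budgets below are made-up numbers), splitting the fabric among several models might look like:

    # Toy illustration only: split a fixed fabric core budget among several models.
    # Model names and core counts below are made up for the example.
    TOTAL_CORES = 400_000   # approximate WSE-1 core count

    requested = {"model_a": 150_000, "model_b": 100_000, "model_c": 120_000}

    def partition(requests, total):
        """Assign each model a contiguous slice of the core budget, or fail."""
        if sum(requests.values()) > total:
            raise ValueError("combined core budget exceeds the fabric size")
        partitions, start = {}, 0
        for name, cores in requests.items():
            partitions[name] = (start, start + cores)   # half-open [start, end) range
            start += cores
        return partitions

    print(partition(requested, TOTAL_CORES))
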
The default mode of execution on the CS-1 is with pre-compiled static graphs. When dynamic graphs are needed because of inputs of different sizes (such as sequences of variable length, or graphs with a variable number of nodes), we leverage the ability of our hardware to harvest sparsity for efficient processing of padded inputs. For other cases, the model (or parts of the model) can be re-compiled a few times throughout training.
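
To make the padded-input case concrete, here is a small NumPy sketch (illustrative only, not Cerebras code): variable-length sequences are zero-padded to a fixed shape, and the padding is exactly the kind of sparsity the hardware can skip.

    import numpy as np

    # Sketch (not Cerebras code): pad variable-length sequences to a fixed length.
    # The zero padding is sparse by construction, which is what sparsity-harvesting
    # hardware can skip instead of computing on useless padded entries.
    sequences = [np.array([1.0, 2.0, 3.0]),
                 np.array([4.0, 5.0]),
                 np.array([6.0, 7.0, 8.0, 9.0])]

    max_len = max(len(s) for s in sequences)
    batch = np.zeros((len(sequences), max_len), dtype=np.float32)
    for i, s in enumerate(sequences):
        batch[i, : len(s)] = s

    sparsity = 1.0 - np.count_nonzero(batch) / batch.size
    print(batch)
    print(f"fraction of padded (zero) entries: {sparsity:.2f}")
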
Answered live: [48:38]

Applications

There are no limits on input/output dimensionality per se; rather, the overall memory and I/O limits of the system apply. One should consider the memory footprint of a model: weights, gradients, optimizer states, and "in-flight" activations should fit into the WSE memory. Remember, the WSE is a dataflow architecture, so typically we have very few training samples "in-flight", even when training with large batch sizes. One should also consider the expected throughput on the WSE to derive the required I/O bandwidth to stream training samples.
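
As a rough sketch of that accounting (the byte widths and optimizer-state counts below are generic assumptions, not Cerebras-specific figures), a model's training memory footprint can be estimated as:

    # Rough, generic footprint estimate (assumptions, not Cerebras-specific figures):
    # weights + gradients + optimizer states + a small number of in-flight activations.
    def training_footprint_bytes(num_params,
                                 bytes_per_value=2,             # e.g. FP16 values
                                 optimizer_states_per_param=2,  # e.g. Adam: m and v
                                 inflight_activation_values=0):
        weights     = num_params * bytes_per_value
        gradients   = num_params * bytes_per_value
        optimizer   = num_params * optimizer_states_per_param * bytes_per_value
        activations = inflight_activation_values * bytes_per_value
        return weights + gradients + optimizer + activations

    # Example: a 500M-parameter model with Adam and a modest activation working set.
    total = training_footprint_bytes(500_000_000, inflight_activation_values=50_000_000)
    print(f"~{total / 2**30:.1f} GiB")   # compare against the WSE's on-chip memory
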
Answered live: [56:56]
Early support will focus on TensorFlow and PyTorch, based on their wide popularity for deep learning. Complementary aspects of workflows expressed in R will be well supported on Bridges-2, with which PSC's Neocortex system will be federated.
The system and stack are designed for DNN acceleration; however, if a spiking NN could be expressed in DL frameworks such as TF and PT, it might also benefit.

We have not yet done an in-depth analysis for spiking neural networks, and even for traditional ANNs there is no single answer to this question. The maximum possible number of neurons depends on many other factors: the type of neurons, the input size, and the choice of optimization algorithm.

For traditional ANNs we keep in memory all model parameters, gradients, optimizer states, and in-flight activations. For spiking neural networks there will be a lower memory footprint for activations (they need not be kept for backpropagation) and for communicated activations (spikes are binary and sparse), but higher memory requirements for the neurons themselves, as spiking neurons are dynamical systems with recurrent dynamics over a number of hidden states per neuron (the number of states depends on the complexity of the neuron model).

Further, the numerical precision of the state variables and the dynamics determine how much compute and memory a specific kind of spiking neuron needs. Moreover, SNNs often come with neural modulations that require additional resources. The interplay of these factors makes it difficult to predict the exact memory consumption of an arbitrary SNN, and thus to say what maximum number of spiking neurons and synapses can fit on the WSE.
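
As a hedged illustration of that interplay (all state counts and byte widths below are example assumptions, not measured figures), a back-of-the-envelope SNN memory estimate might look like:

    # Illustrative only: per-neuron/per-synapse memory estimate for an SNN.
    # All counts and byte widths are example assumptions, not measured figures.
    def snn_memory_bytes(num_neurons, num_synapses,
                         states_per_neuron=3,   # e.g. membrane potential, adaptation, refractory
                         bytes_per_state=4,     # e.g. FP32 state variables
                         bytes_per_synapse=2):  # e.g. FP16 weight per synapse
        neuron_state = num_neurons * states_per_neuron * bytes_per_state
        synapse_mem  = num_synapses * bytes_per_synapse
        return neuron_state + synapse_mem

    # Example: 1M neurons with 1,000 synapses each.
    print(f"~{snn_memory_bytes(1_000_000, 1_000_000_000) / 2**30:.1f} GiB")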

Yes, PSC's Neocortex and Bridges-2 systems will be closely federated to enable interoperation of AI with HPC simulation ("cognitive simulation"), full data analytics and machine learning workflows, online AI and analysis of streaming data, and analytics on large-scale data from instruments.

Performance

The CS-1 supports acceleration of fine-grain sparsity patterns (i.e., individual tensor elements). However, either deterministic or statistical structure would influence the actual speedup.
Under mild data conditions, the speedup scales with sparsity. Very high sparsity leads to high acceleration.
Sparsity levels and sparsity patterns do factor into the compilation and optimization process so as to achieve the best speedup. In other words, the same model with a different sparsity level and/or pattern might result in a different optimal hardware placement.
A 2x speed-up was achieved by combining ReLU and 25% induced sparsity. ReLU on its own led to a >1.5x speedup. Theoretically, without ReLU, 25% sparsity should give a 1.5x speedup, and 50% sparsity should give a 2x speedup. Because we induced sparsity while using ReLU, the overall sparsity was higher than 25% and 50%, which resulted in higher speed-ups.
Sparsity patterns in the case of the CS-1 are deterministic rather than random: sparse tensors are streamed in compressed sparse row (CSR) format.
The CS-1 can harvest both arbitrary fine-grain sparsity patterns and regularly structured sparsity patterns, while NVIDIA has demonstrated with Ampere efficient computation with regularly structured sparse weight tensors only, and only for inference (so far). With the CS-1 we demonstrate speedups by harvesting sparsity in activations and gradients. The user of the CS-1 has the freedom to choose between arbitrary fine-grain and regularly structured sparsity patterns. Regularly structured sparsity would result in a more favorable speedup.
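
To make the CSR format mentioned above concrete, here is a minimal example using SciPy's generic CSR implementation (this illustrates the format only, not Cerebras's on-wire encoding):

    import numpy as np
    from scipy.sparse import csr_matrix

    # Minimal CSR example (SciPy's generic implementation, not Cerebras's wire format).
    # Only the nonzero values plus their column indices and row offsets are stored/streamed.
    dense = np.array([[0., 2., 0., 0.],
                      [1., 0., 0., 3.],
                      [0., 0., 0., 0.]])

    sparse = csr_matrix(dense)
    print("data:   ", sparse.data)      # [2. 1. 3.]   nonzero values
    print("indices:", sparse.indices)   # [1 0 3]      column index of each nonzero
    print("indptr: ", sparse.indptr)    # [0 1 3 3]    row start offsets into data/indices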

Answered live: [53:00]

Yes, the speedups we observe on user workloads are mostly in comparison with the V100. Most of our users had been running on V100s before porting to the CS-1.
Answered live: [53:00]
This information is not currently public. Please feel free to contact Cerebras Systems directly to learn more about this topic.