Carnegie Mellon University

Webinar - Technical Overview of the Cerebras CS-1, the AI Compute Engine for Neocortex

Presented on Wednesday, August 19, 2020, 2:00 - 3:00 pm (ET), by Natalia Vassilieva, Ph.D. (Cerebras Systems Inc.)

In this webinar, we offer a technical deep dive into the Cerebras CS-1 system, the Wafer Scale Engine (WSE) chip, and the Cerebras Graph Compiler. The Cerebras CS-1 is the groundbreaking specialized AI hardware technology to be featured on Neocortex, PSC’s upcoming NSF-funded AI supercomputer. Neocortex, planned for deployment in late 2020, will enable on-chip model and data parallelism and will significantly accelerate the training of ambitious deep learning models.

For more information about Neocortex, please visit https://www.cmu.edu/psc/aibd/neocortex/.

View slides

Table of Contents

01:58 - Introduction
02:53 - Neocortex: System Overview
08:00 - Cerebras CS-1: Technical Overview
44:32 - Q&A

Webinar Q&A

We are posting every question that was asked during the webinar. Some of them are very similar and may have the same answer.

Access

Please see the following page for complete information about the Neocortex Early User Program.

Cerebras CS-1 Hardware

The Neocortex system architecture directly connects the HPE Superdome Flex (SD Flex) to the CS-1 systems via twelve 100GbE links. We anticipate very low-latency, high-bandwidth performance of the system, with 100GbE latency as a lower bound.

Additionally, the CS-1 I/O system was designed to transfer inbound TCP/IP traffic from its 100GbE links to our on-wafer protocol at line rate. We don’t have measured end-to-end SD Flex-to-CS-1 latency figures yet, but we can share those once we connect and deploy the systems.
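
As a rough back-of-the-envelope check (theoretical peak only, not a measured figure), the aggregate bandwidth of twelve 100GbE links works out as follows; real throughput will be lower due to Ethernet and TCP/IP protocol overhead.

    # Back-of-the-envelope aggregate bandwidth of twelve 100GbE links.
    # Theoretical line rate only; Ethernet/TCP-IP overhead reduces real throughput.
    num_links = 12
    link_gbits_per_s = 100                           # 100GbE line rate, gigabits per second

    aggregate_gbits = num_links * link_gbits_per_s   # 1200 Gb/s = 1.2 Tb/s
    aggregate_gbytes = aggregate_gbits / 8           # ~150 GB/s

    print(f"Aggregate line rate: {aggregate_gbits} Gb/s (~{aggregate_gbytes:.0f} GB/s)")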

Design for our first-generation WSE began more than two years ago. At that time we were working with 16nm as the cutting-edge process node with proven tooling and methods. It's a good foundation for a new wafer-scale integration project that still yields vastly higher performance for AI than contemporary processors.

That said, we are actively investigating and working with emergent process nodes such as 7nm for our subsequent generation WSE processors. See https://www.zdnet.com/article/cerebras-teases-second-generation-wafer-scale-ai-chip/ for a write-up of our recent announcement at Hot Chips about our 2nd generation device. More coming soon!

Yes, the memory is shared across the full wafer-scale processor.
Each WSE is designed with multiple layers of hardware and software redundancy (such as extra, redundant cores and lines of communication) that allow us to protect against and route around expected manufacturing defects in the chip, and effectively yield every wafer we get back from the fab into a production system.

The WSE has been designed for datacenter-grade uptime and reliability. Once powered up, qualified, and tested for production delivery, we expect years of successful operation. The redundancy we have built in to accommodate manufacturing defects can also detect and mitigate any issues that occur during use at the sub-wafer level.
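
As a toy illustration of the spare-core idea described above (this is not Cerebras’s actual redundancy scheme, and the core counts are made up), routing around known-defective cores can be thought of as remapping logical cores onto the healthy subset of physical cores:

    # Toy illustration (not Cerebras's actual scheme): map a logical row of cores
    # onto a physical row that contains spare cores, skipping known-defective ones.
    def build_core_map(num_logical, physical_defects, num_physical):
        """Return a logical->physical index map, or None if spares are exhausted."""
        mapping = []
        phys = 0
        while len(mapping) < num_logical and phys < num_physical:
            if phys not in physical_defects:
                mapping.append(phys)   # healthy core takes the next logical slot
            phys += 1                  # defective cores are simply skipped
        return mapping if len(mapping) == num_logical else None

    # 10 logical cores, 12 physical cores, cores 3 and 7 flagged defective at test time.
    print(build_core_map(10, {3, 7}, 12))   # -> [0, 1, 2, 4, 5, 6, 8, 9, 10, 11]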

We have also provisioned redundant, hot-swappable subsystem components where failure is more likely to occur, e.g., power supplies and fans with moving parts. These are easy to replace in the field to keep the CS-1 up and running.

It's actually much denser than a motherboard. And yes, the Cerebras engineers did a great job with cooling innovations!
This information is not currently public. Please feel free to contact Cerebras Systems directly to learn more about this topic.
We are actively investigating and working with emergent process nodes such as 7nm for our subsequent generation WSE processors. See https://www.zdnet.com/article/cerebras-teases-second-generation-wafer-scale-ai-chip/ for a write-up of our recent announcement at Hot Chips about our 2nd generation device. More coming soon!

Potential new approaches to computing such as DNA-based are interesting and exciting, but they're outside the scope of this project.

Cerebras CS-1 Software

Yes. The compilation time (as described in the slides) does vary from model to model. We have example models of various architectures for the user to try out.
Answered live: [51:50]
Answered in combination with the above question: [51:50]
The Cerebras compiler can operate on any given subsection of the full fabric: it will place a given model within a given budget of cores. When it is desirable to run multiple models on the CS-1 at the same time, the WSE will be partitioned into sections, and for each model the compiler will use only a portion of the full fabric.
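
As a toy sketch of this partitioning idea (not the compiler’s actual placement algorithm; the model names and core budgets below are made-up numbers), splitting the fabric among several models might look like:

    # Toy illustration only: split a fixed fabric core budget among several models.
    # Model names and core counts below are made up for the example.
    TOTAL_CORES = 400_000   # approximate WSE-1 core count

    requested = {"model_a": 150_000, "model_b": 100_000, "model_c": 120_000}

    def partition(requests, total):
        """Assign each model a contiguous slice of the core budget, or fail."""
        if sum(requests.values()) > total:
            raise ValueError("combined core budget exceeds the fabric size")
        partitions, start = {}, 0
        for name, cores in requests.items():
            partitions[name] = (start, start + cores)   # half-open [start, end) range
            start += cores
        return partitions

    print(partition(requested, TOTAL_CORES))
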
The default mode of execution on the CS-1 is with pre-compiled static graphs. When dynamic graphs are needed because of inputs of different sizes (such as sequences of variable length, or graphs with a variable number of nodes), we leverage the ability of our hardware to harvest sparsity for efficient processing of padded inputs. For other cases, the model (or parts of the model) can be re-compiled a few times throughout training.
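
To make the padded-input case concrete, here is a small NumPy sketch (illustrative only, not Cerebras code): variable-length sequences are zero-padded to a fixed shape, and the padding is exactly the kind of sparsity the hardware can skip.

    import numpy as np

    # Sketch (not Cerebras code): pad variable-length sequences to a fixed length.
    # The zero padding is sparse by construction, which is what sparsity-harvesting
    # hardware can skip instead of computing on useless padded entries.
    sequences = [np.array([1.0, 2.0, 3.0]),
                 np.array([4.0, 5.0]),
                 np.array([6.0, 7.0, 8.0, 9.0])]

    max_len = max(len(s) for s in sequences)
    batch = np.zeros((len(sequences), max_len), dtype=np.float32)
    for i, s in enumerate(sequences):
        batch[i, : len(s)] = s

    sparsity = 1.0 - np.count_nonzero(batch) / batch.size
    print(batch)
    print(f"fraction of padded (zero) entries: {sparsity:.2f}")
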
Answered live: [48:38]

Applications

There are no limits on input/output dimensionality per se; rather, the overall memory and I/O limits of the system apply. One should consider the memory footprint of a model: weights, gradients, optimizer states, and "in-flight" activations should fit into the WSE memory. Remember, the WSE is a dataflow architecture, so typically we have very few training samples "in-flight", even when training with large batch sizes. One should also consider the expected throughput on the WSE to derive the required I/O bandwidth to stream training samples.
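
As a rough sketch of that accounting (the byte widths and optimizer-state counts below are generic assumptions, not Cerebras-specific figures), a model's training memory footprint can be estimated as:

    # Rough, generic footprint estimate (assumptions, not Cerebras-specific figures):
    # weights + gradients + optimizer states + a small number of in-flight activations.
    def training_footprint_bytes(num_params,
                                 bytes_per_value=2,             # e.g. FP16 values
                                 optimizer_states_per_param=2,  # e.g. Adam: m and v
                                 inflight_activation_values=0):
        weights     = num_params * bytes_per_value
        gradients   = num_params * bytes_per_value
        optimizer   = num_params * optimizer_states_per_param * bytes_per_value
        activations = inflight_activation_values * bytes_per_value
        return weights + gradients + optimizer + activations

    # Example: a 500M-parameter model with Adam and a modest activation working set.
    total = training_footprint_bytes(500_000_000, inflight_activation_values=50_000_000)
    print(f"~{total / 2**30:.1f} GiB")   # compare against the WSE's on-chip memory
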
Answered live: [56:56]
Early support will focus on TensorFlow and PyTorch, based on their wide popularity for deep learning. Complementary aspects of workflows expressed in R will be well supported on Bridges-2, with which PSC's Neocortex system will be federated.
The system and stack are designed for DNN acceleration; however, if a spiking NN could be expressed in DL frameworks such as TF and PT, it might also benefit.

We have not yet done an in-depth analysis for spiking neural networks, and even for traditional ANNs there is no single answer to this question. The maximum possible number of neurons depends on many other factors: the type of neurons, the input size, and the choice of optimization algorithm.

For traditional ANNs we keep in memory all model parameters, gradients, optimizer states, and in-flight activations. For spiking neural networks there will be a lower memory footprint for activations (they need not be kept for backpropagation) and for communicated activations (spikes are binary and sparse), but higher memory requirements for the neurons themselves, as spiking neurons are dynamical systems with recurrent dynamics over a number of hidden states per neuron (the number of states depends on the complexity of the neuron model).

Further, the numerical precision of the state variables and the dynamics determine how much compute and memory a specific kind of spiking neuron needs. Moreover, SNNs often come with neural modulations that require additional resources. The interplay of these factors makes it difficult to predict the exact memory consumption of an arbitrary SNN, and thus to say what maximum number of spiking neurons and synapses can fit on the WSE.
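
As a hedged illustration of that interplay (all state counts and byte widths below are example assumptions, not measured figures), a back-of-the-envelope SNN memory estimate might look like:

    # Illustrative only: per-neuron/per-synapse memory estimate for an SNN.
    # All counts and byte widths are example assumptions, not measured figures.
    def snn_memory_bytes(num_neurons, num_synapses,
                         states_per_neuron=3,   # e.g. membrane potential, adaptation, refractory
                         bytes_per_state=4,     # e.g. FP32 state variables
                         bytes_per_synapse=2):  # e.g. FP16 weight per synapse
        neuron_state = num_neurons * states_per_neuron * bytes_per_state
        synapse_mem  = num_synapses * bytes_per_synapse
        return neuron_state + synapse_mem

    # Example: 1M neurons with 1,000 synapses each.
    print(f"~{snn_memory_bytes(1_000_000, 1_000_000_000) / 2**30:.1f} GiB")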

Yes, PSC's Neocortex and Bridges-2 systems will be closely federated to enable interoperation of AI with HPC simulation ("cognitive simulation"), full data analytics and machine learning workflows, online AI and analysis of streaming data, and analytics on large-scale data from instruments.

Performance

The CS-1 supports acceleration of fine-grain sparsity patterns (i.e., individual tensor elements). However, either deterministic or statistical structure would influence the actual speedup.
Under mild data conditions, the speedup scales with sparsity. Very high sparsity leads to high acceleration.
Sparsity levels and sparsity patterns do factor into the compilation and optimization process so as to achieve the best speedup. In other words, the same model with a different sparsity level and/or pattern might result in a different optimal hardware placement.
A 2x speed-up was achieved by combining ReLU and 25% induced sparsity. ReLU on its own led to a >1.5x speedup. Theoretically, without ReLU, 25% sparsity should give a 1.5x speedup, and 50% sparsity should give a 2x speedup. Because we induced sparsity while using ReLU, the overall sparsity was higher than 25% and 50%, which resulted in higher speed-ups.
Sparsity patterns in the case of the CS-1 are deterministic rather than random: sparse tensors are streamed in compressed sparse row (CSR) format.
The CS-1 can harvest both arbitrary fine-grain sparsity patterns and regularly structured sparsity patterns, while NVIDIA has demonstrated with Ampere efficient computation with regularly structured sparse weight tensors only, and only for inference (so far). With the CS-1 we demonstrate speedups by harvesting sparsity in activations and gradients. The user of the CS-1 has the freedom to choose between arbitrary fine-grain and regularly structured sparsity patterns. Regularly structured sparsity would result in a more favorable speedup.
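
To make the CSR format mentioned above concrete, here is a minimal example using SciPy's generic CSR implementation (this illustrates the format only, not Cerebras's on-wire encoding):

    import numpy as np
    from scipy.sparse import csr_matrix

    # Minimal CSR example (SciPy's generic implementation, not Cerebras's wire format).
    # Only the nonzero values plus their column indices and row offsets are stored/streamed.
    dense = np.array([[0., 2., 0., 0.],
                      [1., 0., 0., 3.],
                      [0., 0., 0., 0.]])

    sparse = csr_matrix(dense)
    print("data:   ", sparse.data)      # [2. 1. 3.]   nonzero values
    print("indices:", sparse.indices)   # [1 0 3]      column index of each nonzero
    print("indptr: ", sparse.indptr)    # [0 1 3 3]    row start offsets into data/indices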

Answered live: [53:00]

Yes, the speedups we observe on user workloads are mostly in comparison with the V100. Most of our users had been running on V100s before porting to the CS-1.
Answered live: [53:00]
This information is not currently public. Please feel free to contact Cerebras Systems directly to learn more about this topic.