Webinar - Technical Overview of the Cerebras CS-1, the AI Compute Engine for Neocortex
Presented on Wednesday, August 19, 2020, 2:00 - 3:00 pm (ET), by Natalia Vassilieva, Ph.D. (Cerebras Systems Inc.)
In this webinar, we offer a technical deep dive into the Cerebras CS-1 system, the Wafer Scale Engine (WSE) chip, and the Cerebras Graph Compiler. The Cerebras CS-1 is the groundbreaking specialized AI hardware technology to be featured on Neocortex, PSC’s upcoming NSF-funded AI supercomputer. Neocortex, planned for deployment in late 2020, will enable on-chip model and data parallelism and will significantly accelerate the training of ambitious deep learning.
For more information about Neocortex, please visit https://www.cmu.edu/psc/aibd/neocortex/.
Table of Contents
01:58 - Introduction
Webinar Q&A
We are posting every question asked during the webinar. Some of them are very similar and may share the same answer.
Access
Can you describe the early user program, or provide link to information?
Cerebras CS-1 Hardware
The System Overview slide shows the bandwidth between the parts of the system. Could you also tell us what the expected latency is from HPE to CS-1?
The Neocortex system architecture directly connects the HPE Superdome Flex (SD Flex) to the CS-1 systems via twelve 100GbE links. We anticipate very low latency and high bandwidth across the system, with single-link 100GbE latency as a lower bound.
Additionally, the CS-1 IO system was designed to transfer inbound TCP/IP traffic from its 100GbE links to our on-wafer protocol at line rate. We don't yet have measured end-to-end latency figures for SD Flex to CS-1, but we can share those once the systems are connected and deployed.
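As a rough sanity check on the aggregate bandwidth implied by the answer above, a minimal back-of-envelope calculation (the twelve 100GbE links are from the answer; ignoring protocol overhead is an assumption of this sketch):

```python
# Back-of-envelope aggregate bandwidth of twelve 100GbE links.
# Assumption: raw line rate only; TCP/IP and Ethernet framing
# overhead are ignored for this illustration.
links = 12
line_rate_gbps = 100                      # gigabits per second per link

aggregate_gbps = links * line_rate_gbps   # total gigabits per second
aggregate_gbytes = aggregate_gbps / 8     # total gigabytes per second

print(f"{aggregate_gbps} Gb/s aggregate, i.e. {aggregate_gbytes} GB/s")
```

So the twelve links together offer on the order of 1.2 Tb/s (150 GB/s) of raw inbound bandwidth before protocol overhead.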
Interesting to see that 16 nm process is used when there are smaller processes available now. Any reason for that?
Design for our first-generation WSE began more than two years ago. At that time we were working with 16nm as the cutting-edge process node with proven tooling and methods. It's a good foundation for a new wafer-scale integration project that still yields vastly higher performance for AI than contemporary processors.
That said, we are actively investigating and working with emergent process nodes such as 7nm for our subsequent generation WSE processors. See https://www.zdnet.com/article/cerebras-teases-second-generation-wafer-scale-ai-chip/ for a write-up of our recent announcement at Hot Chips about our 2nd generation device. More coming soon!
I would expect higher yields at 16nm (just guessing).
Guessing yield...
Is that shared memory across all chips?
Doesn’t using a single wafer for one system produce poor yields?
What happens if there is a silicon defect? Does the software mark bad cores or routes and avoid using them?
Is there data about the failure rate of this wafer scale engine/chip (compared to say other cpus/gpus)? This has a single point of failure for the whole system.
The WSE has been designed for datacenter-grade uptime and reliability. Once powered up, qualified, and tested for production delivery, we expect years of successful operation. The redundancy we have built in to accommodate manufacturing defects can also detect and mitigate any issues that occur during use at the sub-wafer level.
We have also provisioned redundant, hot-swappable subsystem components where failure is more likely to occur, e.g. power supplies and fans with moving parts. These are easy to replace in the field to keep the CS-1 up and running.
So, your system has to cool a “motherboard” producing about 20 kW? If so, that is impressive.
What is the word length of the FMAC input operands (FP32, FP16, INT8 etc)?
Any plan on using a process smaller than 16 nm, or for that matter smaller than the optical diffraction limit, by looking into bacteria and virus assembly?
Potential new approaches to computing such as DNA-based are interesting and exciting, but they're outside the scope of this project.
Cerebras CS-1 Software
Did you evaluate the configuration time needed for setting up the system with the network to be accelerated?
Does the compilation/routing process work with non-sequential models? For instance, functional-type networks in TF which may have multiple outputs and loss functions
Would also like to add: how would the kernel mapping work with these Functional API models?
How does placement/routing work with multiple users (multiple executables)?
How efficient is the kernel compilation? It is important to know if using dynamic graphs.
How does this optimize conditional computation?
Applications
What is the limiting factor for dimensionality of input and output? I.e. how many bytes for a single batch of input/output?
What happens if the network kernels don't all fit onto the wafer?
Does the system support the R language framework?
Are you also addressing the models that biologists make in Neuron and related languages?
CS-1 was presented at the 2020 Telluride Neuromorphic Cognition Engineering workshop. What is the maximum number of spiking neurons (eg. Hodgkin Huxley) and synapses that can be implemented on the WSE?
We haven't yet done an in-depth analysis for spiking neural networks, and even for traditional ANNs there is no single answer to this question. The maximum possible number of neurons depends on many factors: the type of neurons, the input size, and the choice of optimization algorithm.
For traditional ANNs, we keep all model parameters, gradients, optimizer states, and in-flight activations in memory. Spiking neural networks will have a lower memory footprint for stored activations (they need not be kept for backpropagation) and for communicated activations (spikes are binary and sparse), but higher memory requirements for the neurons themselves, because spiking neurons are dynamical systems with recurrent dynamics over a number of hidden states per neuron (the number of states depends on the complexity of the neuron model).
Further, the numerical precision of the state variables and the dynamics determine how much compute and memory a specific kind of spiking neuron needs. Moreover, SNNs often come with neural modulation that requires additional resources. The interplay of these factors makes it difficult to predict the exact memory consumption of an arbitrary SNN, and thus to say what the maximum number of spiking neurons and synapses is that one can fit on the WSE.
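The accounting described above can be sketched as a back-of-envelope estimator. The byte counts, state counts, and helper functions below are illustrative assumptions for the sketch, not Cerebras specifications; only the list of stored quantities (parameters, gradients, optimizer states, activations, per-neuron hidden states) comes from the answer:

```python
# Illustrative memory models for fitting a network into a fixed
# on-chip memory budget. All byte counts and state counts are
# assumptions for this sketch, not Cerebras figures.

def ann_bytes(params, activations, bytes_per_value=2, optimizer_states=2):
    """Traditional ANN: weights + gradients + optimizer states,
    plus in-flight activations kept for backpropagation."""
    per_param = bytes_per_value * (1 + 1 + optimizer_states)  # weight, grad, opt
    return params * per_param + activations * bytes_per_value

def snn_bytes(neurons, synapses, hidden_states_per_neuron=4, bytes_per_value=2):
    """Spiking network: synaptic weights plus per-neuron dynamical state.
    Spike activations are binary and sparse, so they are omitted here."""
    return (synapses * bytes_per_value
            + neurons * hidden_states_per_neuron * bytes_per_value)

budget = 18 * 1024**3  # 18 GB of on-chip memory (the public WSE-1 figure)
fits = snn_bytes(neurons=1_000_000, synapses=100_000_000) <= budget
print("1M neurons / 100M synapses fit under this model:", fits)
```

Varying `hidden_states_per_neuron` or `bytes_per_value` shows how quickly the answer changes with the neuron model and numerical precision, which is exactly why no single maximum neuron count can be quoted.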
How does the Cerebras software stack interact with the classic HPC stack (MPI, etc.)? Is it possible to use one or more CS-1 systems within a larger HPC simulation?
Performance
How does sparse compute speed differ with sparsity pattern? Is the optimal pattern some form of block-sparse pattern?
Have you tried with higher levels of sparsity, say 75%-90%?
Is the compilation and optimization specific to sparsity patterns (i.e. recompile per pattern) or is it generic sparsity?
In the performance comparison with several levels of sparsity in the network, we see that 25% sparsity yields nearly a 2x speedup. But why does 50% sparsity yield less than a 2.5x speedup?
Is there a mechanism to inject random noise in an efficient way, similar to your treatment of sparsity?
How does CS-1 sparsity harvesting compare with NVIDIA Ampere sparse computations?
Are the performance comparisons vs. V100? which GPU?
Answered live: [53:00]
Yes, the speedups we observe on user workloads are mostly in comparison with V100. Most of our users have been running on V100s before porting to CS-1.