TRACE Capabilities
Architecture, hardware, and software
Cluster Architecture
32 x Dell PowerEdge R7525 servers
- 28 for direct compute, 4 for virtual machines
- Optimized for GPU-enhanced computation
24 x Dell PowerEdge R7625 servers
- Optimized for CPU-intensive computation
CPUs: 6,400 cores (physical)
- 32 x Dual AMD EPYC 7713 (64-core @ 2.0 GHz)
- 24 x Dual AMD EPYC 9474F (48-core @ 3.6 GHz)
GPUs: ~322k CUDA cores (see the sanity check at the end of this section)
- 28 x NVIDIA Ampere A40 (optimized for direct compute)
- 4 x NVIDIA Ampere A16 (optimized for VMs)
RAM: >80 TB total installed
- 32 x 2 TB/node, 3200 MT/s RDIMMs
- 24 x 768 GB/node, 4800 MT/s RDIMMs
Disk: 100 TB VAST storage array
- Expandable on demand
Network (wide-area)
- Connected to campus via a primary 100 Gbps line
- Secondary 10 Gbps lines for redundancy
- Maintains campus IP addresses
- Login from campus or through VPN, using CMU credentials (see the sketch below)
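For illustration, the login step can also be scripted; the sketch below uses the paramiko SSH library, and the hostname is a placeholder rather than the documented login address (VPN or on-campus network access is assumed to be in place):

    # Minimal SSH login sketch with paramiko. "trace.cmu.edu" is a
    # placeholder hostname; use the login address from the cluster docs.
    import getpass
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # trusts unknown hosts; fine for a sketch
    client.connect(
        "trace.cmu.edu",                        # placeholder login host
        username="andrewid",                    # your CMU Andrew ID
        password=getpass.getpass("CMU password: "),
    )
    _, stdout, _ = client.exec_command("hostname")
    print(stdout.read().decode())
    client.close()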
Network (cluster internal)
- 200 Gbps InfiniBand connecting all compute nodes and storage
- 100 Gbps Ethernet connecting all nodes and storage
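As a quick sanity check, the headline totals above follow directly from the per-node figures. The per-GPU CUDA core counts used below (10,752 per A40; 4 x 1,280 per quad-GPU A16 board) are NVIDIA's published numbers, not part of the spec above:

    # Sanity-check the headline totals from the per-node specs.
    cpu_cores = 32 * 2 * 64 + 24 * 2 * 48   # dual-socket nodes, physical cores only
    print(cpu_cores)                        # 6400

    cuda_cores = 28 * 10752 + 4 * 4 * 1280  # 28 A40s + 4 quad-GPU A16 boards
    print(cuda_cores)                       # 321536, i.e. ~322k

    ram_tb = 32 * 2 + 24 * 768 / 1000       # 2 TB/node (R7525) + 768 GB/node (R7625)
    print(round(ram_tb, 1))                 # 82.4, i.e. >80 TB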
Hardware Networking
Ethernet
- Three (3) Cisco Nexus 9364C Ethernet Switches
- 64 ports each
- 100 Gbps per port
- Supporting hybrid (CPU/GPU) nodes
- Six (6) Cisco 93600CD-GX Ethernet Switches
- 36 ports each
- 100 Gbps per port
- Supporting CPU-only nodes
InfiniBand
- Two (2) NVIDIA/Mellanox Quantum QM8700 InfiniBand Switches
- 40 ports each
- 200 Gbps (HDR) per port
- One switch supporting hybrid (CPU/GPU) nodes
- One switch supporting CPU-only nodes
- Switches are independent (each on its own subnet)
Hardware Storage - VAST
- All-flash storage array
- 676 TB capacity
- Ethernet and InfiniBand network connectivity (8+8 ports at 100 Gbps each)
- Multi-protocol: NFS, CIFS, Object (see the object-access sketch after this list)
- Data reduction through deduplication and compression
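Since VAST's object protocol is S3-compatible, a standard client such as boto3 can address the array directly. Everything in this sketch (endpoint URL, bucket name, credentials) is a placeholder, not an actual TRACE value:

    # Hedged sketch of S3-style object access to the array; the endpoint,
    # bucket, and credentials below are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://vast.example.cmu.edu",  # placeholder endpoint
        aws_access_key_id="YOUR_KEY",                 # placeholder credentials
        aws_secret_access_key="YOUR_SECRET",
    )
    s3.upload_file("results.csv", "my-bucket", "runs/results.csv")
    for obj in s3.list_objects_v2(Bucket="my-bucket").get("Contents", []):
        print(obj["Key"], obj["Size"])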
Software
- Compute nodes and login nodes: Red Hat Enterprise Linux 8
- Groups are expected to install their research-specific software locally in their $group folder (see the sketch after this list)
- Each group is separately responsible for the appropriate use of its software, in accordance with applicable licensing
- A central repository of HPC productivity software is hosted on the VAST array and is available to all users of the cluster
- Software currently licensed by CMU Central IT can be made available if there is sufficient interest
- Software version management is provided by environment modules
- Documentation is available at: https://cmu-enterprise.atlassian.net/wiki/spaces/TPR/pages/2338652272/Software
- Virtual machine environment: VMware ESXi 7.0 U3
- Questions regarding software installation should be sent to trace-it-help@andrew.cmu.edu
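As one example of a group-local installation, a Python virtual environment can be created inside the group folder. The path below is a placeholder for wherever $group resolves on the cluster:

    # Create a group-local Python environment using only the standard library.
    # "/path/to/group" is a placeholder for the group's $group folder.
    import venv
    from pathlib import Path

    env_dir = Path("/path/to/group") / "envs" / "myproject"
    venv.create(env_dir, with_pip=True)   # bootstrap an isolated environment with pip
    print(f"Activate with: source {env_dir}/bin/activate")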
Storage
Group storage
User storage
- Available to all users with an active group assignment
- 25 GB quota (default)
- Additional quota may be requested as part of group quota expansion
- Expected use includes batch scripts, source code, parameter files, and personally installed software not shared with others
- Hosted on the VAST array
- Snapshots of home directories are taken daily. Instructions on recovering files from snapshots (sketched below) can be found at:
https://cmu-enterprise.atlassian.net/wiki/spaces/TPR/pages/2338652233/File+Spaces
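For illustration, recovering a file from a snapshot typically amounts to copying it out of a read-only snapshot directory. The .snapshot path and the snapshot name below are assumptions (VAST commonly exposes NFS snapshots this way); the File+Spaces page above is authoritative:

    # Hedged sketch: restore one file from a daily home-directory snapshot.
    # The ".snapshot" directory and snapshot name are assumptions.
    import shutil
    from pathlib import Path

    home = Path.home()
    snapshot = home / ".snapshot" / "daily_2024-01-15"   # hypothetical snapshot name
    lost_file = "analysis/params.yaml"                   # hypothetical file to restore

    dst = home / lost_file
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(snapshot / lost_file, dst)              # copy preserves metadata
    print(f"Restored {dst}")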