Carnegie Mellon University

Cluster Architecture

32 x Dell PowerEdge R7525 servers

  • 28 for direct compute, 4 for virtual machines
  • Optimized for GPU-enhanced computation

24 x Dell PowerEdge R7625 servers

  • Optimized for CPU-intensive computation

CPUs: 6,400 cores (physical)

  • 32 x Dual AMD EPYC 7713 (64-core @ 2.0GHz)
  • 24 x Dual AMD EPYC 9474F (48-core @ 3.6GHz)

GPUs: ~322k CUDA cores

  • 28 x NVIDIA Ampere A40 (optimized for direct compute)
  • 4 x NVIDIA Ampere A16 (optimized for VMs)

RAM: >80 TB total installed

  • 32 x 2TB/node, 3200MT/s RDIMMs
  • 24 x 768GB/node, 4800MT/s RDIMMs
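
The headline totals follow from the per-node figures above; a quick back-of-the-envelope check in Python (the per-GPU CUDA core counts are NVIDIA's published specifications for the A40 and A16, not figures stated on this page):

    # Back-of-the-envelope check of the totals above.
    # Per-GPU CUDA core counts are NVIDIA's published specs, not cluster data.

    # CPU cores: dual-socket nodes
    cpu_cores = 32 * 2 * 64 + 24 * 2 * 48    # R7525 (EPYC 7713) + R7625 (EPYC 9474F)
    print(cpu_cores)                         # 6400

    # GPU CUDA cores: A40 = 10,752; A16 = 4 GPUs x 1,280 = 5,120 per board
    cuda_cores = 28 * 10_752 + 4 * 5_120
    print(cuda_cores)                        # 321536 (~322k)

    # RAM: 2 TB per R7525 node, 768 GB per R7625 node
    ram_tb = 32 * 2 + 24 * 0.768
    print(ram_tb)                            # 82.432 (>80 TB)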

Disk: 100TB VAST storage array

  • Expandable on demand

Network (wide-area)

  • Connected to campus via a primary 100 Gb/s link
    • Secondary 10 Gb/s links for redundancy
  • Maintains campus IP addresses
  • Login from campus or through VPN, using CMU credentials
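
For scripted access, a minimal sketch using Paramiko is shown below; the hostname and username are placeholders (this page does not give the login node's address), and interactive users would normally just use a standard SSH client:

    # Minimal sketch of a scripted SSH login. Hostname and username are
    # PLACEHOLDERS -- substitute the cluster's actual login address and your
    # Andrew ID. Authentication uses CMU credentials.
    import getpass
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        "login.cluster.cmu.edu",                  # placeholder hostname
        username="your-andrew-id",                # placeholder Andrew ID
        password=getpass.getpass("CMU password: "),
    )
    _, stdout, _ = client.exec_command("hostname")
    print(stdout.read().decode().strip())
    client.close()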

Network (cluster internal)

  • 200 Gb/s InfiniBand connecting all compute nodes and storage
  • 100 Gb/s Ethernet connecting all nodes and storage

Hardware Networking

Ethernet

  • Three (3) Cisco Nexus 9364C Ethernet Switches
    • 64 ports each
    • 100 Gbps per port
    • Supporting hybrid (CPU/GPU) nodes
  • Six (6) Cisco Nexus 93600CD-GX Ethernet Switches
    • 36 ports each
    • 100 Gbps per port
    • Supporting CPU-only nodes

InfiniBand

  • Two (2) NVIDIA/Mellanox Quantum QM8700 InfiniBand Switches
    • 40 ports each
    • 200 Gbps (HDR) per port
    • One switch supporting hybrid (CPU/GPU) nodes
    • One switch supporting CPU-only nodes
    • Switches are independent (each on its own subnet)

Hardware Storage - VAST

  • All-flash storage array
  • 676 TB capacity
  • Ethernet and InfiniBand network connectivity (8+8 ports at 100 Gb/s each)
  • Multi-protocol: NFS, CIFS, and Object
  • Data reduction through deduplication and compression

Software

  • Compute nodes and login nodes – Red Hat Enterprise Linux 8
  • Groups are expected to install their research-specific software locally in their $group folder (see the sketch after this list)
    • Each group is responsible for the appropriate use of its software, in accordance with applicable licensing
  • Central repository for HPC productivity software is hosted on the VAST array and is available to all users of the cluster
    • Software currently licensed by CMU Central IT can be made available if there is sufficient interest
  • Virtual Machine environment – VMware ESXi 7.0_U3
  • Questions regarding software installation should be sent to trace-it-help@andrew.cmu.edu
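
As a minimal sketch of installing research software into a group folder (referenced in the list above), the example below installs a Python package under a placeholder group path using pip's standard --prefix option; adapt the path and package to your group's setup:

    # Sketch: install a Python package under the group's folder instead of the
    # system site-packages. GROUP_PREFIX is a PLACEHOLDER path -- substitute
    # your group's actual folder on the VAST array.
    import os
    import subprocess
    import sys

    GROUP_PREFIX = "/path/to/your/group/software"      # placeholder

    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--prefix", GROUP_PREFIX, "numpy"],
        check=True,
    )

    # Make the group-installed packages importable in later scripts/jobs
    # (or set PYTHONPATH to this directory in your batch scripts).
    site_packages = os.path.join(
        GROUP_PREFIX,
        "lib",
        f"python{sys.version_info.major}.{sys.version_info.minor}",
        "site-packages",
    )
    sys.path.insert(0, site_packages)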

Storage

Group storage

  • 250 GB group-wide quota (default)
  • Expanded quota may be requested in 1 TB increments
  • Groups are created through the allocation system
  • All files are deleted after a 90-day grace period following the end of the group allocation
  • No backups
  • Hosted on the VAST array
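
A minimal sketch for estimating how much of the group quota a directory tree is using (the group path is a placeholder; this simply sums file sizes and is not the authoritative quota accounting):

    # Sketch: estimate usage of a group directory against the 250 GB quota.
    # GROUP_DIR is a PLACEHOLDER path -- substitute your group's actual folder.
    import os

    GROUP_DIR = "/path/to/your/group"     # placeholder
    QUOTA_GB = 250                        # default group-wide quota

    total_bytes = 0
    for root, _dirs, files in os.walk(GROUP_DIR):
        for name in files:
            try:
                total_bytes += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass                      # skip files that vanish or are unreadable

    print(f"{total_bytes / 1e9:.1f} GB used of {QUOTA_GB} GB group quota")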

User storage

  • Available to all users with an active group assignment
  • 25 GB quota (default)
  • Additional quota may be requested as part of group quota expansion
  • Expected use includes batch scripts, source code, parameter files, and personally installed software not shared with others
  • Hosted on the VAST array
  • Snapshots of home directories are taken daily. Instructions on recovering files from snapshots can be found at:
    https://cmu-enterprise.atlassian.net/wiki/spaces/TPR/pages/2338652233/File+Spaces
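
As a minimal illustration (the snapshot path below is a placeholder; the linked page documents where snapshots are actually exposed), recovering a file amounts to copying it out of a snapshot back into the home directory:

    # Sketch: restore a file from a daily home-directory snapshot.
    # The snapshot path is a PLACEHOLDER -- see the linked wiki page for the
    # actual snapshot location and naming on this cluster.
    import os
    import shutil

    snapshot_copy = "/path/to/snapshot/myfile.txt"    # placeholder
    restore_to = os.path.expanduser("~/myfile.txt")

    shutil.copy2(snapshot_copy, restore_to)           # copy2 preserves metadata
    print(f"restored {snapshot_copy} to {restore_to}")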