Domain Knowledge
This section introduces the domain knowledge you may encounter in this lecture:
- Introduction to the HPC Cluster
- Homogeneous and Heterogeneous
- What is PVE
- Hands-on (demo)
Introduction to the HPC Cluster
An HPC (High-Performance Computing) cluster is a group of networked computers that collaborate to perform tasks requiring significant computational power. These clusters are commonly used in fields such as scientific research, engineering simulations, big data analysis, weather forecasting, and artificial intelligence training—tasks that a single ordinary computer cannot efficiently complete.
Basic Components of an HPC Cluster
- Nodes
- An HPC cluster consists of multiple compute nodes, each typically an independent server.
- Nodes are categorized into different types:
- Compute Nodes
- Responsible for executing the actual computational tasks.
- Head/Control Node
- Used for task scheduling, resource management, and user interaction.
- Storage Nodes
- Handle data storage and management.
- Login Nodes
- Serve as the entry point for users to log in and submit tasks.
- High-Speed Interconnect
- Nodes are connected via a high-speed network (e.g., InfiniBand, Ethernet, or fiber optic networks) to enable low-latency and high-throughput data transmission.
- Common network topologies include Fat Tree, Torus, or Mesh.
<aside>
ADDITION: "Introduction to InfiniBand™" (network.nvidia.com)
</aside>
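Interconnect performance is often summarized by two numbers: per-message latency and sustained bandwidth. A minimal sketch of the classic latency-plus-bandwidth (alpha-beta) cost model is below; the link parameters used in the example are illustrative assumptions, not vendor specifications.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to move one message across the interconnect:
    fixed startup latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Compare an InfiniBand-class link to a commodity Ethernet link for a
# 1 MiB message (both parameter sets are rough assumptions).
ib = transfer_time(2**20, latency_s=1e-6, bandwidth_bytes_per_s=25e9)
eth = transfer_time(2**20, latency_s=50e-6, bandwidth_bytes_per_s=1.25e9)
```

The model shows why latency matters for small messages (the fixed term dominates) and bandwidth for large ones, which is exactly the trade-off low-latency fabrics like InfiniBand target.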
- Storage System
- HPC clusters are typically equipped with parallel file systems (e.g., Lustre, GPFS, or BeeGFS) to support large-scale data access.
- Storage can be local disks, Network-Attached Storage (NAS), or Storage Area Network (SAN).
<aside>
ADDITION: Apply the concept of HPC to companies related to data storage
</aside>
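Parallel file systems such as Lustre achieve high throughput by striping a file across several storage targets, so different clients can read different chunks from different servers at once. A minimal sketch of round-robin stripe placement, under assumed chunk and target counts (the function name is hypothetical, not a real Lustre API):

```python
def stripe_map(file_size, stripe_size, n_targets):
    """Map each stripe-sized chunk of a file to a storage target,
    round-robin. Returns a list of (byte_offset, target_index)."""
    n_chunks = -(-file_size // stripe_size)  # ceiling division
    return [(i * stripe_size, i % n_targets) for i in range(n_chunks)]

# A 10-byte file with 4-byte stripes over 3 targets: each chunk lands
# on a different target, so all three can be read in parallel.
layout = stripe_map(10, 4, 3)
```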
- Software Stack
- Job Scheduler: Tools like SLURM, PBS, or LSF are used to allocate resources and manage task queues.
- Parallel Programming Tools: Such as MPI (Message Passing Interface) or OpenMP, which enable programs to run in parallel across multiple nodes.
- Operating System: Usually Linux-based, due to its open-source nature and efficiency.
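MPI and OpenMP are the standard tools on real clusters; as a single-machine analogy (not MPI itself), Python's `multiprocessing` module shows the same split-compute-combine pattern that parallel programs follow:

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Work done by one worker: sum its assigned sub-range."""
    lo, hi = bounds
    return sum(range(lo, hi))

def parallel_sum(n, workers=4):
    """Split [0, n) into `workers` ranges, compute each partial sum in
    its own process, then combine the results (a reduce step, analogous
    to MPI_Reduce)."""
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

On a cluster, the same decomposition would place each worker on a different node and exchange partial results over the interconnect via MPI.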
How an HPC Cluster Works
1. Users submit tasks (referred to as “jobs”) via the login node.
2. The job scheduler assigns tasks to compute nodes based on resource availability and priority.
3. Compute nodes work collaboratively, processing data through parallel computing, and then return the results to the user or storage system.
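The scheduling step above can be sketched as a priority queue over pending jobs: a toy model only, with hypothetical names (real schedulers like SLURM also handle time limits, fairness, and preemption).

```python
import heapq

class Scheduler:
    """Toy job scheduler: higher-priority jobs start first; jobs that
    do not fit stay queued, while smaller jobs behind them may still
    start (a simple form of backfill)."""

    def __init__(self, free_nodes):
        self.free = free_nodes
        self.queue = []   # min-heap of (-priority, submit_order, name, nodes)
        self.seq = 0      # submit order breaks ties within a priority

    def submit(self, name, nodes, priority=0):
        heapq.heappush(self.queue, (-priority, self.seq, name, nodes))
        self.seq += 1

    def dispatch(self):
        """Start every queued job that fits in the free nodes."""
        started, pending = [], []
        while self.queue:
            prio, seq, name, nodes = heapq.heappop(self.queue)
            if nodes <= self.free:
                self.free -= nodes
                started.append(name)
            else:
                pending.append((prio, seq, name, nodes))
        for item in pending:          # re-queue jobs that did not fit
            heapq.heappush(self.queue, item)
        return started
```

For example, on a 4-node pool, a high-priority 2-node job starts first; a 3-node job then cannot fit, but a 2-node job behind it can still be backfilled into the remaining nodes.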
Advantages of HPC Clusters