Domain Knowledge
This section introduces the domain knowledge you may encounter in this lecture:
- Introduction to the HPC Cluster
- Homogeneous and Heterogeneous
- What is PVE
- Hands-on (demo)
Introduction to the HPC Cluster
An HPC (High-Performance Computing) cluster is a group of networked computers that collaborate to perform tasks requiring significant computational power. These clusters are commonly used in fields such as scientific research, engineering simulations, big data analysis, weather forecasting, and artificial intelligence training—tasks that a single ordinary computer cannot efficiently complete.
Basic Components of an HPC Cluster
- Nodes
- An HPC cluster consists of multiple compute nodes, each typically an independent server.
- Nodes are categorized into different types:
- Compute Nodes
- Responsible for executing the actual computational tasks.
- Head/Control Node
- Used for task scheduling, resource management, and user interaction.
- Storage Nodes
- Handle data storage and management.
- Login Nodes
- Serve as the entry point for users to log in and submit tasks.
- High-Speed Interconnect
- Nodes are connected via a high-speed network (e.g., InfiniBand, Ethernet, or fiber optic networks) to enable low-latency and high-throughput data transmission.
- Common network topologies include Fat Tree, Torus, or Mesh.
<aside>
ADDITION: "Introduction to InfiniBand™" (network.nvidia.com)
</aside>
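Interconnect performance is often summarized by two numbers: per-message latency and sustained bandwidth. A minimal sketch of the classic latency-plus-bandwidth (alpha-beta) cost model is below; the link parameters used in the example are illustrative assumptions, not vendor specifications.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to move one message across the interconnect:
    fixed startup latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Compare an InfiniBand-class link to a commodity Ethernet link for a
# 1 MiB message (both parameter sets are rough assumptions).
ib = transfer_time(2**20, latency_s=1e-6, bandwidth_bytes_per_s=25e9)
eth = transfer_time(2**20, latency_s=50e-6, bandwidth_bytes_per_s=1.25e9)
```

The model shows why latency matters for small messages (the fixed term dominates) and bandwidth for large ones, which is exactly the trade-off low-latency fabrics like InfiniBand target.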
- Storage System
- HPC clusters are typically equipped with parallel file systems (e.g., Lustre, GPFS, or BeeGFS) to support large-scale data access.
- Storage can be local disks, Network-Attached Storage (NAS), or Storage Area Network (SAN).
<aside>
ADDITION: Apply the concept of HPC to companies related to data storage
</aside>
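Parallel file systems such as Lustre achieve high throughput by striping a file across several storage targets, so different clients can read different chunks from different servers at once. A minimal sketch of round-robin stripe placement, under assumed chunk and target counts (the function name is hypothetical, not a real Lustre API):

```python
def stripe_map(file_size, stripe_size, n_targets):
    """Map each stripe-sized chunk of a file to a storage target,
    round-robin. Returns a list of (byte_offset, target_index)."""
    n_chunks = -(-file_size // stripe_size)  # ceiling division
    return [(i * stripe_size, i % n_targets) for i in range(n_chunks)]

# A 10-byte file with 4-byte stripes over 3 targets: each chunk lands
# on a different target, so all three can be read in parallel.
layout = stripe_map(10, 4, 3)
```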
- Software Stack
- Job Scheduler: Tools like SLURM, PBS, or LSF are used to allocate resources and manage task queues.
- Parallel Programming Tools: Such as MPI (Message Passing Interface) or OpenMP, which enable programs to run in parallel across multiple nodes.
- Operating System: Usually Linux-based, due to its open-source nature and efficiency.
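MPI and OpenMP are the standard tools on real clusters; as a single-machine analogy (not MPI itself), Python's `multiprocessing` module shows the same split-compute-combine pattern that parallel programs follow:

```python
from multiprocessing import Pool

def partial_sum(bounds):
    """Work done by one worker: sum its assigned sub-range."""
    lo, hi = bounds
    return sum(range(lo, hi))

def parallel_sum(n, workers=4):
    """Split [0, n) into `workers` ranges, compute each partial sum in
    its own process, then combine the results (a reduce step, analogous
    to MPI_Reduce)."""
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

On a cluster, the same decomposition would place each worker on a different node and exchange partial results over the interconnect via MPI.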
How an HPC Cluster Works
1. Users submit tasks (referred to as “jobs”) via the login node.
2. The job scheduler assigns tasks to compute nodes based on resource availability and priority.
3. Compute nodes work collaboratively, processing data through parallel computing, and then return the results to the user or storage system.
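The scheduling step above can be sketched as a priority queue over pending jobs: a toy model only, with hypothetical names (real schedulers like SLURM also handle time limits, fairness, and preemption).

```python
import heapq

class Scheduler:
    """Toy job scheduler: higher-priority jobs start first; jobs that
    do not fit stay queued, while smaller jobs behind them may still
    start (a simple form of backfill)."""

    def __init__(self, free_nodes):
        self.free = free_nodes
        self.queue = []   # min-heap of (-priority, submit_order, name, nodes)
        self.seq = 0      # submit order breaks ties within a priority

    def submit(self, name, nodes, priority=0):
        heapq.heappush(self.queue, (-priority, self.seq, name, nodes))
        self.seq += 1

    def dispatch(self):
        """Start every queued job that fits in the free nodes."""
        started, pending = [], []
        while self.queue:
            prio, seq, name, nodes = heapq.heappop(self.queue)
            if nodes <= self.free:
                self.free -= nodes
                started.append(name)
            else:
                pending.append((prio, seq, name, nodes))
        for item in pending:          # re-queue jobs that did not fit
            heapq.heappush(self.queue, item)
        return started
```

For example, on a 4-node pool, a high-priority 2-node job starts first; a 3-node job then cannot fit, but a 2-node job behind it can still be backfilled into the remaining nodes.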
Advantages of HPC Clusters