NVIDIA HPC SDK
Last Updated: December 6, 2024

Module
Module name | Version |
---|---|
nvidia | 23.9 (default) |
nvidia | 24.11 |
Refer to the following page for the usage of modules:
Module usage
Overview
NVIDIA HPC SDK (hereafter HPC SDK) is a development environment (a collection of compilers and libraries) provided by NVIDIA.
In addition to the version installed on Genkai, you can also download the HPC SDK from the official web page and install it yourself. If you want to try newer compilers and libraries, please install and use the latest version yourself. (Some programs and libraries may require system-wide installation by an administrator.)
For detailed information on the HPC SDK and the latest documentation, please check NVIDIA’s website.
Preparation for Use
To use the HPC SDK, you need to load the module beforehand. When a program built with the compilers and libraries included in the HPC SDK is executed as a job, the same module must also be loaded in the job script.
Please load the nvidia module (nvidia/23.9 module) first.
If you use MPI, you also need to load the nvhpcx or nvompi module. (These modules become visible in module avail once the nvidia module is loaded.)
If you have no particular preference, we recommend nvhpcx/23.9-cuda12.
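A minimal sketch of the module setup, assuming the default 23.9 versions listed above (check module avail for the exact names available to you):

```bash
# Load the HPC SDK compilers and libraries (23.9 is the default)
module load nvidia/23.9

# Only needed for MPI programs: an MPI module that becomes visible
# after the nvidia module has been loaded
module load nvhpcx/23.9-cuda12
```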
How to use compilers
HPC SDK includes several compilers for CPU and GPU. The following shows their basic usage.
Types of compilers
The following is the correspondence between compilers provided by HPC SDK and corresponding parallelization methods.
- nvc/nvc++ : C/C++ compiler, available for both CPU and GPU, supporting OpenMP and OpenACC.
- nvfortran : Fortran compiler, available for both CPU and GPU, supporting OpenMP and OpenACC as well as CUDA Fortran.
- nvcc : CUDA C/C++ compiler, the same as the one included in the CUDA Toolkit. (A specific version of the CUDA Toolkit is also bundled with the HPC SDK.)
The compile commands are as follows, depending on whether MPI is used. For nvcc, please refer to the page on how to use CUDA.
MPI usage | Programming language | Compiler (command) |
---|---|---|
without MPI | Fortran | nvfortran |
without MPI | C | nvc |
without MPI | C++ | nvc++ |
with MPI | Fortran | mpifort |
with MPI | C | mpicc |
with MPI | C++ | mpic++ |
The main common compiler options are listed below. For details, specify -help as a compiler option.
Compiler options | Description |
---|---|
-c | This option performs up to creation of an object file. (It does not create the executable file.) |
-o filename | Change the name of the output executable/object file to filename. The default executable file name is a.out. |
-O | Specifies the degree of optimization performed by the compiler, from -O0 (basically no optimization) through -O1, -O2, -O3, and -O4 (the larger the number, the more optimization), plus -O and -Ofast. Note that -O3 and above may be ineffective or even counterproductive depending on the target program. -fast is also available as a basic (and in many cases faster) optimization option. See -help for details on each optimization option. |
-acc | Enables OpenACC directives. |
-mp | Enables OpenMP directives. Use -mp=gpu for GPU offloading. |
-gpu | Adds a detailed specification of the target GPU when using OpenACC or OpenMP. Various options and combinations are available, such as debug, pinned, managed, deepcopy, and fastmath. To generate code for H100 GPUs only, use -gpu=cc90. |
-tp | Specifies the CPU to optimize for. On Genkai, -tp=sapphirerapids (Intel Xeon Sapphire Rapids) or -tp=native (the same CPU as the compilation environment) is recommended. |
-Minfo | Makes the compiler output detailed information about the optimizations it performs. If you only want information about GPU optimization, -Minfo=accel is recommended. |
Some compile examples are shown below.
- Compile an OpenMP program (for CPU, without MPI)
- Compile an OpenMP program (for GPU, without MPI)
- Compile an OpenACC program (for GPU, without MPI)
- Compile an MPI + OpenMP program (for CPU)
- Compile an MPI + OpenMP program (for GPU)
- Compile an MPI + OpenACC program (for GPU)
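The corresponding commands might look as follows; this is a minimal sketch, and the source file names (sample.c, sample.f90, and so on) are placeholders:

```bash
# OpenMP program (CPU, no MPI)
nvc -O2 -mp sample.c
nvfortran -O2 -mp sample.f90

# OpenMP program offloaded to GPU (no MPI)
nvc -O2 -mp=gpu -gpu=cc90 -Minfo=accel sample.c

# OpenACC program (GPU, no MPI)
nvfortran -O2 -acc -gpu=cc90 -Minfo=accel sample.f90

# MPI + OpenMP program (CPU)
mpicc -O2 -mp sample_mpi.c
mpifort -O2 -mp sample_mpi.f90

# MPI + OpenMP program (GPU)
mpicc -O2 -mp=gpu -gpu=cc90 -Minfo=accel sample_mpi.c

# MPI + OpenACC program (GPU)
mpifort -O2 -acc -gpu=cc90 -Minfo=accel sample_mpi.f90
```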
How to run a batch job
Here is an example of a batch job script that executes a program using HPC SDK.
CPU program using OpenMP (example of CPU execution in one node)
The following is an example of executing an OpenMP parallel program using 4 CPU cores in node group A.
The OMP_NUM_THREADS environment variable specifies the number of threads to run.
It is strongly recommended that the number of threads not exceed the number of allocated cores.
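A minimal sketch of such a job script is shown below. The resource group name (a-batch) and the option used to request 4 cores are assumptions; check the batch job documentation for the exact forms.

```bash
#!/bin/bash
#PJM -L rscgrp=a-batch     # resource group for node group A (placeholder name)
#PJM -L vnode-core=4       # request 4 CPU cores (assumed option name)
#PJM -L elapse=1:00:00     # elapsed-time limit (adjust as needed)
#PJM -j

module load nvidia/23.9

# Run with 4 OpenMP threads, matching the number of allocated cores
export OMP_NUM_THREADS=4
./a.out
```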
OpenACC program or GPU program with OpenMP (example of one sub-GPU execution)
The following is an example of running a GPU-parallelized program (OpenACC or OpenMP) using one sub-GPU of node group B.
When running an OpenACC program, the environment variables NVCOMPILER_ACC_NOTIFY or NVCOMPILER_ACC_TIME can be set so that it is easy to confirm that OpenACC is actually performing the computation on the GPU.
Since OpenMP has no such environment variable, it is recommended to check with a profiler (see below).
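A minimal sketch of such a job script, assuming a placeholder resource group name (b-batch) for node group B:

```bash
#!/bin/bash
#PJM -L rscgrp=b-batch     # resource group for node group B (placeholder name)
#PJM -L gpu=1              # one (sub-)GPU
#PJM -L elapse=1:00:00
#PJM -j

module load nvidia/23.9

# Print a message whenever OpenACC launches a kernel or transfers data,
# which confirms that the GPU is actually being used
export NVCOMPILER_ACC_NOTIFY=1
# Alternatively, NVCOMPILER_ACC_TIME=1 prints a timing summary at the end
#export NVCOMPILER_ACC_TIME=1

./a.out
```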
CPU program using MPI + OpenMP (example of 8 processes running in one node)
The following is an example of running an MPI+OpenMP hybrid parallel program using one full node of node group A. Each node of node group A has two sockets with 60-core CPUs, and the cores and memory on each socket are divided into four sub-NUMA domains, so one process is assigned to each sub-NUMA domain and 15 threads are run per process.
A separate explanation page will be created on how to optimally allocate processes and threads.
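A minimal sketch of such a job script (the resource group name is a placeholder, and process/thread binding options are omitted since optimal placement will be covered on a separate page):

```bash
#!/bin/bash
#PJM -L rscgrp=a-batch     # resource group for node group A (placeholder name)
#PJM -L node=1             # one full node: 2 sockets x 60 cores = 120 cores
#PJM -L elapse=1:00:00
#PJM -j

module load nvidia/23.9 nvhpcx/23.9-cuda12

# 8 processes (one per sub-NUMA domain) x 15 threads = 120 cores
export OMP_NUM_THREADS=15
mpirun -n 8 ./a.out
```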
CPU program using MPI + OpenMP (example of 2 nodes x 8 processes per node)
The above script for one-node execution can easily be extended to multi-node execution by increasing the number of nodes specified by -L node=1 and the total number of processes specified by -n in mpirun accordingly.
With this method, processes are placed on one node until it is filled, and then on the next node.
(It is not possible to place the first process on each node first, then the second process on each node, and so on.)
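A minimal sketch of the two-node variant; only the node count and the total process count change compared with the one-node script:

```bash
#!/bin/bash
#PJM -L rscgrp=a-batch     # placeholder resource group name
#PJM -L node=2             # two nodes instead of one
#PJM -L elapse=1:00:00
#PJM -j

module load nvidia/23.9 nvhpcx/23.9-cuda12

export OMP_NUM_THREADS=15
# 8 processes per node x 2 nodes = 16 processes in total
mpirun -n 16 ./a.out
```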
MPI + GPU program using OpenACC or OpenMP (example of 4 GPU execution in one node)
There are several ways to write and run programs that use multiple GPUs, but the approach that seems easiest to understand and use is to take a program that drives one GPU with OpenACC or OpenMP and parallelize it with MPI, mapping one MPI process to one GPU. Each node in node group B has 2 CPUs and 4 GPUs, and each CPU is divided into 2 sub-NUMA domains, so it is appropriate to assign 1 sub-NUMA domain and 1 GPU to each process. Each node in node group C has 2 CPUs and 8 GPUs, and each CPU is divided into 4 sub-NUMA domains, so it is likewise appropriate to assign 1 sub-NUMA domain and 1 GPU per process. (Note that the number of CPU cores per sub-NUMA domain varies by node group.)
The following is an example of job execution using 4 GPUs on one node of node group B.
In job_openacc2.sh, the PJM options request one full node (node=1), and mpirun starts one process per sub-NUMA domain, 4 processes in total.
In the job script, nvidia-smi -L outputs the GPU IDs, and only the information needed to identify the GPUs to be used is written to a text file.
In run2.sh, a serial number (ID) is generated from the Open MPI rank number, and the GPU to be used is selected based on it before the program is started.
This allows each MPI process to use a different GPU.
Note that run2.sh must have execute permission (e.g. chmod u+x ./run2.sh).
(Because information is written to and read from a file called gpu.txt, problems may occur if multiple programs (jobs) run simultaneously in the same directory. If necessary, create a separate subdirectory for each job and run it there.)
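The scripts below are a minimal sketch of this scheme, not the exact scripts used on Genkai: the resource group name is a placeholder, and the way the GPU identifiers are extracted from the nvidia-smi -L output is an assumption.

job_openacc2.sh:

```bash
#!/bin/bash
#PJM -L rscgrp=b-batch     # resource group for node group B (placeholder name)
#PJM -L node=1             # one full node: 4 GPUs, 4 sub-NUMA domains
#PJM -L elapse=1:00:00
#PJM -j

module load nvidia/23.9 nvhpcx/23.9-cuda12

# Write only the GPU identifiers (UUIDs) to gpu.txt, one per line
# (assumes the usual "GPU n: ... (UUID: ...)" format of nvidia-smi -L)
nvidia-smi -L | sed -e 's/.*(UUID: \(.*\))/\1/' > gpu.txt

# 4 processes, one per sub-NUMA domain; each process is wrapped by run2.sh
mpirun -n 4 ./run2.sh ./a.out
```

run2.sh:

```bash
#!/bin/bash
# Generate an ID from the Open MPI rank (the local rank on the node,
# so the same script also works for multi-node runs)
ID=${OMPI_COMM_WORLD_LOCAL_RANK}
# Pick the (ID+1)-th GPU from gpu.txt and make only that GPU visible
export CUDA_VISIBLE_DEVICES=$(sed -n "$((ID+1))p" gpu.txt)
exec "$@"
```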
If you set the number of nodes at the beginning of the job script and the number of processes in mpirun (-n option) appropriately, you can also run the job on multiple nodes.
If you want to use only some of the GPUs in a node, change -L node=1 to the number of GPUs you want to use, e.g. -L gpu=2, and adjust the process settings of mpirun accordingly.
(Multiple) sub-GPU program with MIG
The H100 GPUs in node group B can each be divided into up to 7 sub-GPUs by the MIG function. In Genkai, the GPUs on some nodes are divided into 7 sub-GPUs each, so a total of 28 sub-GPUs are visible across the 4 GPUs of a node. Each sub-GPU has lower computing performance and less memory capacity than a full GPU, but is otherwise fully functional, so please make effective use of them for debugging and other purposes.
Note that the point consumption per sub-GPU is 1/7 of a full GPU, and that the theoretical computing performance and memory capacity of a sub-GPU are slightly less than 1/7 of a full GPU.
When executing a program using sub-GPUs, specify b-batch-mig or b-inter-mig as the resource group.
The quantity specified with gpu= is the number of sub-GPUs, not full GPUs.
There are 28 sub-GPUs per node, so you can specify a maximum of 28.
Note that when multiple sub-GPUs are requested, there is no guarantee that they will be allocated on the same physical GPU; depending on resource availability, sub-GPUs on different physical GPUs may be allocated even if 7 or fewer are requested.
The following is an example of an interactive job with 4 sub-GPUs in which sub-GPUs on different physical GPUs are allocated. In this example, the CPU cores were allocated on the same socket, but depending on node utilization they may be allocated across sockets.
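A minimal sketch of how such an interactive job might be submitted and checked; the exact pjsub option syntax is an assumption, so adjust it to the form shown in the batch job documentation:

```bash
# Request an interactive job with 4 sub-GPUs in the MIG resource group
pjsub --interact -L rscgrp=b-inter-mig -L gpu=4

# Inside the interactive job: list the allocated (sub-)GPUs.
# The MIG devices listed may belong to different physical GPUs.
nvidia-smi -L
```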
How to use the libraries
The HPC SDK includes various libraries.
Libraries included in Version 23.9
- cuBLAS
- cuBLASMp
- cuTENSOR
- cuSOLVER
- cuSOLVERMp
- cuFFT
- cuFFTMp
- cuRAND
- NCCL
- NVSHMEM
When using them, it is necessary to refer to header files or add compile options as needed. Some examples are shown below, but please refer to the official documents and other sources for details.
Example of using cuBLAS
To use cuBLAS from CUDA C, you need to include cublas_v2.h and link cuBLAS (add the -lcublas option to the nvcc compiler).
Similarly, when using cuBLAS from C + OpenACC/OpenMP (GPU), you need to include cublas_v2.h and add the -lcublas option to the nvc compiler.
However, with this alone the library file (libcublas.so) to be linked by -lcublas will not be found and an error will occur.
To solve this, also load the cuda module or pass the path to libcublas.so in the HPC SDK directory with the -L option (specify -L/home/app/nvhpc/23.9/Linux_x86_64/23.9/math_libs/lib64/).
To use cuBLAS from CUDA Fortran, you need to write use cublas and link cuBLAS (add the -cudalib=cublas option to the nvfortran compiler).
To use cuBLAS from Fortran + OpenACC/OpenMP (GPU), add the same option.
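A minimal sketch of the compile commands described above; the source file names are placeholders, and the -L path is the 23.9 installation path given above:

```bash
# CUDA C + cuBLAS
nvcc sample.cu -lcublas

# C + OpenACC/OpenMP(GPU) + cuBLAS: give the library path explicitly
# (or load the cuda module instead of using -L)
nvc -acc sample.c -lcublas \
    -L/home/app/nvhpc/23.9/Linux_x86_64/23.9/math_libs/lib64/

# CUDA Fortran + cuBLAS (the source contains "use cublas")
nvfortran sample.cuf -cudalib=cublas

# Fortran + OpenACC/OpenMP(GPU) + cuBLAS
nvfortran -acc sample.f90 -cudalib=cublas
```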
How to use other tools
In addition to the above, the HPC SDK includes several other tools such as a debugger and profiler. Here we briefly introduce how to use the debugger and profiler, which will be useful for many Genkai users.
How to use the debugger
You can use cuda-gdb to debug GPU kernels in much the same way as gdb.
To use cuda-gdb, first compile your program with debug options.
With nvc or nvfortran, use -g or -gopt (-g disables optimization, -gopt does not); with nvcc, use -g and -G (-g enables debug information for host code, -G for device code).
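A minimal sketch of compiling with debug information and starting the debugger; the program names are placeholders:

```bash
# nvc / nvfortran: -g disables optimization, -gopt keeps it enabled
nvfortran -g -acc sample.f90 -o sample

# nvcc: -g for host code, -G for device code
nvcc -g -G sample.cu -o sample

# Start the debugger (used in much the same way as gdb)
cuda-gdb ./sample
```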
cuda-gdb supports per-thread tracing and other GPU-oriented features. Please refer to the documentation for details.
How to use the profiler
NVIDIA Nsight Systems and NVIDIA Nsight Compute are available as performance analysis tools (profilers). Nsight Systems is useful for analyzing the overall performance of a program, while Nsight Compute provides detailed insight into individual GPU kernels.
In the following, only basic usage is introduced. For more details, please refer to the manual or other documentation.
How to use Nsight Systems
Information is collected by the nsys program.
Execution example (in job script)
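A minimal sketch of the profiling command inside a job script, assuming the batch system exposes the job ID in the PJM_JOBID environment variable:

```bash
# Collect a system-wide profile; the report is written to out_<jobid>.nsys-rep
nsys profile -o out_${PJM_JOBID} --stats=true ./a.out
```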
The -o option specifies the name of the file to which the profile results are written.
In the above example, the file name is based on the job ID that the batch job system sets for each job; in practice, a file such as out_82448.nsys-rep is created.
Open the generated nsys-rep file with Nsight Systems (GUI program) to view the program information.
Nsight Systems can also be started on login nodes with the command nsys-ui (X forwarding is required).
However, since clients for Windows, Linux, and macOS are distributed, it is often more convenient to download the nsys-rep file and view it on your local PC.
The --stats=true option is optional; with it, the profile results are also displayed as text at the end of the program execution.
Note that if there is a lot of information to collect, the job execution time will increase accordingly.
How to use Nsight Compute
Information is collected by the ncu program.
Execution example (in job script)
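A minimal sketch of the profiling command inside a job script, again assuming the job ID is available in PJM_JOBID:

```bash
# Collect per-kernel metrics; the report is written to out_<jobid>.ncu-rep
ncu -o out_${PJM_JOBID} ./a.out
```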
The -o option specifies the name of the file to which the profile results are written.
In the above example, the file name is based on the job ID that the batch job system sets for each job; in practice, a file such as out_82449.ncu-rep is created.
Open the generated ncu-rep file with Nsight Compute (GUI program) to view the program information.
Nsight Compute can also be started on login nodes with the command ncu-ui (X forwarding is required).
However, since clients for Windows, Linux, and macOS are distributed, it is often more convenient to download the ncu-rep file and view it on your local PC.
When running ncu-ui on a login node, the following additional setting is required:
ulimit -v 26000000
(This raises the virtual memory limit. If you get an error message and the program exits, try a larger value, such as 28000000.)