NVIDIA HPC SDK

Last Updated: December 6, 2024

Module

Module name   Version
nvidia        23.9 (default)
              24.11

Refer to the following page for the usage of modules:
Module usage


Overview

NVIDIA HPC SDK (hereafter HPC SDK) is a development environment (a collection of compilers and libraries) provided by NVIDIA.


In addition to the version installed on Genkai, you can also download the HPC SDK from the official website and install and use it yourself. If you want to try newer compilers and libraries, please install and use the latest version yourself. (Some programs and libraries may require system-wide installation by an administrator.)


For detailed information on the HPC SDK and the latest documentation, please check NVIDIA’s website.


Preparation for Use

To use the HPC SDK, you need to load the module beforehand. When executing a program created using the compiler and libraries included in the HPC SDK as a job, the same module must also be loaded in the job script.


Please load the nvidia module (nvidia/23.9 module) first.


If you use MPI, you also need to load the nvhpcx or nvompi module. (These modules become visible in module avail once the nvidia module is loaded.) If you have no particular preference, we recommend nvhpcx/23.9-cuda12; see the example after the listing below.

[ku40000105@genkai0002 ~]$ module avail
------------------------------ /home/modules/modulefiles/LN/core ------------------------------
cuda/11.8.0           gcc-toolset/12(default)  intel/2023.2(default)  nvidia/24.11
cuda/12.2.2(default)  gcc-toolset/13           intel/2024.1
cuda/12.6.1           gcc/8(default)           nvidia/23.9(default)

------------------------------ /home/modules/modulefiles/LN/util ------------------------------
avs/express85(default)           mathematica/14.0(default)
aws_pcluster/3.9.1(default)      matlab/R2024a(default)
awscli/2.16.8(default)           matlab_parallel_server/R2024a(default)
azure_cyclecli/8.6.2(default)    mesa/20.3.5
fieldview/2023(default)          molpro/2024.1.0_mpipr
gaussian/16.C.01(default)        molpro/2024.1.0_sockets
gcloudcli/447.0.0(default)       nastran/2024.1(default)
gcloudcli/502.0.0                node/12.22.12
julia/1.10.3(default)            ocicli/3.42.0(default)
jupyter_notebook/7.2.1(default)  singularity-ce/4.1.3(default)
llvm/7.1.0                       tensorflow-cpu/2.17.0(default)
marc/2024.1(default)
[ku40000105@genkai0002 ~]$ module load nvidia
[ku40000105@genkai0002 ~]$ module avail
---------------------- /home/modules/modulefiles/LN/compiler/nvidia/23.9 ----------------------
fftw/3.3.10(default)  netcdf-cxx/4.3.1(default)      nvhpcx/23.9
hdf5/1.14.4(default)  netcdf-fortran/4.6.1(default)  nvhpcx/23.9-cuda12
hpcx/2.17.1(default)  netcdf/4.9.2(default)          nvompi/23.9

------------------------------ /home/modules/modulefiles/LN/core ------------------------------
cuda/11.8.0           gcc-toolset/12(default)  intel/2023.2(default)  nvidia/24.11
cuda/12.2.2(default)  gcc-toolset/13           intel/2024.1
cuda/12.6.1           gcc/8(default)           nvidia/23.9(default)

------------------------------ /home/modules/modulefiles/LN/util ------------------------------
avs/express85(default)           mathematica/14.0(default)
aws_pcluster/3.9.1(default)      matlab/R2024a(default)
awscli/2.16.8(default)           matlab_parallel_server/R2024a(default)
azure_cyclecli/8.6.2(default)    mesa/20.3.5
fieldview/2023(default)          molpro/2024.1.0_mpipr
gaussian/16.C.01(default)        molpro/2024.1.0_sockets
gcloudcli/447.0.0(default)       nastran/2024.1(default)
gcloudcli/502.0.0                node/12.22.12
julia/1.10.3(default)            ocicli/3.42.0(default)
jupyter_notebook/7.2.1(default)  singularity-ce/4.1.3(default)
llvm/7.1.0                       tensorflow-cpu/2.17.0(default)
marc/2024.1(default)
[ku40000105@genkai0002 ~]$
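
For MPI use, the modules can then be loaded as follows. This is a minimal sketch based on the recommendation above; adjust the versions to your needs.

$ module load nvidia/23.9
$ module load nvhpcx/23.9-cuda12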

How to use compilers

HPC SDK includes several compilers for CPU and GPU. The following shows their basic usage.


Types of compilers

The compilers provided by the HPC SDK and the parallelization methods they support are as follows.

  • nvc/nvc++ : C/C++ compiler, available for both CPU and GPU, supporting OpenMP and OpenACC.
  • nvfortran : Fortran compiler, available for both CPU and GPU, supporting OpenMP and OpenACC as well as CUDA Fortran.
  • nvcc : The CUDA C/C++ compiler, identical to the one included in the CUDA Toolkit. (A specific version of the CUDA Toolkit is also bundled with the HPC SDK.)

The compile commands to use are as follows, depending on whether MPI is used. For nvcc, please refer to the page on how to use CUDA.


              Programming language   Compiler (command)
without MPI   Fortran                nvfortran
              C                      nvc
              C++                    nvc++
with MPI      Fortran                mpifort
              C                      mpicc
              C++                    mpic++

The following are the main common compilation options available. For details, specify -help as a compiler option.


Compiler option   Description
-c   Compiles up to the creation of an object file. (It does not create an executable file.)
-o filename   Changes the name of the output executable/object file to filename. The default executable file name is a.out.
-O   Specifies the degree of optimization performed by the compiler. The levels range from -O0 (essentially no optimization) through -O1, -O2, -O3, and -O4, where a larger number means more optimization; -O and -Ofast are also available. Note that -O3 and above may be ineffective or even counterproductive depending on the target program. In addition, -fast is available as a basic (and in many cases faster) optimization option. See -help for details on each optimization option.
-acc   Enables OpenACC directives.
-mp   Enables OpenMP directives. Use -mp=gpu for GPU offloading.
-gpu   Adds a detailed specification of the target GPU when using OpenACC or OpenMP. Various options and combinations are available, such as debug, pinned, managed, deepcopy, and fastmath. To generate code for the H100 GPU only, specify cc90 (i.e. -gpu=cc90).
-tp   Specifies the target CPU for optimization. On Genkai, -tp=sapphirerapids (Intel Xeon Sapphire Rapids) or -tp=native (the same CPU as the build environment) is recommended.
-Minfo   Makes the compiler output detailed information about the optimizations it performs. To see only information about GPU optimization, -Minfo=accel is recommended.

Some compile examples are shown below.


  • Compile an OpenMP program (for CPU, without MPI)
$ nvc -Minfo -fast -mp -tp=native -o openmp1_c openmp1.c
$ nvc++ -Minfo -fast -mp -tp=native -o openmp1_cpp openmp1.cpp
$ nvfortran -Minfo -fast -mp -tp=native -o openmp1_f openmp1.f90
  • Compile an OpenMP program (for GPU, without MPI)
$ nvc -Minfo -fast -mp=gpu -gpu=pinned,cc90 -tp=native -o openmp2_c openmp2.c
$ nvc++ -Minfo -fast -mp=gpu -gpu=pinned,cc90 -tp=native -o openmp2_cpp openmp2.cpp
$ nvfortran -Minfo -fast -mp=gpu -gpu=pinned,cc90 -tp=native -o openmp2_f openmp2.f90
  • Compile an OpenACC program (for GPU, without MPI)
$ nvc -Minfo -fast -acc -gpu=pinned,cc90 -tp=native -o openacc1_c openacc1.c
$ nvc++ -Minfo -fast -acc -gpu=pinned,cc90 -tp=native -o openacc1_cpp openacc1.cpp
$ nvfortran -Minfo -fast -acc -gpu=pinned,cc90 -tp=native -o openacc1_f openacc1.f90
  • Compile an MPI + OpenMP program (for CPU)
$ mpicc -Minfo -fast -mp -tp=native -o mpi_openmp1_c mpi_openmp1.c
$ mpic++ -Minfo -fast -mp -tp=native -o mpi_openmp1_cpp mpi_openmp1.cpp
$ mpifort -Minfo -fast -mp -tp=native -o mpi_openmp1_f mpi_openmp1.f90
  • Compile an MPI + OpenMP program (for GPU)
$ mpicc -Minfo -fast -mp=gpu -gpu=pinned,cc90 -tp=native -o mpi_openmp2_c mpi_openmp2.c
$ mpic++ -Minfo -fast -mp=gpu -gpu=pinned,cc90 -tp=native -o mpi_openmp2_cpp mpi_openmp2.cpp
$ mpifort -Minfo -fast -mp=gpu -gpu=pinned,cc90 -tp=native -o mpi_openmp2_f mpi_openmp2.f90
  • Compile an MPI + OpenACC program (for GPU)
$ mpicc -Minfo -fast -acc -gpu=pinned,cc90 -tp=native -o mpi_openacc1_c mpi_openacc1.c
$ mpic++ -Minfo -fast -acc -gpu=pinned,cc90 -tp=native -o mpi_openacc1_cpp mpi_openacc1.cpp
$ mpifort -Minfo -fast -acc -gpu=pinned,cc90 -tp=native -o mpi_openacc1_f mpi_openacc1.f90

How to run a batch job

Here are examples of batch job scripts that execute programs built with the HPC SDK.


CPU program using OpenMP (example of CPU execution in one node)

$ cat job_openmp1.sh
#!/bin/bash
#PJM -L rscgrp=a-batch
#PJM -L elapse=10:00
#PJM -L vnode-core=4
#PJM -j

module load nvidia/23.9
export OMP_NUM_THREADS=4
./a.out

The above is an example of executing an OpenMP parallel program using 4 CPU cores in node group A. The OMP_NUM_THREADS environment variable specifies the number of threads. It is strongly recommended that the number of threads not exceed the number of allocated cores.


OpenACC program or GPU program with OpenMP (example of one sub-GPU execution)

$ cat job_openacc1.sh
#!/bin/bash
#PJM -L rscgrp=b-batch-mig
#PJM -L elapse=10:00
#PJM -L gpu=1
#PJM -j

module load nvidia/23.9
./a.out

This is an example of executing a GPU-parallelized program (OpenACC or OpenMP) using one sub-GPU of node group B.


When running an OpenACC program, setting the environment variable NVCOMPILER_ACC_NOTIFY or NVCOMPILER_ACC_TIME makes it easy to confirm that OpenACC is actually performing computation on the GPU. Since OpenMP has no such environment variable, it is recommended to check with a profiler (see below).
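
For example, the job script above could be extended as follows. This is a minimal sketch: NVCOMPILER_ACC_NOTIFY=1 prints a message for each GPU kernel launch, and NVCOMPILER_ACC_TIME=1 prints an accelerator timing summary when the program exits.

module load nvidia/23.9
# Print a message for each GPU kernel launch
export NVCOMPILER_ACC_NOTIFY=1
# Alternatively, print a timing summary of GPU regions at program exit
# export NVCOMPILER_ACC_TIME=1
./a.out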


CPU program using MPI + OpenMP (example of 8 processes running in one node)

$ cat job_mpi_openmp1.sh
#!/bin/bash
#PJM -L rscgrp=a-batch
#PJM -L elapse=10:00
#PJM -L node=1
#PJM -j

module load nvidia/23.9
module load nvhpcx/23.9
export OMP_NUM_THREADS=15
mpirun -n 8 --map-by ppr:1:numa --bind-to numa numactl -l ./mpi_openmp.out

This is an example of running an MPI + OpenMP hybrid parallel program using one full node of node group A. Each node in node group A has two 60-core CPUs, and the cores and memory of each CPU are divided into four sub-NUMA domains, so one process is assigned to each sub-NUMA domain and 15 threads are run per process.


A separate explanation page will be created on how to optimally allocate processes and threads.


CPU program using MPI + OpenMP (example of 2 nodes x 8 processes per node)

$ cat job_mpi_openmp2.sh
#!/bin/bash
#PJM -L rscgrp=a-batch
#PJM -L elapse=10:00
#PJM -L node=2
#PJM -j

module load nvidia/23.9
module load nvhpcx/23.9
export OMP_NUM_THREADS=15
mpirun -n 16 --map-by ppr:1:numa --bind-to numa numactl -l ./mpi_openmp.out

The above one-node script can easily be extended to multi-node execution by increasing the number of nodes specified by -L node= and the total number of processes specified by -n in mpirun accordingly. With this method, processes fill one node before being placed on the next node. (It is not possible to place the first process on each node first, then the second process on each node, and so on.)


MPI + GPU program using OpenACC or OpenMP (example of 4 GPU execution in one node)

There are several ways to write and run programs that use multiple GPUs, but the approach that is probably easiest to understand and use is to parallelize a program that drives one GPU with OpenACC or OpenMP using MPI, mapping one MPI process to one GPU. Each node in node group B has 2 CPUs and 4 GPUs, and each CPU is divided into 2 sub-NUMA domains, so it is appropriate to assign 1 sub-NUMA domain and 1 GPU to each process. Each node in node group C has 2 CPUs and 8 GPUs, and each CPU is divided into 4 sub-NUMA domains, so again 1 sub-NUMA domain and 1 GPU per process is appropriate. (Note that the number of CPU cores per sub-NUMA domain differs between node groups.)


The following is an example of a job using 4 GPUs on one node in node group B. In job_openacc2.sh, the PJM options request one full node (node=1), and mpirun starts one process per sub-NUMA domain, 4 processes in total. On line 13, nvidia-smi -L outputs the IDs of the GPUs, and only the information needed to identify the GPUs to be used is written to a text file. In run2.sh, a serial number (ID) is derived from the Open MPI local rank, and the GPU to be used is selected based on it before the program is started. This lets each MPI process use a different GPU. Note that run2.sh must have execute permission (e.g. chmod u+x ./run2.sh).


(Because information is written to and read from a file called gpu.txt, problems may occur if multiple programs (jobs) run simultaneously in the same directory. If necessary, create a separate subdirectory for each job and run it there.)

$ cat -n job_openacc2.sh
 1  #!/bin/bash
 2  #PJM -L rscgrp=b-batch
 3  #PJM -L elapse=10:00
 4  #PJM -L node=1
 5  #PJM -j
 6  #PJM -S
 7
 8  numactl -s
 9  numactl -H
10
11  module load nvidia/23.9
12  module load nvhpcx/23.9
13  nvidia-smi -L | grep GPU | awk '{print $6}' | sed 's/)//' 2>&1 | tee gpu.txt
14  mpirun -n 4 --map-by ppr:1:numa --bind-to numa numactl -l ./run2.sh ./mpi_openacc.out
$ cat -n run2.sh
 1  #!/bin/bash
 2  ID=$(( ${OMPI_COMM_WORLD_LOCAL_RANK} + 1 ))
 3  GPU=`head -n ${ID} ./gpu.txt | tail -n 1`
 4  export CUDA_VISIBLE_DEVICES=${GPU}
 5  $@

If you set the number of nodes at the beginning of the job script and the number of processes in mpirun (-n option) appropriately, you can run the job on multiple nodes as well.


If you want to use only some of the GPUs in a node, change -L node=1 to the number of GPUs you want to use, e.g. -L gpu=2, and adjust the process settings of mpirun accordingly.
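
As an illustrative sketch based on the description above (not a tested script), a 2-GPU variant would change only these parts of job_openacc2.sh:

#PJM -L rscgrp=b-batch
#PJM -L gpu=2

mpirun -n 2 --map-by ppr:1:numa --bind-to numa numactl -l ./run2.sh ./mpi_openacc.out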


(Multiple) sub-GPU program with MIG

The H100 GPUs in node group B can each be divided into up to 7 sub-GPUs by the MIG function. On Genkai, the GPUs on some nodes are divided into 7 sub-GPUs each, so a total of 28 sub-GPUs are visible across the 4 GPUs. Each sub-GPU has lower computing performance and a smaller memory capacity than a full GPU, but otherwise functions as a normal GPU; please make effective use of them for debugging and similar purposes.


Note that the point consumption per sub-GPU is 1/7 of that of a full GPU, while the theoretical computing performance and memory capacity of a sub-GPU are slightly less than 1/7 of those of a full GPU.

When executing a program that uses sub-GPUs, specify b-batch-mig or b-inter-mig as the resource group.
The quantity specified with gpu= is the number of sub-GPUs, not full GPUs; there are 28 sub-GPUs per node, so you can specify up to 28. When multiple sub-GPUs are requested, there is no guarantee that they will be allocated on the same physical GPU: depending on resource availability, sub-GPUs on different physical GPUs may be allocated even if 7 or fewer are requested.


The following is an example of an interactive job requesting 4 sub-GPUs, in which sub-GPUs on different physical GPUs were allocated. In this example, the CPU cores are all allocated on the same socket, but depending on node utilization, they may be allocated across sockets.

[ku40000105@genkai0002 ~]$ pjsub --interact -L rscgrp=b-inter-mig,elapse=10:00,gpu=4
[INFO] PJM 0000 pjsub Job 84169 submitted.
[INFO] PJM 0081 .connected.
[INFO] PJM 0082 pjsub Interactive job 84169 started.
[ku40000105@b0036 ~]$ nvidia-smi -L
GPU 0: NVIDIA H100 (UUID: GPU-c2b4960d-78ab-1043-5478-8454fa020a11)
  MIG 1g.12gb     Device  0: (UUID: MIG-835ebdbf-f00e-52b2-9adc-37cc5051f230)
  MIG 1g.12gb     Device  1: (UUID: MIG-ba0d9ecb-6ecb-5f02-b6ed-93253b38dc50)
  MIG 1g.12gb     Device  2: (UUID: MIG-0e23b49e-b8ea-5c54-bc08-54d94d7480ea)
GPU 1: NVIDIA H100 (UUID: GPU-7721b910-c226-83a7-13a2-b4951ea9af7c)
  MIG 1g.12gb     Device  0: (UUID: MIG-4e9bed12-d011-5bf4-9a59-3f5fa51fdb22)
[ku40000105@b0036 ~]$ numactl -s
policy: default
preferred node: current
physcpubind: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
cpubind: 1
nodebind: 1
membind: 0 1

How to use the libraries

The HPC SDK includes various libraries.


Libraries included in Version 23.9

  • cuBLAS
  • cuBLASMp
  • cuTENSOR
  • cuSOLVER
  • cuSOLVERMp
  • cuFFT
  • cuFFTMp
  • cuRAND
  • NCCL
  • NVSHMEM

When using them, you may need to include the appropriate header files or add compile options. Some examples are shown below; please refer to the official documentation and other sources for details.


Example of using cuBLAS

To use cuBLAS from CUDA C, you need to include cublas_v2.h and link against cuBLAS (add the -lcublas option to the nvcc compiler). Similarly, when using cuBLAS from C + OpenACC/OpenMP (GPU), you need to include cublas_v2.h and add the -lcublas option to the nvc compiler. However, this alone will not allow the library file (libcublas.so) linked by -lcublas to be found, and an error will occur. To solve this, also load the cuda module or give the path to libcublas.so in the HPC SDK directory with the -L option (specify -L/home/app/nvhpc/23.9/Linux_x86_64/23.9/math_libs/lib64/).
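
For example, a C + OpenACC program that calls cuBLAS might be compiled as follows (an illustrative sketch; cublas_test.c is a hypothetical source file):

# Option 1: load the cuda module so libcublas.so can be found
$ module load cuda
$ nvc -Minfo -fast -acc -gpu=cc90 -tp=native -o cublas_test cublas_test.c -lcublas
# Option 2: give the path to libcublas.so in the HPC SDK directory explicitly
$ nvc -Minfo -fast -acc -gpu=cc90 -tp=native -o cublas_test cublas_test.c \
    -L/home/app/nvhpc/23.9/Linux_x86_64/23.9/math_libs/lib64/ -lcublas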


To use cuBLAS from CUDA Fortran, write use cublas and link against cuBLAS (add the -cudalib=cublas option to the nvfortran compiler). To use cuBLAS from Fortran + OpenACC/OpenMP (GPU), add the same option.
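
For example (an illustrative sketch; the source file names are hypothetical):

# CUDA Fortran
$ nvfortran -cuda -o cublas_test_cuf cublas_test.cuf -cudalib=cublas
# Fortran + OpenACC (GPU)
$ nvfortran -Minfo -fast -acc -gpu=cc90 -tp=native -o cublas_test_acc cublas_test_acc.f90 -cudalib=cublas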


How to use other tools

In addition to the above, the HPC SDK includes several other tools such as a debugger and profiler. Here we briefly introduce how to use the debugger and profiler, which will be useful for many Genkai users.


How to use the debugger

You can use cuda-gdb to debug GPU kernels. You can debug GPU kernels in almost the same way as gdb.


When using cuda-gdb, first compile your program with debugging options. With nvc or nvfortran, use -g or -gopt (-g disables optimization, -gopt does not); with nvcc, use -g and -G (-g generates debugging information for host code, -G for device code).

# Run on compute nodes (expected to be used in interactive jobs)
$ nvc -Minfo -gopt -acc -gpu=pinned,cc90 -tp=native -o openacc1_c openacc1.c
# Run cuda-gdb
$ cuda-gdb ./openacc1_c
# Start running program on cuda-gdb
(cuda-gdb) run
# Run and output information if there is a problem with the program

cuda-gdb supports per-thread tracing and other GPU-oriented features. Please refer to the documentation for details.
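
As a rough sketch of a GPU-side debugging session (the commands are standard cuda-gdb commands; the kernel and variable names are hypothetical):

(cuda-gdb) break my_kernel            # set a breakpoint at a GPU kernel (hypothetical name)
(cuda-gdb) run
(cuda-gdb) info cuda kernels          # list the kernels currently running on the GPU
(cuda-gdb) cuda block 0 thread 32     # switch focus to a specific block/thread
(cuda-gdb) print i                    # inspect a variable in the focused thread (hypothetical name)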


How to use the profiler

NVIDIA Nsight Systems and NVIDIA Nsight Compute are available as performance analysis tools (profilers). Nsight Systems is useful for overall program performance, while Nsight Compute provides a detailed understanding of the GPU kernel.


In the following, only basic usage is introduced. For more details, please refer to the manual or other documentation.

How to use Nsight Systems

Information is collected by the nsys program.


Execution example (in job script)

$ nsys profile -o out_${PJM_JOBID} --stats=true ./a.out

The -o option specifies the name of the file to write the profile results to. In the above example, the file name is based on the job ID information that the batch job system sets for each job. In practice, a file like out_82448.nsys-rep is created.


Open the generated nsys-rep file with Nsight Systems (GUI program) to view the program information. Nsight Systems can also be started on login nodes with the command nsys-ui. (X-forwarding is required.) However, since clients for Windows, Linux, and macOS are distributed, it is often more comfortable to download the nsys-rep file and browse it on your PC at hand.


The --stats=true option is not required. With this option, the profile results are also displayed as text at the end of the program execution. Note that if a large amount of information is collected, the job execution time will increase accordingly.
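
For an MPI program, nsys can be placed after mpirun so that each process writes its own report. This is a sketch under the assumption that per-rank output is wanted; %q{OMPI_COMM_WORLD_RANK} is expanded by nsys to the Open MPI rank number in the output file name.

$ mpirun -n 4 --map-by ppr:1:numa --bind-to numa \
    nsys profile -o out_${PJM_JOBID}_rank%q{OMPI_COMM_WORLD_RANK} --stats=true ./a.out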

How to use Nsight Compute

Information is collected by the ncu program.


Execution example (in job script)

$ ncu -o out_${PJM_JOBID} ./a.out

The -o option specifies the name of the file to write the profile results to. In the above example, the file name is based on the job ID information that the batch job system sets for each job. In practice, a file like out_82449.ncu-rep is created.
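
Profiling with ncu can add considerable overhead because it replays kernels to collect metrics. To limit collection to a specific kernel or a limited number of launches, the standard -k (kernel name filter) and -c (launch count) options can be used; the kernel name below is hypothetical.

$ ncu -k my_kernel -c 10 -o out_${PJM_JOBID} ./a.out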


Open the generated ncu-rep file with Nsight Compute (GUI program) to view the program information. Nsight Compute can also be started on login nodes with the command ncu-ui. (X-forwarding is required.) However, since clients for Windows, Linux, and macOS are distributed, it is often more comfortable to download the ncu-rep file and browse it on your PC at hand.


When running ncu-ui on a login node, the following additional setting is required: ulimit -v 26000000. (This raises the virtual memory limit. If ncu-ui exits with an error message, try a larger value such as 28000000.)