CUDA Toolkit

Last Updated: December 6, 2024

Module

Module name    Version
cuda           11.8.0
               12.2.2 (default)
               12.6.1

Refer to the following page for the usage of modules:
Module usage


Overview

CUDA Toolkit (CUDA) is a development tool for GPUs provided by NVIDIA. Compared to writing GPU programs with OpenACC or OpenMP, CUDA requires more code, but it lets you make full use of the GPU's capabilities. We recommend using CUDA when performance is important.

For more detailed information on CUDA and the latest documentation, please check NVIDIA’s website.


Preparation for use

To use CUDA, you need to load the cuda module beforehand. At the time of writing this document, cuda/12.2.2 is the default version and 12.6.1 is the latest available (see the module table above).
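For example (specifying the version is optional; omitting it loads the default):

$ module load cuda          # loads the default version, cuda/12.2.2
$ module load cuda/12.6.1   # or request a specific version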

If you want to use MPI, you also need to load MPI-related modules. At the time of writing this document, hpcx, ompi-cuda, and ompi can be loaded after additionally loading the gcc module. All of them are CUDA-aware MPIs, and we are still investigating which one has the highest performance. (The basic usage is the same for all of them.)
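For example, to load HPC-X (ompi-cuda and ompi are loaded the same way):

$ module load gcc hpcx cuda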


How to compile a program

CUDA includes a compiler for CUDA C, which can be used with the nvcc command; even login nodes without a GPU can compile CUDA C programs.

The main compile-time options are as follows.

Option            Description
-c                Compile the input file into an object file without creating an executable.
-o filename       Name the output executable/object file filename. The default executable name is a.out.
-O                Set the optimization level for the host code. The value is passed directly to the compiler that compiles the host code (the gcc command by default), e.g. -O3.
-g, -G            Generate information for the debugger. The -g option is for host (CPU) code and -G is for device (GPU) code.
-lineinfo         Generate line number information for the profiler. Including this information is recommended when analyzing a program with the profiler.
-ccbin cmd        Specify the compiler used to compile the host code. For example, to compile the host code with the Intel compiler (icc), use -ccbin icc.
-Xcompiler option Pass options through to the backend compiler. For example, to enable OpenMP thread parallelization when compiling the host code with the Intel compiler (icc), specify -ccbin icc -Xcompiler -qopenmp.
-arch, -gencode   Specify the target GPU. These can be omitted, but specifying, for example, -arch=sm_90, or -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -gencode=arch=compute_90a,code=compute_90a, may produce a higher-performance executable that runs only on Hopper GPUs.
-I, -L, -l        Add header search paths (-I), add library search paths (-L), and specify libraries to link (-l). These behave as in most other compilers, so details are omitted.
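For example, the following invocation (a sketch combining several of the options above; the file names are arbitrary) builds an object file with profiler line information targeting Hopper GPUs:

$ nvcc -O3 -lineinfo -gencode=arch=compute_90,code=sm_90 -c ./test.cu -o test.o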

Some compile examples are shown below. Note that the standard extension for CUDA C source code is .cu.

  • Compiling a CUDA C program (non-MPI version)
$ module load cuda
$ nvcc -O3 -o test ./test.cu
  • Compiling a CUDA C program (using OpenMP for CPU parallelization, non-MPI version)
$ module load cuda
$ nvcc -O3 -Xcompiler -fopenmp -o test ./test.cu
  • Compiling a CUDA C program (MPI version)
$ module load gcc hpcx cuda
$ nvcc -O3 -lmpi -o test ./test.cu
  • Compiling a CUDA C program (using OpenMP for CPU parallelization, MPI version)
$ module load gcc hpcx cuda
$ nvcc -O3 -Xcompiler -fopenmp -lmpi -o test ./test.cu
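For reference, a minimal test.cu that the non-MPI examples above could compile might look like the following (an illustrative sketch, not part of the system documentation):

#include <cstdio>

// Each thread writes its global index into the output array.
__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main(void)
{
    const int n = 8;
    int host[n];
    int *dev;
    cudaMalloc(&dev, n * sizeof(int));    // allocate device memory
    fill<<<1, n>>>(dev, n);               // launch 1 block of n threads
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", host[i]);
    printf("\n");
    cudaFree(dev);
    return 0;
}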

How to execute a batch job

When executing a program that uses CUDA, load the same modules beforehand as when compiling.

Example of program execution using one GPU:

$ cat job_cuda.sh
#!/bin/bash
#PJM -L rscgrp=b-batch
#PJM -L elapse=10:00
#PJM -L gpu=1
#PJM -j

module load cuda
./a.out
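
The script can then be submitted with the PJM pjsub command:

$ pjsub job_cuda.sh

An MPI program could be run with a script like the following (a sketch: the gpu and proc counts are arbitrary examples, and the exact PJM MPI options and launch command depend on the system configuration):

$ cat job_cuda_mpi.sh
#!/bin/bash
#PJM -L rscgrp=b-batch
#PJM -L elapse=10:00
#PJM -L gpu=4
#PJM --mpi proc=4
#PJM -j

module load gcc hpcx cuda
mpirun -n 4 ./a.out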

How to use the debugger and profiler

See the HPC SDK page. Commands such as nsys and ncu become available by loading the cuda module.
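For example, a timeline could be collected with Nsight Systems, or a kernel-level report with Nsight Compute (the output file names here are arbitrary):

$ nsys profile -o timeline ./a.out
$ ncu -o kernels ./a.out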