Job Usage
Last Updated: July 9, 2024
Unlike the login nodes, the compute nodes in Genkai node groups A, B, and C are managed by a job management system that allocates resources in response to requests from multiple users. To run programs on these nodes, you must therefore first submit a usage request in the form of a job. This document explains how to use each node group through jobs.
Types of Jobs
In Genkai, you can execute the following two types of jobs:
Job Type | Usage |
---|---|
Batch Job | Runs a pre-written script non-interactively after it is submitted as a job. Interactive use is not possible. |
Interactive Job | Lets you log in to a compute node and work interactively. Mainly intended for short debugging sessions and pre/post-processing. |
Batch jobs are further divided into the following three types:
Batch Job Type | Usage |
---|---|
Regular Job | Submits one script and executes one job. |
Step Job | Submits multiple scripts as a single batch and executes them in a specified order. |
Bulk Job | Submits one script and generates multiple regular jobs for execution. |
Batch Job Flow
Generally, to run a program using a batch job, you follow these steps:
- Job operation commands
- Creating a job processing script
- Submitting the job (pjsub command)
- (If necessary) Checking the job status (pjstat command)
- (If necessary) Deleting a running or waiting job (pjdel command)
- Checking the results
- Checking node usage
Job Operation Commands
The commands used for job operations are as follows:
Function | Command |
---|---|
Submit Job | pjsub |
Check Job Status | pjstat |
Delete Job | pjdel |
Check Node Group Congestion | pjshowrsc |
Job Submission (pjsub Command)
Jobs can be submitted in either of the following formats. In both cases, jobs must be submitted from the large-capacity storage (/home) or the high-speed storage (/fast).
- Batch Job Format
- Interactive Job Format
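The two general command forms are sketched below. This follows the usual Fujitsu PJM syntax: square brackets mark optional parts and job_script stands for your script file. Check man pjsub on Genkai for the authoritative form.

```bash
# Batch job: submit a job script file
pjsub [options] job_script

# Interactive job: no script; the options describe the requested resources
pjsub --interact [options]
```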
Creating a Batch Processing Script
Basic Options
Option Name | Description |
---|---|
-r | Specify the reservation ID set in the reservation portal |
-o filename | Write standard output to the file filename |
-e filename | Write standard error to the file filename |
-j | Write the job’s standard error and standard output to the same file |
--interact | Run as an interactive job |
--restart | Re-execute the job if a failure occurs |
--norestart | Do not re-execute the job if a failure occurs (default) |
--mail-list mailaddress | Specify the mail destination |
-m | Specify mail notification |
-m b | Send mail when the batch job starts |
-m e | Send mail when the batch job ends |
-m r | Send mail when the job is re-executed |
-X | Pass the environment variables set at submission time to the job execution environment |
Batch Job Resource Options
The main options for specifying the resources a batch job needs are as follows. Each resource or upper limit is given as an argument to -L.
Option Name | Description |
---|---|
-L rscgrp=name | Resource group (queue) name to which the job is submitted (for details, see Resource Groups) |
-L node | Specify the number of nodes (mandatory when using more than one node) |
-L vnode-core | Specify the number of cores (mandatory when using less than one node in node group A) |
-L gpu | Specify the number of GPUs (mandatory when using less than one node in node groups B and C) |
-L elapse | Specify the maximum job execution time |
-L proc-core | Specify the maximum core file size per process (default: 0, maximum: unlimited) |
-L proc-data | Specify the maximum data segment size limit per process (default: unlimited) |
-L proc-stack | Specify the maximum stack segment size limit per process. If set to unlimited, the actual value will be 2MiB due to RHEL specifications. (default: unlimited) |
-L jobenv | Specify the job environment. If using Singularity, you must specify jobenv=singularity. |
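The sketch below shows how these resource options are typically written. It follows common Fujitsu PJM usage, and the resource values are only illustrative. You can put them as #PJM directive lines at the top of a job script, or pass them on the pjsub command line with several -L values joined by commas. The elapse value is written as hours:minutes:seconds.

```bash
# Form 1: directives at the top of the job script (one -L per line)
#PJM -L rscgrp=a-batch
#PJM -L vnode-core=4
#PJM -L elapse=0:30:00

# Form 2: the same request given on the pjsub command line
pjsub -L rscgrp=a-batch,vnode-core=4,elapse=0:30:00 go.sh
```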
Statistics Output Options
Option Name | Description |
---|---|
-s | Output statistical information of the submitted job (cannot be used with the -S option) |
-S | Output statistical information including node information of the submitted job (cannot be used with the -s option) |
Example Job Scripts
Sequential Job for Node Group A
Below is an example job script for a sequential job. It assumes a program compiled with Intel oneAPI.
Resource Specification | Details |
---|---|
Resource Group | a-batch |
Number of CPU Cores | 1 |
Elapsed Time | 1 hour |
Output Standard Error to Standard Output | Yes |
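Here is a minimal sketch of such a script. It makes two assumptions: the compiled executable is ./a.out in the submission directory, and the oneAPI environment is loaded with a module named intel. Check module avail on Genkai for the actual module name.

```bash
#!/bin/bash
#PJM -L rscgrp=a-batch
#PJM -L vnode-core=1
#PJM -L elapse=1:00:00
#PJM -j

# Load the same compiler environment used at build time (module name assumed)
module load intel

# Run the sequential program (hypothetical executable name)
./a.out
```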
Thread Parallel Job for Node Group A
Below is an example job script for a thread-parallel job. It assumes a program compiled with Intel oneAPI.
Resource Specification | Details |
---|---|
Resource Group | a-batch |
Number of CPU Cores | 30 |
Number of Threads | 30 |
Elapsed Time | 1 hour |
Output Standard Error to Standard Output | Yes |
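In this sketch, 30 cores are requested and the OpenMP thread count is set to match. It makes the same assumptions as the sequential example: a module named intel and a hypothetical executable ./a.out.

```bash
#!/bin/bash
#PJM -L rscgrp=a-batch
#PJM -L vnode-core=30
#PJM -L elapse=1:00:00
#PJM -j

module load intel            # module name assumed

# One OpenMP thread per requested core
export OMP_NUM_THREADS=30
./a.out                      # hypothetical executable
```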
Hybrid Parallel Job for Node Group A
Below is an example job script for a hybrid (MPI and thread) parallel job. It assumes a program compiled and linked with Intel oneAPI and Intel MPI.
Resource Specification | Details |
---|---|
Resource Group | a-batch |
Number of Nodes | 4 |
Number of Processes per Node | 10 |
Number of Threads per Process | 12 |
Elapsed Time | 1 hour |
Output Standard Error to Standard Output | Yes |
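Here is a hedged sketch of the hybrid configuration. It runs 4 nodes x 10 processes = 40 MPI processes, each with 12 threads, so 10 x 12 = 120 cores are used on each node. The sketch makes these assumptions:
- The modules are named intel and impi.
- The executable is ./a.out.
- Intel MPI's mpiexec detects the nodes allocated by PJM without a host file.

The -n and -ppn options and I_MPI_PIN_DOMAIN are standard Intel MPI settings.

```bash
#!/bin/bash
#PJM -L rscgrp=a-batch
#PJM -L node=4
#PJM -L elapse=1:00:00
#PJM -j

module load intel impi       # module names assumed

# 12 OpenMP threads per MPI process; pin each process to its own block of cores
export OMP_NUM_THREADS=12
export I_MPI_PIN_DOMAIN=omp

# 4 nodes x 10 processes per node = 40 MPI processes in total
mpiexec -n 40 -ppn 10 ./a.out
```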
1 GPU Job for Node Group B
Below is an example job script for a job that uses 1 GPU. It assumes a program compiled with CUDA.
Resource Specification | Details |
---|---|
Resource Group | b-batch |
Number of GPUs | 1 |
Elapsed Time | 1 hour |
Output Standard Error to Standard Output | Yes |
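This sketch requests a single GPU with -L gpu, since less than one node is used. It assumes a module named cuda and a hypothetical CUDA executable ./a.out.

```bash
#!/bin/bash
#PJM -L rscgrp=b-batch
#PJM -L gpu=1
#PJM -L elapse=1:00:00
#PJM -j

module load cuda             # module name assumed

./a.out                      # hypothetical CUDA executable
```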
2-Node Job for Node Group B
Below is an example job script for a job that uses 2 nodes.
Resource Specification | Details |
---|---|
Resource Group | b-batch |
Number of Nodes | 2 |
Number of GPUs | 8 |
Number of Processes per Node | 4 |
Elapsed Time | 1 hour |
Output Standard Error to Standard Output | Yes |
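Here is a hedged sketch for this case. It assumes node group B has 4 GPUs per node, so 2 nodes provide the 8 GPUs in the table. It runs one MPI process per GPU and assumes that each process selects its own GPU, for example from its local rank. The module names and ./a.out are assumptions; load whichever MPI the program was built with.

```bash
#!/bin/bash
#PJM -L rscgrp=b-batch
#PJM -L node=2
#PJM -L elapse=1:00:00
#PJM -j

module load cuda impi        # module names assumed

# 2 nodes x 4 processes per node = 8 processes, one per GPU
mpiexec -n 8 -ppn 4 ./a.out  # hypothetical executable
```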
Executing an Interactive Job
To execute an interactive job, specify the --interact option with the pjsub command.
The example below starts an interactive job that uses 1 node of the resource group a-inter for 1 hour.
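The following is a sketch of such a command. It assumes the resource settings are combined into one comma-separated -L argument, as is common with PJM.

```bash
# 1 node of a-inter for up to 1 hour, as an interactive job
pjsub --interact -L rscgrp=a-inter,node=1,elapse=1:00:00
```

Once the resources are allocated, you typically get a shell on the compute node, and leaving that shell with exit ends the job.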
Submitting a Batch Job
Request the processing described in the batch processing script file using the pjsub command.
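Submission is a single command; go.sh is the script file name used in this example. pjsub replies with a message that includes the assigned job ID.

```bash
pjsub go.sh
```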
In this example, the processing described in the file go.sh is requested, and job ID 1234 is assigned.
Checking Job Status
Checking the Status of Running and Waiting Jobs
To check the status of submitted jobs, use the pjstat command as follows.
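Run without arguments, pjstat lists your own running and waiting jobs. The JOB_ID and ST columns described below are part of this listing.

```bash
pjstat
```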
Here, JOB_ID represents the job number, and ST represents the current state of the job. The main job states are as follows:
Display | State |
---|---|
QUE | Waiting |
RNA | Starting |
RUN | Running |
RNO | Ending |
Viewing History
You can check the execution history from a specified number of days ago (7 days in this example) to the present using the following option:
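The following sketch uses the -H (history) option with a day= range, as on other Fujitsu PJM systems. Confirm the exact form with man pjstat.

```bash
# Show jobs from the last 7 days up to now
pjstat -H day=7
```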
- The job end status is displayed in the “PC” column (0: normal end, 1: canceled, etc.).
- For the meaning of each code, see “man pjstat”.
- If a job exceeded its memory usage limit, “12” is displayed in the “PC” column of the job history.
- The history of past jobs is deleted after a certain period.
Statistical Information for Completed Batch Jobs
You can check the statistical information for any completed job by specifying its job ID (1234 in this example) using the following option:
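The following is a hedged sketch. It assumes that pjstat's -s statistics option can be combined with the history option -H for a finished job, as on other Fujitsu PJM systems. In line with the Statistics Output Options table, -S in place of -s would add node information. Verify both with man pjstat.

```bash
# Statistics for completed job 1234 (combination of -H and -s assumed)
pjstat -H -s 1234
```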
Deleting a Batch Job
You can cancel (delete) running or waiting batch jobs using the pjdel command. Specify one or more job IDs after pjdel. Canceling a running batch job stops its execution.
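The following example deletes one job and then several jobs in one call. Job ID 1235 is hypothetical.

```bash
# Delete job 1234
pjdel 1234

# Several job IDs can be given at once
pjdel 1234 1235
```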
In this example, a request is made to delete the batch job with job ID 1234, and a message indicates that the job has been successfully deleted.
Checking the Results
If an output file is specified with the -o option of pjsub, standard output is written to that file; otherwise, it is written to a file named “[job_script_name].[job_ID].out”. If the -j option is specified, standard error is written to the same file as standard output. If -j is not specified, standard error is written to the file given with the -e option or, if -e is not specified either, to “[job_script_name].[job_ID].err”.
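Here is a short illustration of these rules using the -o, -e, and -j options from the Basic Options table. The output file names are hypothetical.

```bash
# Standard output to result.out, standard error to result.err
pjsub -o result.out -e result.err go.sh

# Standard output and standard error together in result.log
pjsub -j -o result.log go.sh
```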
Checking Resource Group Congestion
To check the congestion status of resource groups, use the pjshowrsc or show_rsc command as follows.
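The following is a hedged sketch of both commands. The --rg option (per resource group display) is taken from Fujitsu PJM's pjshowrsc, and show_rsc is assumed to need no arguments. Check the manual pages on Genkai.

```bash
# Node usage per resource group (option assumed from Fujitsu PJM)
pjshowrsc --rg

# Genkai's congestion summary command
show_rsc
```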
Executing Step Jobs
A step job treats multiple batch jobs as a single unit, with a specified execution order and dependencies among them, so that jobs can be chained. A step job consists of multiple sub-jobs, which are never executed simultaneously.
The submission format for step jobs is as follows.
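The following is a hedged sketch that assumes the Fujitsu PJM --step and --sparam syntax (jid= names the step job to append to, sd= gives the dependency expression). The scripts step1.sh and step2.sh are hypothetical, and 1234 stands for the job ID PJM reports for the first sub-job.

In the second command, the sd= expression pairs a condition from the dependency table with a deletion type from the deletion-type table. If the preceding sub-job's exit status is not 0, the second sub-job and every job depending on it are deleted.

```bash
# Submit the first sub-job, starting a new step job
pjsub --step step1.sh

# Append a second sub-job to step job 1234.
# sd=ec!=0:after : if the previous sub-job's exit status is not 0,
# delete this sub-job and all jobs that depend on it.
pjsub --step --sparam 'jid=1234,sd=ec!=0:after' step2.sh
```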
Step Job Dependency Expressions
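Condition | Description |
---|---|
NONE | No dependency |
Exit status == value[,value,...] | True if the exit status is any of the listed values |
Exit status != value[,value,...] | True if the exit status is none of the listed values |
Exit status > value | True if the exit status is greater than value |
Exit status >= value | True if the exit status is greater than or equal to value |
Exit status < value | True if the exit status is less than value |
Exit status <= value | True if the exit status is less than or equal to value |
Any value can be specified for value. With "==" and "!=", multiple values can be given separated by commas (","). For example, ec==1,3,5 is true if the exit status is 1, 3, or 5, and ec!=1,3,5 is true if the exit status is none of 1, 3, or 5.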
Deletion Types Specifiable in Step Job Dependency Expressions
Deletion Type | Description |
---|---|
one | Deletes only the specified job. |
after | Deletes the specified job and jobs dependent on it recursively. |
all | Deletes the specified job and all subsequent jobs. |
Executing Bulk Jobs
Bulk jobs execute multiple identical batch jobs simultaneously. For example, to run a program with different parameters and check each result, you would have to submit a regular batch job for each parameter set one by one; with a bulk job, you can submit all the patterns at once.
The submission format for bulk jobs is as follows:
When executing a bulk job, you can change the program’s input and output files for each sub-job using the bulk number assigned to each sub-job. The bulk number can be referenced with the environment variable PJM_BULKNUM.
Example
Eight sub-jobs read from the input files input.1 to input.8 and write to the output files output.1 to output.8, respectively.
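Here is a sketch of such a bulk job. It assumes that the Fujitsu PJM form pjsub --bulk --sparam start-end is used for submission, and that the hypothetical program ./a.out reads standard input and writes standard output. The script name bulk.sh and the resource values are also illustrative.

```bash
#!/bin/bash
# Submit with: pjsub --bulk --sparam "1-8" bulk.sh
#PJM -L rscgrp=a-batch
#PJM -L vnode-core=1
#PJM -L elapse=1:00:00
#PJM -j

# PJM sets PJM_BULKNUM to this sub-job's bulk number (1 to 8 here),
# so each sub-job reads and writes its own pair of files.
./a.out < "input.${PJM_BULKNUM}" > "output.${PJM_BULKNUM}"
```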
Related Information
For more details on using jobs in Genkai, please refer to the following: