
Batch Job Operation


To run programs on subsystems A and B of ITO, users write a sequence of operations into a text file and submit it to the ITO management system. This non-interactive way of using computers is called "batch processing", and each sequence of operations is called a "batch job". The text file that describes the contents of a batch job is called a "batch job script". Each submitted batch job is placed in a queue and executed automatically as resources become available.


Flow of Operation

The typical flow of operating batch jobs is as follows:
1. Create a batch job script.
2. Submit the script (via the pjsub command).
3. (If necessary) Query the status of the job (via the pjstat command).
4. (If necessary) Cancel the batch job (via the pjdel command).
5. Check the result of the batch job.

On subsystems A and B of ITO, the resources attached to a batch job are represented as "virtual nodes" rather than physical nodes. Each virtual node has one or more CPU cores and some amount of memory. The number of CPU cores and the amount of memory of one virtual node cannot exceed those of one physical node.


1. Create a Batch Job Script

A batch job script is a text file that describes the computations the user wants to perform on the subsystem. The following is an example:

#!/bin/bash                         <-- Specifies that this will be executed by bash.
#PJM -L "rscunit=ito-a"             
#PJM -L "rscgrp=ito-s-dbg"
#PJM -L "vnode=2"                   <-- Each field separated by a blank after #PJM is treated as an option of the pjsub command
#PJM -L "vnode-core=36"

# comment                           <-- Lines starting with #, but not #PJM, are comments
# comment

command                             <-- Write the sequence of commands to be executed, as in a shell script
command

The following are the options of the pjsub command:

1.1 Basic Options

Name                      Description
-o filename               Store standard output in the file specified by "filename"
-e filename               Store standard error in the file specified by "filename"
-j                        Store both standard output and standard error in the same file
--mail-list mailaddress   Specify the mail address to which information about the job is sent
-m parameter              Specify when to send mail
  -m b                    Send mail when the batch job starts
  -m e                    Send mail when the batch job finishes
  -m r                    Send mail when the batch job is re-executed
-X                        Pass the environment variables set at submission time to the execution environment of the job
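
As an illustration, the header of a job script that uses several of these options might look like the sketch below. The file names job.out and job.err and the address user@example.com are placeholders, not values defined by ITO, and the resource options (-L) needed for a real submission are left out here.

```shell
#!/bin/bash
# Placeholder file names and mail address; replace them with your own.
#PJM -o job.out
#PJM -e job.err
#PJM --mail-list user@example.com
#PJM -m e
#PJM -X

# Outside the batch system the #PJM lines are plain comments,
# so the body can be tried directly with bash.
message="job body finished"
echo "$message"
```

With these options, standard output and standard error go to separate files, and a mail is sent when the job finishes.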

1.2 Options about Resources of Batch Jobs

The following are frequently used options for the resource requirements of batch jobs. Each one is given as an argument to -L and specifies which resources to use and how much of each.

Name                  Description
-L rscunit=name       Name of the subsystem to use
-L rscgrp=name        Name of the resource group (= queue) to which the job is submitted
-L elapse=limit       Maximum execution time (elapsed time)
-L vnode=limit        Maximum number of physical nodes (for programs using Fujitsu MPI, specify the number of processes here)
-L vnode-core=limit   Maximum number of cores per node (for programs using Fujitsu MPI, specify the number of cores per process here)

How to Choose Resource Group, Number of Cores and Number of Nodes

  • The resource group limits the number of cores, the number of physical nodes and the amount of memory that a batch job can use.
    (See "Limits of Resources".)
  • Choose a resource group that provides enough memory for the program.
  • The values to specify for vnode and vnode-core depend on the MPI library:
    Fujitsu MPI: specify the number of processes as vnode, then choose vnode-core so that the product of vnode and vnode-core is equal to or slightly smaller than the total number of cores of the resource group.
    Otherwise: specify the number of nodes to use as vnode, and the number of cores available in the resource group as vnode-core.
  • If the product of vnode and vnode-core is far smaller than the total number of cores of the resource group, the batch job wastes the remaining cores.
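
The check described above can be sketched as a few lines of shell arithmetic. The total of 72 cores for the resource group is a hypothetical figure chosen for illustration; look up the actual limit of the resource group you intend to use.

```shell
#!/bin/bash
# Hypothetical limit: total cores offered by the chosen resource group.
group_cores=72
vnode=2          # value given to -L vnode
vnode_core=36    # value given to -L vnode-core

requested=$(( vnode * vnode_core ))
unused=$(( group_cores - requested ))

if (( requested > group_cores )); then
    echo "request exceeds the resource group: $requested > $group_cores"
elif (( unused > 0 )); then
    echo "request fits, but $unused cores stay idle"
else
    echo "request uses all $group_cores cores"
fi
```

If the number of idle cores is large, a smaller resource group or a larger vnode-core value is probably a better fit.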

Effect of Choosing Appropriate Value for the Maximum Time of Execution

Because of "backfill job scheduling", batch jobs with a shorter maximum execution time tend to start earlier than those with a longer one. This mechanism allows a batch job to overtake earlier batch jobs, as long as it does not delay the start times of other batch jobs. Gaps that appear among large-scale batch jobs can thus be filled with shorter batch jobs. Note, however, that a job is terminated if it exceeds its time limit.
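
One way to pick a limit that is short but still safe is to time a trial run and add a margin. The sketch below assumes a hypothetical trial run of 480 seconds and a 25% margin, and prints the limit in HH:MM:SS form; that form is an assumption here, since the example in section 6 only shows the shorter "10:00" notation.

```shell
#!/bin/bash
# Convert a run time in seconds to an HH:MM:SS string.
format_elapse() {
    local total=$1
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

measured=480                              # hypothetical: a trial run took 8 minutes
padded=$(( measured + measured / 4 ))     # add a 25% safety margin
limit=$(format_elapse "$padded")

# Print the directive line to paste into the job script.
echo "#PJM -L \"elapse=$limit\""
```

A tighter limit improves the chance of being backfilled, but too small a margin risks the job being terminated before it finishes.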

1.3 Options for Statistics Information

Name   Description
-s     Output statistics information of the batch job.
       (Cannot be specified together with -S)
-S     Output statistics information of the batch job and of the nodes used by the batch job.
       (Cannot be specified together with -s)

Statistics information of a batch job, such as the amount of memory consumed, is stored in the file "Filename_of_the_batch_job_script".i"Job_ID". The file name can be changed with the --spath option. Use -S if per-node (or per-process) information is required.

Batch jobs that were killed for exceeding the memory limit can be identified with the command introduced in section 3.2.


2. Submitting Batch Jobs (pjsub command)

Submit the batch job script created in section 1 with the pjsub command.

$ pjsub go.sh
[INFO] PJM 0000 pjsub Job 1234 submitted.
$

This example submits the operations stored in the file go.sh and receives 1234 as the ID of the batch job.
Refer to the "Job Operation Software End-user's Guide" for details of the options.
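
When later steps (pjstat, pjdel) are scripted, it can help to keep the job ID in a variable. The following is a minimal sketch that extracts the ID from a message of the form shown above. The message is hard-coded so that the snippet runs anywhere; on ITO it would be captured from the pjsub command itself, and the exact message format should be checked against the real output.

```shell
#!/bin/bash
# Sample message modeled on the example above; in a real session it would
# come from:  output=$(pjsub go.sh)
output='[INFO] PJM 0000 pjsub Job 1234 submitted.'

# Take the word that follows "Job".
job_id=$(printf '%s\n' "$output" | awk '{ for (i = 1; i < NF; i++) if ($i == "Job") print $(i + 1) }')
echo "submitted job: $job_id"
```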


3. Query for the Status of Batch Jobs (pjstat command)

3.1 Query for the Status of Running or Waiting Batch Jobs

The pjstat command shows the status of batch jobs that are currently running or waiting.

Refer to "Job Operation Software End-user's Guide" for details.


3.2 Query for the History of Batch Jobs

The following command shows the history of batch jobs executed in the last xx days, where xx is given by day=xx (default: 7 days).

$ pjstat -H day=7 -v

  • The "PC" column shows the final status of the batch job (0: Success, 1: Canceled, etc.).
  • The meaning of each status code can be checked with the command "man pjstat".
  • "12" means that the job was stopped because it exceeded the memory limit.
  • History older than a certain period is deleted.


3.3 Query for the Statistical Information of Completed Batch Jobs

The following command shows the statistical information of the job with the specified job ID.

$ pjstat -H -S 1234

4. Cancelation of Batch Jobs (pjdel command)

The pjdel command cancels batch jobs that are running or waiting. Specify the job ID(s) to cancel as arguments of the command. Canceling a running batch job stops its program.

$ pjdel 1234
[INFO] PJM 0100 pjdel Job 1234 canceled.
$

In this example, cancelation of the job with ID 1234 was requested, and the request was completed.
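
To cancel several jobs at once, the IDs can be collected in a list first. The sketch below only prints the command it would run; the job IDs are hypothetical, and passing several IDs separated by blanks is an assumption based on the "job ID(s)" wording above.

```shell
#!/bin/bash
job_ids=(1234 1235)            # hypothetical job IDs
cmd=(pjdel "${job_ids[@]}")    # build the command as an array

echo "would run: ${cmd[*]}"
# "${cmd[@]}"                  # uncomment on ITO to actually cancel the jobs
```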


5. Check Results

The standard output of a batch job is stored in the file specified by the -o option of pjsub. If that option is not specified, the file "batch_job_script_filename".o"job_ID" is used. The standard error of a batch job is stored in the same file as the standard output if the -j option is specified. Otherwise, if a file name is given with the -e option, the standard error is stored in that file. If neither -j nor -e is specified, the file "batch_job_script_filename".e"job_ID" is used.
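
The rules above can be written out as a small shell function. This is only a sketch of the naming logic as described in this section, not part of the batch system; the function name resolve_outputs and its argument order are made up for illustration.

```shell
#!/bin/bash
# Resolve where stdout and stderr end up, following the rules above.
# Arguments: script name, job ID, -o value, -e value, "yes" if -j was given.
resolve_outputs() {
    local script=$1 job_id=$2 opt_o=$3 opt_e=$4 opt_j=$5
    local out="${opt_o:-$script.o$job_id}"    # default: <script>.o<job ID>
    local err
    if [ "$opt_j" = "yes" ]; then
        err=$out                              # -j: stderr joins stdout
    else
        err="${opt_e:-$script.e$job_id}"      # default: <script>.e<job ID>
    fi
    echo "stdout=$out stderr=$err"
}

resolve_outputs go.sh 1234 "" "" ""          # no options
resolve_outputs go.sh 1234 "" "" yes         # -j only
resolve_outputs go.sh 1234 run.log "" yes    # -o run.log -j
```

For the job submitted in section 2 without any of these options, the results would be in go.sh.o1234 and go.sh.e1234.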


6. Example of Batch Job Script

The following is an example of a batch job script that executes a non-parallel program. Scripts for executing parallelized programs depend on the parallelization method, compiler and libraries used; please refer to the web pages of the compilers. In addition, some applications may require additional settings. Please feel free to ask the customer support about the execution of applications.


  • Number of nodes: 1 node
  • Number of cores per node: 9 cores (fixed for this resource group)
  • Maximum execution time: 10 minutes
  • Store standard output and standard error in the same file: True
  • Resource group: ito-g-1-dbg

    #!/bin/bash
    #PJM -L "rscunit=ito-b"
    #PJM -L "rscgrp=ito-g-1-dbg"
    #PJM -L "vnode=1"
    #PJM -L "vnode-core=9"
    #PJM -L "elapse=10:00"
    #PJM -j
    #PJM -X

    ./a.out

    Meaning of each line:
      #PJM -L "rscunit=ito-b"        Name of the subsystem (= ito-b)
      #PJM -L "rscgrp=ito-g-1-dbg"   Name of the resource group (= ito-g-1-dbg)
      #PJM -L "vnode=1"              Number of virtual nodes (= 1)
      #PJM -L "vnode-core=9"         Number of cores per node (= 9 for ito-g-1-dbg)
      #PJM -L "elapse=10:00"         Maximum execution time
      #PJM -j                        Store stdout and stderr in the same file
      #PJM -X                        Pass environment variables on to the batch job execution
      ./a.out                        Execute the program