Queue Submission

The Queue node provides a way to automatically construct submission scripts, submit jobs to a queueing system such as Slurm, and check the status of those jobs. The node has been constructed to be flexible, allowing for complete customization. This node is demonstrated in the Ex. 6: Generic model submission example.

To get started quickly, a template can be loaded by selecting it from the Load template drop-down list. The options can then be edited on the Options, Commands, and Script tabs.

Each directory provided by the directories terminal is treated as a single job. The Finished directories terminal is a list of the directories whose jobs have exited the queue. The Finished mask terminal provides a boolean list of the same length as the directories input, where an entry is True if the corresponding job has exited the queue and False if it is still running or still queued.
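For example, if three directories are provided and only the first job has left the queue, the terminals would contain something like the following (hypothetical paths):

directories:          /path/to/run_001, /path/to/run_002, /path/to/run_003
Finished directories: /path/to/run_001
Finished mask:        True, False, False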

The jobs can be manually submitted by pressing the Submit button or automatically submitted by running the sheet.

Note

Just because a job is listed as Done in the job table and appears in the Finished directories terminal does not mean that the job completed successfully; it simply means that the job is no longer in the queue.

Jobs

The Jobs tab lists the Job ID from the queue system, which Queue the job has been submitted to, the Status, and the path to the job location.

Pressing the refresh button checks the status of the jobs, and while the refresh button remains checked the status is checked automatically every second. Status checking stops when the refresh button is unchecked or when all the jobs are finished.

../_images/queue_jobs.png

Options

Submission options are controlled on the Options tab. Queues to submit to can be selected and added. If multiple queues are selected, the jobs will be evenly distributed. A job name can be provided in the Job name field.

This node allows multiple runs to be packaged into a single job submission, which can be run either sequentially (serial) or concurrently (parallel). If multiple runs are used, the Run CMD is written once per run in the submission script (replacing the ${cmd} variable), so the Run CMD needs to support multiple runs. For Slurm jobs, this should include using srun and specifying the run directory with --chdir.

Runs per job specifies the number of runs in a single job. If the jobs are to be run concurrently, select the concurrent checkbox. If the concurrent checkbox is selected, an & will be appended to each Run CMD and a wait will be placed after all the run commands.

For example, with a Run CMD of srun --chdir=${cwd} python run.py, Runs per job set to 4, and the concurrent checkbox selected, the ${cmd} variable in the job script is replaced with:

srun --chdir=/path/to/run_001 python run.py &
srun --chdir=/path/to/run_002 python run.py &
srun --chdir=/path/to/run_003 python run.py &
srun --chdir=/path/to/run_004 python run.py &
wait
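
For comparison, if the concurrent checkbox is not selected, the same settings produce the runs one after another, without the trailing & on each command and without the final wait:

srun --chdir=/path/to/run_001 python run.py
srun --chdir=/path/to/run_002 python run.py
srun --chdir=/path/to/run_003 python run.py
srun --chdir=/path/to/run_004 python run.py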

Since some queues limit the number of jobs a user may submit, this limit can be entered in the Maximum jobs field to prevent the Queue node from submitting too many jobs. If there are more jobs than the limit allows, the Queue node waits until jobs have finished before submitting additional ones. This check, and the submission of any waiting jobs, occurs when the refresh button is pressed.

Note

The variables used in the Script are displayed in square brackets, [].

../_images/queue_options.png

Commands

The Commands tab is where the queue-manager-specific submission and status commands are provided, along with the regular expressions used to extract information from the stdout of those commands.

Slurm example

For Slurm the submission command is sbatch <submission_script>, so the Submission command would be sbatch. The node will add the correct path to the submission script automatically. The sbatch command returns the job ID via stdout, which looks like:

Submitted batch job 123456

The job ID can be extracted with a simple regular expression that looks for an integer, which is entered in the Job ID regex field:

(\d+)
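
As a rough illustration of what this regular expression captures (run outside the node, and assuming the field accepts Python-style regular expressions), the job ID can be extracted like this:

import re

# Example sbatch output from the documentation above.
stdout = "Submitted batch job 123456"
match = re.search(r"(\d+)", stdout)
print(match.group(1))  # prints: 123456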

Similarly, a job status can be checked by calling squeue -j <job_id>, so the Status command field would be squeue -j ${job_id}. The squeue command returns the status, along with other information, to stdout:

> squeue -j 123456
JOBID   PARTITION       NAME      USER ST       TIME     NODES NODELIST(REASON)
123456    general     my_job    myname  R    2:23:43        10 n[0945-0954]

> squeue -j 12
slurm_load_jobs error: Invalid job id specified

The job status can be one of:

  • R - Job is running on compute nodes

  • PD - Job is pending (waiting for compute nodes)

  • CG - Job is completing

The job status can be extracted from the stdout with the following regular expression:

\s(R|PD|CG)\s
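
As a similar sketch, again assuming Python-style regular expressions and run outside the node, the status code can be pulled out of the squeue output:

import re

# Example squeue output from the documentation above.
stdout = (
    "JOBID   PARTITION       NAME      USER ST       TIME     NODES NODELIST(REASON)\n"
    "123456    general     my_job    myname  R    2:23:43        10 n[0945-0954]\n"
)
match = re.search(r"\s(R|PD|CG)\s", stdout)
print(match.group(1))  # prints: R
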
../_images/queue_cmds.png

Script

The last tab provides a text editor where the submission script is edited. This script can be completely customized and will be used as the submission script after the variable tags, ${variable}, are replaced. The scripts are saved in the current working directory, named queue_submit.script######, where the # symbols are replaced with a six-character hash.

The following variables can be used throughout the script:

  • ${job_name} - Job name as specified on the Options tab

  • ${queue} - Queue or partition name as specified on the Options tab

  • ${cwd} - Current working directory of the job (or the parent directory of the runs if more than one run is included in a job)

  • ${cmd} - The run command as specified on the Options tab
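
The tag replacement can be illustrated outside the node with Python's string.Template, which uses the same ${variable} syntax (a minimal sketch with hypothetical values, not the node's actual implementation):

from string import Template

# Hypothetical fragment of a submission script using the variables above.
script = Template(
    "#SBATCH --job-name=${job_name}\n"
    "#SBATCH --partition=${queue}\n"
    "#SBATCH --chdir=${cwd}\n"
    "${cmd}\n"
)
print(script.substitute(
    job_name="my_job",
    queue="general",
    cwd="/path/to/run_001",
    cmd="srun --chdir=/path/to/run_001 python run.py",
))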

Example Slurm submission script

#!/bin/bash -l

## The name for the job.
#SBATCH --job-name=${job_name}
##
## Number of cores to request (each node has 40 cores)
#SBATCH --tasks=40
##
## Queue Name (general, bigmem, gpu)
#SBATCH --partition=${queue}
##
## Working directory
#SBATCH --chdir=${cwd}

## Load Modules (run "module avail" for list)
module load anaconda

## Run the job
${cmd}
../_images/queue_script.png