.. _hpc-queue:

Queue Submission
================

The ``Queue`` node provides a way to automatically construct submission
scripts, submit jobs to a queueing system such as `Slurm`_, and check the
status of the jobs. The node has been constructed to be flexible, allowing for
complete customization. This node is demonstrated in the :ref:`sma-ex6`
example.

To quickly get started, there are templates that can be loaded by selecting a
template from the ``Load template`` drop-down list. The options can then be
edited on the ``Options``, ``Commands``, and ``Script`` tabs.

Each directory provided by the ``directories`` terminal is treated as a single
job. The ``Finished directories`` terminal is a list of directories whose jobs
have exited the queue. The ``Finished mask`` terminal provides a boolean list
of the same size as the ``directories`` input, where an entry is ``True`` if
the job has exited the queue and ``False`` if the job is still running or
still queued. Jobs can be submitted manually by pressing the ``Submit`` button
or automatically by running the sheet.

.. Note::

   Just because a job is listed as ``Done`` in the job table and appears in
   the ``Finished directories`` terminal does not mean that the job completed
   successfully; it simply means that the job is no longer in the queue.

Jobs
----

The ``Jobs`` tab lists the ``Job ID`` from the queue system, which ``Queue``
the job has been submitted to, the ``Status``, and the path to the job
location. The |refresh| button checks the status of the jobs when pressed and
automatically re-checks the status every second while the |refresh| button is
checked. The job status stops being checked when the |refresh| button is
unchecked or when all the jobs are finished.

.. |refresh| image:: ../../../nodeworks/images/refresh.svg

.. figure:: ./images/queue_jobs.png
   :align: center
   :scale: 75 %

Options
-------

Submission options are controlled on the ``Options`` tab. Queues to submit to
can be selected and added. If multiple queues are selected, the jobs will be
evenly distributed among them. A job name can be provided in the ``Job name``
field.

This node allows multiple runs to be packaged into a single job submission and
run either sequentially (serial) or concurrently (parallel). If multiple runs
are used, the ``Run CMD`` will be written multiple times in the submission
script (replacing the ``${cmd}`` variable), so the ``Run CMD`` needs to
support multiple runs. For Slurm jobs, this should include using ``srun`` and
specifying the run directory with ``--chdir``. ``Runs per job`` specifies the
number of runs in a single job. If the runs are to be executed concurrently,
select the ``concurrent`` checkbox. When the ``concurrent`` checkbox is
selected, an ``&`` is appended to each ``Run CMD`` and a ``wait`` is placed
after all the run commands. For example, using a ``Run CMD`` of
``srun --chdir=${cwd} python run.py``, 4 ``Runs per job``, and the
``concurrent`` checkbox selected will replace the ``${cmd}`` variable in the
job script with:

.. code-block::

   srun --chdir=/path/to/run_001 python run.py &
   srun --chdir=/path/to/run_002 python run.py &
   srun --chdir=/path/to/run_003 python run.py &
   srun --chdir=/path/to/run_004 python run.py &
   wait
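The expansion above can be reproduced with a few lines of Python. The sketch
below only illustrates the expansion rule described in this section; it is not
the node's own code, and the function name and arguments are invented for the
example:

.. code-block:: python

   from string import Template


   def expand_run_cmd(run_cmd, run_dirs, concurrent=False):
       """Illustrative expansion of a Run CMD into the ${cmd} block."""
       lines = []
       for run_dir in run_dirs:
           # Replace ${cwd} with the directory of each run.
           line = Template(run_cmd).safe_substitute(cwd=run_dir)
           # Concurrent runs are backgrounded with "&".
           lines.append(line + " &" if concurrent else line)
       if concurrent:
           # A final "wait" keeps the job alive until all runs finish.
           lines.append("wait")
       return "\n".join(lines)


   print(expand_run_cmd(
       "srun --chdir=${cwd} python run.py",
       ["/path/to/run_001", "/path/to/run_002"],
       concurrent=True,
   ))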
Since some queues have user limits on the number of jobs that can be
submitted, this limit can be entered in the ``Maximum jobs`` field. This
prevents the ``Queue`` node from submitting too many jobs. If there are more
jobs than allowed, the ``Queue`` node will wait until jobs have finished
before submitting additional jobs. This check and submission of jobs occurs
when the |refresh| button is pressed.

.. note::

   The variables used in the ``Script`` are displayed in the ``[ ]``.

.. figure:: ./images/queue_options.png
   :align: center
   :scale: 75 %

Commands
--------

The ``Commands`` tab is where the queue-manager-specific submission and status
commands are provided, along with regular expressions that are used to extract
information from the ``stdout`` of those commands.

Slurm example
+++++++++++++

For `Slurm`_ the submission command is ``sbatch <script>``, so the
``Submission command`` field would be ``sbatch``. The node will add the
correct path to the submission script automatically. The ``sbatch`` command
returns the job id via ``stdout``, which looks like:

.. code-block::

   Submitted batch job 123456

The job id can be extracted with a simple regular expression that looks for an
integer, which is entered in the ``Job ID regex`` field:

.. code-block::

   (\d+)

Similarly, the job status can be checked by calling ``squeue -j <job_id>``, so
the ``Status command`` field would be ``squeue -j ${job_id}``. The ``squeue``
command returns the status, along with other information, to ``stdout``:

.. code-block::

   > squeue -j 123456
       JOBID PARTITION   NAME     USER ST     TIME  NODES NODELIST(REASON)
      123456   general my_job   myname  R  2:23:43     10 n[0945-0954]

   > squeue -j 12
   slurm_load_jobs error: Invalid job id specified

The job status can be one of:

* ``R`` - Job is running on compute nodes
* ``PD`` - Job is waiting for compute nodes
* ``CG`` - Job is completing

The job status can be extracted from the ``stdout`` with the following regular
expression:

.. code-block::

   \s(R|PD|CG)\s

.. figure:: ./images/queue_cmds.png
   :align: center
   :scale: 75 %

Script
------

The last tab provides a text editor where the submission script is edited.
This script can be completely customized and will be used as the submission
script after the variable tags, ``${variable}``, have been replaced. The
scripts are saved in the current working directory and named
``queue_submit.script######``, where the ``#`` symbols are replaced with a
6-character hash. The following variables can be used throughout the script:

* ``${job_name}`` - Job name as specified on the ``Options`` tab
* ``${queue}`` - queue or partition name as specified on the ``Options`` tab
* ``${cwd}`` - current working directory of the job (or the parent directory
  of the jobs if there is more than one job in a script)
* ``${cmd}`` - The actual run command as specified on the ``Options`` tab

Example Slurm submission script:

.. code-block::

   #!/bin/bash -l
   ## The name for the job.
   #SBATCH --job-name=${job_name}
   ##
   ## Number of cores to request (each node has 40 cores)
   #SBATCH --tasks=40
   ##
   ## Queue Name (general, bigmem, gpu)
   #SBATCH --partition=${queue}
   ##
   ## Working directory
   #SBATCH --chdir=${cwd}

   ## Load Modules (run "module avail" for list)
   module load anaconda

   ## Run the job
   ${cmd}

.. figure:: ./images/queue_script.png
   :align: center
   :scale: 75 %

.. _Slurm: https://slurm.schedmd.com/overview.html
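The ``${variable}`` tags follow the same ``${name}`` syntax used by Python's
``string.Template``. Purely as an illustration of the substitution step
described above (this sketch is not the node's implementation, and the values
are made up for the example), a shortened version of the script could be
filled in like this:

.. code-block:: python

   from string import Template

   # A shortened stand-in for the submission script shown above.
   script = Template(
       "#!/bin/bash -l\n"
       "#SBATCH --job-name=${job_name}\n"
       "#SBATCH --partition=${queue}\n"
       "#SBATCH --chdir=${cwd}\n"
       "${cmd}\n"
   )

   # Hypothetical values standing in for the entries on the Options tab.
   print(script.safe_substitute(
       job_name="my_job",
       queue="general",
       cwd="/path/to/run_001",
       cmd="srun --chdir=/path/to/run_001 python run.py",
   ))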