Automatically requeue a job that reached the wall-time limit

ebreard6 · October 11, 2022, 1:45pm

Hi,

I am running an MPI job on a Slurm cluster that has a wall-time limit of 24 h. I need the MFiX run to continue as restart_1, i.e. to resubmit itself every time it reaches the wall-time limit. Does anyone have a suggestion on how to achieve that?

I tried:

jobid=$(sbatch --parsable test.sh)
sbatch --dependency=afterok:$jobid test.sh

but the dependency is never satisfied once the first run reaches the wall-time limit:

2433252 standard test1 ebreard PD 0:00 1 (DependencyNeverSatisfied)

Thanks for the help.

ebreard6 · October 11, 2022, 3:08pm

jobid=$(sbatch --parsable test.sh)
sbatch --dependency=afterany:$jobid test.sh

This restarts the job once.

To restart it 6 times:

jobid1=$(sbatch --parsable test.sh)
jobid2=$(sbatch --parsable --dependency=afterany:$jobid1 test.sh)
jobid3=$(sbatch --parsable --dependency=afterany:$jobid2 test.sh)
jobid4=$(sbatch --parsable --dependency=afterany:$jobid3 test.sh)
jobid5=$(sbatch --parsable --dependency=afterany:$jobid4 test.sh)
jobid6=$(sbatch --parsable --dependency=afterany:$jobid5 test.sh)
jobid7=$(sbatch --parsable --dependency=afterany:$jobid6 test.sh)

This is the solution. Note that it will start the next runs even if the simulation crashed. You also need to make sure restart_1 is already set in the *.mfx file used by the *.sh script; otherwise an extra step is needed to change new to restart_1.
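The chain above can also be written as a loop. The reason afterany works where afterok did not is that afterok only releases a job when its predecessor finishes with exit code zero, which a job killed at the wall-time limit does not. Below is a minimal sketch, assuming the same test.sh; submit_chain is a hypothetical helper name, not a Slurm command.

```shell
# Sketch: submit one initial job plus N restarts, each one waiting
# for the previous job to end (in any state) before it starts.
# submit_chain is a hypothetical helper, not part of Slurm.
submit_chain() {
    local script="$1"     # batch script to submit, e.g. test.sh
    local restarts="$2"   # number of restarts after the first run
    local jobid i=0

    # The first job has no dependency.
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}    # --parsable may print "jobid;cluster"
    echo "$jobid"

    # Every later job depends on the one submitted just before it.
    while [ "$i" -lt "$restarts" ]; do
        jobid=$(sbatch --parsable --dependency=afterany:"$jobid" "$script") || return 1
        jobid=${jobid%%;*}
        echo "$jobid"
        i=$((i + 1))
    done
}
```

For example, `submit_chain test.sh 6` would submit the first run and six restarts, printing one job ID per line. As with the explicit version, the later jobs start even if an earlier one crashed.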

jeff.dietiker · October 11, 2022, 3:58pm

If the queue time limit is 24 hours, please make sure you are using the max wall time feature in the run pane, or set the following keywords in the .mfx file:

chk_batchq_end = True
batch_wallclock = 86400.0
term_buffer = 600.0

This will make sure the run terminates cleanly 10 minutes before the 24-hour time limit is reached. A restart file will be written and MFiX will exit. This is done 1) to avoid corrupting the restart file if MFiX happens to be terminated by the queue system while the restart file is being written, and 2) to have a restart file with the latest results, independent of the restart file write frequency.
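The two values above are in seconds: 24 h is 86400 s and 10 min is 600 s. For a different queue limit, a small conversion helper can keep the numbers consistent with the Slurm time limit. This is only a sketch; to_seconds is a hypothetical helper, not part of MFiX or Slurm, and it assumes the limit is written as HH:MM:SS.

```shell
# Sketch: convert an HH:MM:SS time limit into seconds,
# e.g. to derive batch_wallclock and term_buffer values.
# to_seconds is a hypothetical helper, not part of MFiX or Slurm.
to_seconds() {
    # awk treats fields such as "08" as decimal numbers.
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

to_seconds 24:00:00   # wall-clock limit in seconds
to_seconds 00:10:00   # termination buffer in seconds
```

The first call prints 86400 and the second prints 600, matching the batch_wallclock and term_buffer values above.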

cgw · October 11, 2022, 4:02pm

Note that due to a bug in MFiX (confusion between wall time and CPU time), using this feature may cause DMP jobs to terminate prematurely; see the thread "Time issue during running the simulation (REQUESTED CPU TIME LIMIT REACHED)".

This is at the top of the list of bugs to fix for 22.4!

– Charles

cgw · October 11, 2022, 4:14pm

You can safely use batch_wallclock for DMP jobs on your cluster. The bug I refer to seems to only apply to SMP jobs. Sorry for the confusion.