What does "REQUESTED CPU TIME LIMIT REACHED" mean?

Hi, everyone

Could you please tell me what “REQUESTED CPU TIME LIMIT REACHED” means?
While my simulation was running, no errors were displayed, but it stopped with the message “REQUESTED CPU TIME LIMIT REACHED”. I found that the wall time limit is set to 9000 s by default. I changed it to 90000 s, and now the run continues without any errors.
But I am still confused: why is the wall time limit preset to 9000 s? Or is my way of solving this problem actually not right?

Hi Wuming,
Let me try to answer your question as I originally implemented this feature, which was later adapted for the recent versions of MFIX by Mark.
When you run MFIX in a batch queue environment, you submit a job script that launches the MFIX executable once the job gains priority and starts running. This setting ensures that the MFIX run terminates properly, saving all files without losing recently computed results when the batch queue session ends. When you run an interactive session without a batch queue, you simply trigger the shutdown manually. In a batch queue session, however, the user must specify a wall-clock duration for the run when requesting the session. You then need to inform MFIX about this setting through a parameter, which must be less than or equal to the batch queue wall-clock duration. It is a good idea to leave some buffer time in case writing the files takes longer for high-resolution cases.
I think the 9000 seconds (2.5 hours) must have been the default setting, which you can adjust based on your needs and queue duration. However, when I checked the source code, it appears to be set to 2 days (172800 s) in init_namelist.f:
BATCH_WALLCLOCK = 172800.0
You can learn more by examining the “check_bqend.f” file in the MFIX source tree.
Always make sure BATCH_WALLCLOCK is set to less than the wall-clock duration specified for your batch queue if you want all recent iterations to be saved properly before the run terminates cleanly.
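As a rough illustration of the buffer advice above, this short Python sketch derives a BATCH_WALLCLOCK value from a queue limit. The helper name batch_wallclock_seconds and the 10-minute buffer are assumptions made for the example, not part of MFIX; choose a buffer that matches how long your case takes to write its files.

```python
def batch_wallclock_seconds(queue_limit_hours, buffer_minutes=10.0):
    """Return a BATCH_WALLCLOCK value (seconds) below the queue limit.

    Hypothetical helper, not part of MFIX. The buffer is an assumed
    allowance for writing output files before the queue ends the job.
    """
    queue_limit_s = queue_limit_hours * 3600.0
    buffer_s = buffer_minutes * 60.0
    if buffer_s >= queue_limit_s:
        raise ValueError("buffer must be shorter than the queue limit")
    return queue_limit_s - buffer_s


# Example: a 10-hour queue limit with an assumed 10-minute buffer.
value = batch_wallclock_seconds(10.0)
print(f"BATCH_WALLCLOCK = {value:.1f}")  # prints "BATCH_WALLCLOCK = 35400.0"
```

The printed line can then be copied into your input file, adjusting the buffer if file writing turns out to take longer.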

I hope this answers your question.

Aytekin

Thank you, Dr. Aytekin. Your answer is very detailed.

To save time in a batch queue, in my opinion it would be better to add some statements to the program that check whether the run has stopped or is still going, and whether the computed data has been stored completely. The time needed to finish these operations is hard to estimate exactly for the first run, so the limit tends to be set larger than necessary. Right?

Wuming,

Again, the primary reason this feature was added is HPC systems with strict batch queue limits: our jobs were being killed before MFIX had a chance to save the most recent data. In all of these cases we need to restart many times to finish the simulation. For example, suppose you are running on an HPC system with a queue limit of max wall-clock time <= 10 hours. You know your run will require more than that, so you set BATCH_WALLCLOCK accordingly. Once your first session finishes, you automatically resubmit to the queue to restart from where you left off for another 10 hours. This feature basically tells MFIX that it has a maximum of 10 hours to run and then needs to shut down cleanly no matter what.
You don’t need to enable this feature if you don’t need it, for example if you are running on your own dedicated HPC cluster.
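To make the restart chain concrete, here is a small Python sketch that estimates how many queue submissions a run needs. The function name sessions_needed and the 36-hour total are hypothetical values chosen for the example; the estimate assumes every session runs for its full limit and ignores restart overhead.

```python
import math


def sessions_needed(total_run_hours, session_limit_hours):
    """Estimate the number of queue submissions for a long run.

    Hypothetical helper, not part of MFIX. Assumes each session uses its
    full wall-clock limit and that restarting costs no extra time.
    """
    if total_run_hours <= 0 or session_limit_hours <= 0:
        raise ValueError("both durations must be positive")
    return math.ceil(total_run_hours / session_limit_hours)


# Example: an assumed 36-hour run on a queue limited to 10 hours per session.
print(sessions_needed(36.0, 10.0))  # prints 4
```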
I hope this clarifies.

Aytekin
