Dear All,
How do you estimate the total core hours a task will require when configuring and using a computing cluster? And how do you determine the following parameters?
#SBATCH --nodes
#SBATCH --ntasks
#SBATCH --cpus-per-task
#SBATCH --mem
Thank you.
Ju,
I would read your university's documentation on how to use Slurm directives. How many nodes and tasks you need will depend entirely on your cluster's settings and the computational overhead of your simulation. For example, if your job requires NODESI * NODESJ * NODESK = 48 cores and your university cluster has 16 cores/node, you would need to set --ntasks=48 and --nodes=3 or more. Some clusters have a testing partition (e.g. --partition=shas-testing) that allows you to run a job for a short period and estimate the total time required from that.
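For that 48-core example, the top of the submission script might look something like this (a minimal sketch; the wall time is a placeholder, and shas-testing is a partition on my cluster, so substitute values from your own cluster's documentation):

  #!/bin/bash
  #SBATCH --nodes=3                  # ceil(48 tasks / 16 cores per node)
  #SBATCH --ntasks=48                # one MPI task per core: NODESI*NODESJ*NODESK
  #SBATCH --time=01:00:00            # placeholder wall-time limit
  #SBATCH --partition=shas-testing   # placeholder; use a partition you have access to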
I have never set --cpus-per-task or --mem before; I would check and make sure you really need to change these from their defaults.
Thank you Julia, here is the page on how to use SBATCH on SCITAS:
https://scitas-data.epfl.ch/confluence/display/DOC/Using+the+clusters
And I have built DMP and SMP solvers with the command build_mfixsolver [--dmp]/[--smp]. I have tested SMP in Sinteract mode with the command OMP_NUM_THREADS=4 ./mfixsolver -f DES_FB1.mfx. I want to know how to use SBATCH to configure DMP/SMP tasks now that the SMP and DMP solvers have been built. Can I directly run a command such as "mpirun -np 4 ./mfixsolver -f DES_FB1.mfx NODESI=2 NODESJ=2" under /home/username for a DMP task? How can I combine srun with the above tasks? Thank you in advance for your answers.
Ju,
If your cluster documentation does not indicate how many cores/node are on the cluster partition you have access to, you will need to reach out to the research computing staff/cluster admins at your university.
It sounds like you may be confusing a few concepts, so I will try to clarify below. When you compile the solver, you create an mfixsolver executable that is now "waiting" to be run on whatever computing resources you allocate to it. srun (Slurm's own launcher) and mpirun (the launcher that ships with your MPI library) are both commands for launching MPI applications, so I don't know what you mean by "how to combine srun with the above tasks": you use either srun or mpirun, not both (although generally speaking, according to my cluster's documentation, mpirun seems to be preferred over srun if you have the option).
The general workflow is: build your solver → create a Linux bash script to submit the job to your cluster with #SBATCH directives → submit the bash script.
As an example, let's say you're sitting in /home/username/ and you've successfully compiled your MFIX solver for DMP with 4 processors. Now you have an mfixsolver executable waiting to be run. Assuming you have logged into your command terminal on a node capable of submitting job scripts, you need to: (1) write a submission script containing the #SBATCH directives and the mpirun command, and (2) submit it with sbatch.
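A submission script for this 4-core DMP case might look like the sketch below (I am calling it sbatchrun.sh; the wall time is a placeholder, and --nodes=1 assumes your nodes have at least 4 cores):

  #!/bin/bash
  #SBATCH --nodes=1          # 4 tasks fit on a single node here
  #SBATCH --ntasks=4         # NODESI * NODESJ = 2 * 2 = 4 MPI tasks
  #SBATCH --time=01:00:00    # placeholder wall-time limit

  mpirun -np 4 ./mfixsolver -f DES_FB1.mfx NODESI=2 NODESJ=2

You would then submit it with "sbatch sbatchrun.sh".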
If you don't submit a shell script with #SBATCH directives and instead just type "mpirun -np 4 ./mfixsolver -f DES_FB1.mfx NODESI=2 NODESJ=2" directly in the command line, I think it will try to run the job on the node your terminal is connected to. Even if this command ran successfully for you, it is not good practice because you are bypassing the Slurm scheduler and probably running directly on login nodes or on computing resources not specifically allocated for running jobs.
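For the SMP solver you asked about, I have not run one under Slurm myself, but based on your Sinteract test a script along these lines should be close (a sketch; the 4 threads mirror your OMP_NUM_THREADS=4 test, and the wall time is a placeholder):

  #!/bin/bash
  #SBATCH --nodes=1            # SMP runs within a single node
  #SBATCH --ntasks=1           # one process...
  #SBATCH --cpus-per-task=4    # ...with 4 OpenMP threads
  #SBATCH --time=01:00:00      # placeholder wall-time limit

  export OMP_NUM_THREADS=4
  ./mfixsolver -f DES_FB1.mfx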
Thank you Julia. I want to know: if I set the time in the ".run" file for SBATCH and the simulation is still not finished when the time runs out, can I resume the simulation later with the same command/solver? And can I run the .mfx file with another command/solver without deleting the intermediate files? And is the .RES file for resuming a project unique to each kind of solver? Thank you in advance.
Ju,
Assuming your simulation has run for long enough that it has produced at least one restart file (res_dt < stop time), you can restart a simulation by changing the run_type to "RESTART_1" as explained in the MFIX documentation (8.3.1. Run Control, MFiX 21.3.2 documentation). So, if your simulation stops and you want to restart it, edit run_type in the .mfx file and then resubmit the SBATCH file ("sbatch sbatchrun.sh").
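Concretely, assuming your original run had run_type = 'NEW', the restart is a one-keyword edit followed by a resubmission:

  # in DES_FB1.mfx: change the run type so the solver continues from the .RES file
  run_type = 'RESTART_1'

then:

  sbatch sbatchrun.sh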
As far as I know, .RES files contain simulation data so yes, they will be unique to your compilation settings and simulation run. I would not copy and paste these in another directory to restart an unrelated simulation.
Since .mfx files are just input files (text files), you can move them between directories and reuse them for additional simulations. For example, if you wanted to build two solvers with different settings, you could copy and paste the .mfx file from one directory to the other. But anything that contains binary data or data specific to your simulation (e.g. the mfixsolver executable, VTK/SP* outputs, .RES files, etc.) I would advise not putting in an unrelated simulation directory. Even if it runs, you could break some links or get odd results if you try to restart a simulation with another simulation's data.
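For example, with two hypothetical project directories run_a and run_b, copying just the input file is safe:

  cp /home/username/run_a/DES_FB1.mfx /home/username/run_b/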
Thank you Julia, I did the restart work as follows.
I want to know whether it is normal that the simulation timestamp in the restart Slurm .out file began at 0.0000s, i.e. whether it represents relative progression only and in reality the run actually continued from the last step of the previous "run".
Thank you in advance.