Successfully submitted HPC task shows "running" status but does not generate intermediate result files

Dear All,

I have successfully submitted an HPC task that shows "running" status but does not generate the intermediate result files such as .vtk, .msh, and .RES. The .LOG file shows that the mesh has been successfully computed. I want to know whether there is any condition under which a task that has failed or hit unexpected errors would still show "running" status instead of stopping.
PIC 1: no related outputs


PIC 2: the .LOG file shows generation of the mesh, and the target folder has been set correctly in the SBATCH script as well.

Thank you in advance.

Okay, I see that the .LOG was generated by an earlier run, which means the new simulation task on the computing cluster did not generate any new .LOG file. How can I fix this problem?

The same project runs normally and generates output files on my local Ubuntu machine, but I cannot see any output files in the target folder on the computing cluster. Have you encountered this situation before? How can I fix it?

Check the slurm-12264515.out file. It may contain info about the failure.

Let me reiterate my previous recommendation: start with much simpler simulations until you are comfortable with the workflow (setup, run, post-processing). This will be much less frustrating and will give you a better idea of what you can actually accomplish.

Thank you Jeff. I tried the simplest project, fluid_bed_tfm_2d.mfx, with the same procedure as above under SLURM, but no output results are generated either. However, this simple example runs fine with the default command “mfixsolver -f fluid_bed_tfm_2d.mfx” and generates the expected outputs. It is very strange. I attach snapshots of my .run file for SBATCH.

The job is submitted successfully and runs to its preset time limit without interruption, but it does not generate any outputs, as shown below.

PIC: running status without output for the simple 2D project.

Thank you in advance.

Hi Ju - You should try to get help from your local computing department, since this seems to be an issue on your cluster. But if you want us to try to help, please attach the slurm.out file. Thanks.

Thank you Charles. I attach the .run and .out files corresponding to the snapshots above. I also include a snapshot of the target project folder, where you can see that no output results were generated up to the end of the preset one-hour run time.


Since .run and .out attachments are not allowed by the forum, I have compressed them into a zip file instead.
test1.zip (666 Bytes)
Thank you in advance.

$ cat slurm-1862763.out 
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 1862763 ON h083 CANCELLED AT 2021-10-29T17:18:09 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 1862763.0 ON h083 CANCELLED AT 2021-10-29T17:18:09 DUE TO TIME LIMIT ***

The messages say that the job was cancelled due to the time limit.
It would be good to try to get help from a local expert. Also, this StackOverflow posting may be relevant: https://stackoverflow.com/questions/34653226/job-unexpectedly-cancelled-due-to-time-limit
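If the run genuinely needs more wall time, that has to be requested in the batch script. Here is only a generic sketch, not your actual .run file; the job name, time, and memory values are placeholders and your site's partition and limits may differ:

#!/bin/bash
#SBATCH --job-name=mfix_run       # placeholder name
#SBATCH --time=24:00:00           # wall-clock limit; raise this if jobs are cancelled DUE TO TIME LIMIT
#SBATCH --ntasks=1
#SBATCH --mem=16G                 # memory per node; adjust to your case

mfixsolver -f fluid_bed_tfm_2d.mfx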

Thank you Charles. I tried “OMP_NUM_THREADS=16 mfixsolver -c -f fluid_tfm_2d.mfx” for the simple 2D TFM simulation task and it works. But when I use the same command to run a more complicated case, it freezes and fails just after computing the interpolation factors in V-momentum cells with a “core dumped” error, as shown below. What caused that, and how can I fix it? Thank you in advance.


“Core dump” generally indicates a serious problem - either a bug in the solver, or particles leaving the domain. You can build a debug-enabled version of the solver to try to find the problem. If you want help, please submit a complete problem report including input files and logs, I’ll take a look at it.
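If you want to try the debug route yourself, something along these lines usually works, although the exact build_mfixsolver options can differ between MFiX versions, and your_case.mfx is just a placeholder:

build_mfixsolver --batch -DCMAKE_BUILD_TYPE=Debug    # rebuild the solver with debug symbols
gdb --args ./mfixsolver -f your_case.mfx             # run it under gdb to get a backtrace at the crash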

– Charles

Thank you Charles. I can successfully run the same project with the command “mfixsolver -c -f chamber_delicate_structure2_clean.mfx” on my local Ubuntu OS in VirtualBox.

But when I switch to the computing cluster, the project freezes and fails at “computing interpolation factors in v-momentum cells” with “Bus error (core dumped)”, using either “OMP_NUM_THREADS=16 mfixsolver -c -f chamber_delicate_structure2_clean.mfx” or “mfixsolver -c -f chamber_delicate_structure2_clean.mfx”.

This is a snapshot of the .LOG file after the failure on the computing cluster.
[screenshot: log (.LOG contents after the failure)]
I also attach the compressed project files as the input.
chamber_developer_bm1.zip (7.2 MB)

Thank you in advance for your solution.

Just to be clear, when running on the cluster, did you build the solver on that system or did you try to use the same solver you built on your laptop? In general it has to be built for the target system.
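If you have not already done so, the usual pattern is to rebuild the solver on the cluster itself, from the project directory. A rough sketch only; the module names and the project path are site-specific placeholders:

module load gcc openmpi           # compiler and MPI modules provided by your site
cd /path/to/project               # the directory containing the .mfx file
build_mfixsolver --batch --dmp    # or --smp, or no flag for a serial solver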

I use “ssh -X username@clustername.epfl.ch” to log in to the computing cluster. I then install the software under a Miniconda folder in my personal directory /home/username on the cluster. After the installation on the cluster, the default mfixsolver already exists, so I loaded the necessary modules and also compiled the --dmp and --smp solvers successfully.

One problem is that the core dump error happens even with the default mfixsolver. Thank you.

Hi Charles, could you please share the SBATCH script you use for SLURM, for my reference? Also, you suggested that the only change needed to adapt the .mfx file to DMP mode is to set the “nodesi=, nodesj=, nodesk=” lines in the .mfx file; is there anything else I should modify in the .mfx for DMP mode? And what should I modify in the .mfx to adapt it to SMP mode? Thank you.

Hi Ju. I’m looking into the core dump problem.
I’m not sure what files you are asking about. There is an example slurm template in the share/mfix/templates/queue_templates directory of the MFiX environment.

For DMP you need a DMP-enabled solver, a domain decomposition, and settings of nodes[ijk] that match the domain decomposition.

For SMP you need an SMP-enabled solver; nodes[ijk] should all be 1, and you control the number of CPU threads with the OMP_NUM_THREADS environment variable, like so:

env OMP_NUM_THREADS=5 ./mfixsolver -s -f foo.mfx
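For DMP, the number of MPI ranks must match the product of the decomposition. For example, with nodesi=2, nodesj=2, nodesk=1 in the .mfx file (2x2x1 = 4 ranks), the launch looks roughly like this; this is only a sketch reusing the foo.mfx placeholder, and your MPI launcher may differ:

mpirun -np 4 ./mfixsolver -f foo.mfx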

Hi Ju. It looks like this job is using a lot of RAM until it runs out and is killed by the OS.

At the beginning of the run, the solver prints:

 Memory required:    9.00 Mb

but this is far from the truth! (We should probably disable this message if we can’t print an accurate estimate).

In fact the job ran until it used up all the available RAM on my system - 16GB. Here’s a snapshot of the memory used by the solver, as reported by ps -o rss=, in units of MB. After 100s the solver was killed.

[plot: ram (solver resident memory in MB vs. time)]
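For reference, a trace like this can be collected with a small shell loop along these lines; the pgrep pattern and one-second sampling interval are just examples:

pid=$(pgrep -n mfixsolver)                                   # PID of the running solver
while kill -0 "$pid" 2>/dev/null; do                         # loop until the solver exits
    ps -o rss= -p "$pid" | awk '{printf "%d\n", $1/1024}'    # resident set size in MB
    sleep 1
done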

This is a simple serial run, but it won’t get any better using SMP. With DMP the job will be split over multiple nodes, so the memory usage per node is lower, but I suspect you are probably running out of RAM on the computing cluster too.

– Charles

Thank you Charles. But as I said, the project runs and generates the mesh with the default solver (“mfixsolver -c -f”) without problems on my local Ubuntu in VirtualBox. So why would it run out of RAM on the computing cluster? Can the project also run successfully on your local Ubuntu with the default solver? Thank you.

No, running the solver on my laptop results in an out-of-memory condition, as I indicated above. I have 16GB on my machine; how much do you have?

I increased the amount of swap space on my system and now I can run the job. This is not ideal, because swapping makes things even slower, but I can get the job to run with 16GB RAM + 16GB swap.
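In case it is useful, extra swap can be added on a typical Linux system roughly like this (requires root; the 16G size is just an example):

sudo fallocate -l 16G /swapfile    # create a 16 GB file to use as swap
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile              # enable it for the current session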

The total memory usage peaks out at about 14GB and then stabilizes, even coming down a bit.

[plot: ram2 (memory usage with increased swap)]

But the simulation is running extremely slowly - after 10 minutes of real time, the simulation time is still at t=0.

How long did you run this simulation? How much RAM did you have available?

Thank you Charles. The project runs directly on Ubuntu on my laptop in VirtualBox with the command “mfixsolver -c -f xx.mfx”; the machine has 17 GB of RAM and about 8 GB of swap, as shown below.
[screenshot: ram1 (laptop RAM and swap)]

Did you run to completion (tstop=0.5)? If so, how long did that take? I observed that the simulation time never got past t=0.