Insufficient info in run log

Hello Team, I have a TFM simulation that crashes after ~0.5 s, and the only information I have is:

SLURM Job_id=6523435 Name=V_10D Ended, Run time 02:30:48, FAILED, ExitCode 136

1. A core file, core.63957, of around 2 GB is written out. Can we get any more information from this file?

2. The Slurm log gives the following information:

/cm/local/apps/slurm/var/spool/job6523435/slurm_script: line 9: activate: No such file or directory

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
** Something went wrong while running addr2line. **
** Falling back to a simpler backtrace scheme. **
#26 0x2aaaff8462d2
#27 0x2aaaff846a0e
#28 0x2aaaab1f327f
#29 0x2aaab33a1a09
#30 0x2aaab35855ce
#31 0x2aaab35866ee
#32 0x2aaab341727e
#33 0x2aaab3411fa8
#34 0x2aaab31d81e7
#35 0x2aaab33a2b1b
#36 0x2aaab31da854
#37 0x2aaab31cdb66
#38 0x5555556c0cca
#39 0x55555572639d
#40 0x5555556b8e7a
#41 0x55555572173f
#42 0x5555556b8e7a
#43 0x55555572173f
#44 0x55555566985a
#45 0x5555556884d2
#46 0x55555567affd
#47 0x555555779f76
#48 0x555555734817
#49 0x2aaaaafa8dd4
#50 0x2aaaab2bbb3c
#51 0xffffffffffffffff

3. I looked on the internet, and exit code 136 appears to indicate a floating-point error. I halved the cell size and the case still fails at the same point, ~0.5 s. How can we debug this in MFiX?
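
The reading of exit code 136 can be checked with a short sketch. It assumes Slurm is reporting the batch script's shell exit status, where a process killed by signal N shows up as 128 + N; the helper name decode_exit_code is made up for illustration.

```python
import signal

def decode_exit_code(code: int) -> str:
    """Interpret an exit status using the shell's 128 + N signal convention."""
    if code > 128:
        try:
            # 136 - 128 = 8, which is SIGFPE on Linux
            return signal.Signals(code - 128).name
        except ValueError:
            return "unknown signal"
    return "normal exit status"

print(decode_exit_code(136))
```

On a Linux machine this names the same signal that appears in the Slurm log, SIGFPE.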

Thank you.

@jagan1mohan

Hi Jagan

  1. It would be helpful to know more about your setup: what version of MFiX you are running, what kind of system you are running on, what compiler you are using, how you are compiling the solver, whether you are using the GUI or batch mode, etc. It looks like the solver is not getting compiled with the correct settings for debugging.

If using the GUI you can select “Submit bug report” from the main menu. Otherwise, attach your project files here.

  2. If you have a core file:

file core.NNNNN will tell you the name of the executable that created the core file. When running from the GUI, the executable is actually Python, since that’s the top-level process that runs everything in GUI mode. In batch mode, the executable will be called mfixsolver. The file core command will tell you the full path to the executable.

Then do:

gdb <path_to_exe> core.NNNNN
to invoke the GDB debugger. You can then use commands such as where to get a backtrace, info locals to see local variables, p to print values, etc.
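
On a cluster it can be easier to run GDB non-interactively and save the output. The sketch below only builds the command line, using GDB's -batch and -ex options; the function name gdb_batch_command and the path /path/to/mfixsolver are placeholders, so substitute the executable that file reports for your core file.

```python
import shlex

def gdb_batch_command(exe_path, core_path, commands=("where", "info locals")):
    """Build a gdb invocation that runs the given commands on a core file and exits."""
    argv = ["gdb", "-batch"]
    for cmd in commands:
        argv += ["-ex", cmd]  # each -ex runs one GDB command
    argv += [exe_path, core_path]
    return argv

# Placeholder executable path; the core file name is the one from the original post.
argv = gdb_batch_command("/path/to/mfixsolver", "core.63957")
print(shlex.join(argv))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run to capture the backtrace in a file. The backtrace will only show source lines if the solver was built with debug symbols.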

  3. Where is slurm_script coming from? It looks like it is trying to run a command called activate, which is not found in the default $PATH. Not sure if this is a fatal error though.

  4. This is definitely a floating-point exception (SIGFPE). I don’t know why the stack trace doesn’t show more useful information. If you send your project file, we’ll run it here and see if we can get a more useful stack trace.

– Charles

Hello Charles @cgw, thank you for your quick reply. I’m attaching the complete setup folder.

→ I’m compiling the setup on the cluster using the GUI, by clicking the spanner button. I’ve attached an image of the settings for your reference. These settings are required; any changes lead to compilation errors on the cluster.

By cluster, I mean that the Slurm job scheduler is used, with commands such as sbatch and scancel.

→ The Slurm log is produced when we submit a batch job and it enters the running state. There is a cmd file in the folder that is used to submit the job.

Could you please compile the solver with the attached user-defined file and run it on 24 CPUs? Thank you.
S_38_25_20D.zip (1.6 MB)

This job runs for me without error, both serially and with mpifort n=24.

Can you try running the job on your local machine, with DMP enabled? Error handling is more robust in the single-CPU mode.

– Charles

Hello Charles @cgw,

1. How long did you run the case? I encounter a floating-point exception after 0.5 s.

2. I’ve compiled on my local machine; please see the attached image. Is this correct?

3. Also, I sometimes observe that STL files are not read correctly. Could you check the following numbers in the run log?
============================================================================
MESH STATISTICS:
============================================================================
NUMBER OF CELLS = 41250
NUMBER OF STANDARD CELLS = 12936 ( 31.36 % of Total)
NUMBER OF CUT CELLS = 10032 ( 24.32 % of Total)
NUMBER OF FLUID CELLS = 22968 ( 55.68 % of Total)
NUMBER OF BLOCKED CELLS = 18282 ( 44.32 % of Total)
============================================================================
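
As a quick arithmetic check, the numbers above can be tested for internal consistency. This sketch assumes that fluid cells are the sum of standard and cut cells, and that fluid plus blocked cells make up the total. It only confirms that the log adds up; it does not show whether the STL geometry was read as intended.

```python
# Values copied from the MESH STATISTICS block in the run log.
total = 41250
standard = 12936
cut = 10032
fluid = 22968
blocked = 18282

def percent_of_total(n: int) -> float:
    """Percentage of all cells, rounded to two decimals as in the log."""
    return round(100.0 * n / total, 2)

# Assumed relations between the categories (not stated in the log itself).
fluid_matches = (standard + cut == fluid)
total_matches = (fluid + blocked == total)

print(fluid_matches, total_matches)
for name, n in [("standard", standard), ("cut", cut),
                ("fluid", fluid), ("blocked", blocked)]:
    print(name, percent_of_total(n))
```

Comparing the same block between a run that looks right and one that looks wrong would show whether the cut-cell or blocked-cell counts change between runs.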

Thank you.

Does it also fail if you run locally in serial mode (disable DMP and rebuild)?

– Charles