MPI run stops automatically

Hello,
I’m running MFiX-21.1.4 on a cluster, and my simulations are occasionally halted by a “Floating point exception” error. I can resume the simulation in serial, or by changing the domain decomposition while keeping the total number of nodes constant, but I run into the same problem again later. I think the problem is related to MPI, because running the case in serial never produces this error.
The cluster runs Linux (CentOS 7), and I use openmpi/4.1.7.
3dtfm_2022-05-10T203519.341493.zip (1.1 MB)

Does it fail without your modified drag_gs.f file?

I haven’t tried it yet, but in my simulations this file has to be modified. Are there any other good solutions? Thank you for your help.

Are you getting a traceback that shows where the FPE (floating-point exception) is happening?

If not, is there a file named core (possibly with numeric suffix) in the run directory? If it is present, you can use it for debugging. Do not upload the file here because it will be large and debugging requires access to the same environment where the core file was produced.

If the file is present, follow the procedure below to get more information.
Here is a case where I deliberately introduced a zero-division error in usr_rates.f.
Use the file command to find the name of the actual program that ran MFiX; there are several ‘wrappers’ involved, and the true program name may vary on different platforms. Whatever the output of file is, run gdb on that executable.

$ cd /tmp/silane_pyrolysis_tfm_2d # project directory
$ ls -l core
-rw------- 1 cgw cgw 1333723136 May 10 09:32 core

$ file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/lib/python-exec/python3.9/python -m mfixgui.pymfix -s -f /tmp/silane_pyrol', real uid: 103, effective uid: 103, real gid: 1000, effective gid: 1000, execfn: '/usr/lib/python-exec/python3.9/python', platform: 'x86_64

$ gdb /usr/lib/python-exec/python3.9/python core 
[lots of output]

Program terminated with signal SIGFPE, Arithmetic exception.
#0  0x00007f68491241cb in usr_rates (ijk=158, rates=...)
    at /tmp/silane_pyrolysis_tfm_2d/usr_rates.f:101
101	         c_SiH4 = RO_g(IJK) * X_g(IJK,SiH4) / (0*Mw_g(SiH4))

(gdb) where
[full stack trace]
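
If no core file shows up at all, core dumps may simply be disabled by the shell's resource limit. Below is a minimal sketch of how to check and raise that limit, assuming a bash shell on the node where the solver runs; the mpirun/mfixsolver line is only a placeholder for whatever command you normally use:

$ ulimit -c                 # current core-file size limit; 0 means no core files are written
$ ulimit -c unlimited       # allow core files of any size in this shell
$ mpirun -np 8 ./mfixsolver -f my_case.mfx   # placeholder for your usual DMP run command

Another option, assuming the solver is built with gfortran, is to rebuild it with debugging flags such as -g -fbacktrace -ffpe-trap=invalid,zero,overflow so that the first floating-point exception prints a backtrace directly; check the MFiX build documentation for how to pass compiler flags on your system.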

Thank you, I’ll try your way. But I can’t find a file named core (or with a numeric suffix) in the run directory, and I don’t get a traceback that shows where the FPE is happening. This is the case that just stopped:
(screenshot attachment: 微信截图_20220511213547)

This did not happen when I ran with SMP; it happens only when running with DMP. But why does the case run slower with SMP than with DMP?
I still can’t figure it out. This is where the case stops (screenshot below). What should I do? Thank you.
(screenshot attachment: 屏幕截图 2022-05-12 135440)

Yes, it does fail even without the modified files. I have this problem whenever I run the case in parallel.

Do you have any other methods? I’ve tried so many things, but nothing works. Thanks.

I could not reproduce the issue. It eventually reached DT_MIN but didn’t trigger a floating point exception.
Maybe you can try with MFiX 22.1 to see if you have the same issue.

Please copy and paste the whole stack trace (as text); this screenshot is too small to show the whole stack.
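
If you do manage to get a core file, one non-interactive way to dump the whole backtrace to a text file (a sketch assuming the same executable and core-file names as in the example above) is:

$ gdb --batch -ex "bt" /usr/lib/python-exec/python3.9/python core > backtrace.txt
$ cat backtrace.txt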

The code at kintheory_mod.f:288 looks like this:

      RE = D_p(IJK,M)*RVEL*ROP_G(IJK)/(MU_G(IJK) + SMALL_NUMBER)
      IF(RE .LE. 1000.d0)THEN
         C_d = (24.d0/(Re+SMALL_NUMBER)) * &
            (ONE + 0.15d0 * Re**0.687D0)

It would be helpful to know what the value of Re is here.
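
If you can edit the source, one way to capture that value is a temporary debug print right after Re is computed. This is only a sketch (it is not part of the MFiX code) and assumes the variable names shown in the snippet above:

      RE = D_p(IJK,M)*RVEL*ROP_G(IJK)/(MU_G(IJK) + SMALL_NUMBER)
! Temporary debug check (sketch): flag a NaN or an implausibly large Re
! before it is used in the drag correlation below.
      IF (RE /= RE .OR. RE > 1.0D12) THEN
         WRITE(*,*) 'kintheory debug: IJK=', IJK, ' Re=', RE, &
            ' ROP_g=', ROP_G(IJK), ' MU_g=', MU_G(IJK), ' RVEL=', RVEL
      ENDIF

With a debug build you could also inspect Re from gdb at that line, but a print statement is usually easier to use under DMP.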

I’ll try to take a look at this today, if I can reproduce the crash. But we may need you to help debug this. Also, please try with the latest code as Jeff suggested.

– Charles