MPI run automatic stop

Xiaod112 · May 10, 2022, 12:47pm

Hello,
I’m running MFiX-21.1.4 on a cluster, and my simulations are occasionally halted due to the “Floating point exception” error. I resume the simulation in serial or by altering the domain decomposition combination while keeping the total number of nodes constant. However, I am confronted with the same problem at a later time. I think there should be a problem with the MPI because executing the case in serial never results in this error.
The cluster runs Linux CentOS 7, and I use openmpi/4.1.7.
3dtfm_2022-05-10T203519.341493.zip (1.1 MB)
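
For readers who want to try other decompositions systematically, here is a minimal Python sketch that lists every split of a fixed rank count over the three directions. It assumes the decomposition is set through the NODESI, NODESJ and NODESK keywords; the total of 8 ranks is only an example and the function name is made up for illustration.

```python
def decompositions(total_ranks):
    """Return every (nodesi, nodesj, nodesk) triple whose product is total_ranks."""
    triples = []
    for i in range(1, total_ranks + 1):
        if total_ranks % i:
            continue  # i does not divide the rank count
        rest = total_ranks // i
        for j in range(1, rest + 1):
            if rest % j == 0:
                triples.append((i, j, rest // j))
    return triples


# Hypothetical example: 8 MPI ranks in total
for nodesi, nodesj, nodesk in decompositions(8):
    print(f"NODESI={nodesi} NODESJ={nodesj} NODESK={nodesk}")
```

If the crash follows a particular split, that may hint at a problem near a subdomain boundary. If it appears with every split, the decomposition itself is less likely to be the cause.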

jeff.dietiker · May 10, 2022, 12:57pm

Does it fail without your modified drag_gs.f file?

Xiaod112 · May 10, 2022, 1:06pm

I haven’t tried it yet, but in my simulations this file will have to be modified. Are there any other good solutions? Thank you for your

cgw · May 10, 2022, 2:40pm

Are you getting a traceback that shows where the FPE (floating-point exception) is happening?

If not, is there a file named core (possibly with numeric suffix) in the run directory? If it is present, you can use it for debugging. Do not upload the file here because it will be large and debugging requires access to the same environment where the core file was produced.

If the file is present, follow the procedure below to get more information. The example is a case where I deliberately introduced a division by zero in usr_rates.f.
Use the file command to find the name of the actual program that ran MFiX. Several ‘wrappers’ are involved, and the true program name may vary across platforms. Whatever executable the output of file reports, run gdb on it.

$ cd /tmp/silane_pyrolysis_tfm_2d # project directory
$ ls -l core
-rw------- 1 cgw cgw 1333723136 May 10 09:32 core

$ file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/lib/python-exec/python3.9/python -m mfixgui.pymfix -s -f /tmp/silane_pyrol', real uid: 103, effective uid: 103, real gid: 1000, effective gid: 1000, execfn: '/usr/lib/python-exec/python3.9/python', platform: 'x86_64

$ gdb /usr/lib/python-exec/python3.9/python core 
[lots of output]

Program terminated with signal SIGFPE, Arithmetic exception.
#0  0x00007f68491241cb in usr_rates (ijk=158, rates=...)
    at /tmp/silane_pyrolysis_tfm_2d/usr_rates.f:101
101	         c_SiH4 = RO_g(IJK) * X_g(IJK,SiH4) / (0*Mw_g(SiH4))

(gdb) where
[full stack trace]
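
When no core file appears in the run directory after a crash, one thing worth checking is the core-file size limit of the environment that launched the solver. The sketch below is only a quick check using Python's standard resource module (Unix only). It reports the limit of the Python process itself, so it has to be run from the same shell or batch job that starts MFiX. The function name is made up for illustration.

```python
import resource


def core_dumps_enabled():
    """Return True if the soft core-file size limit is non-zero."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    # A soft limit of 0 means the kernel will not write a core file
    return soft != 0


soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("soft limit:", soft, "hard limit:", hard)
print("core dumps enabled:", core_dumps_enabled())
```

This is the same information that `ulimit -c` prints in the shell. If the soft limit is 0, `ulimit -c unlimited` in the job script before launching the solver may be enough to get a core file, provided the hard limit allows it.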

Xiaod112 · May 11, 2022, 1:09pm

Thank you, I’ll try your approach. However, I do not find a file named core (possibly with a numeric suffix) in the run directory, and I do not get a traceback that shows where the FPE is happening. This is the run that just stopped:
[Image: WeChat screenshot 20220511213547]

Xiaod112 · May 11, 2022, 2:32pm

The error does not occur when I run with SMP; it occurs only with DMP. But why does the case run slower with SMP than with DMP?
I still can’t figure it out. The screenshot below shows where the example comes from. What should I do? Thank you.
[Image: Screenshot 2022-05-12 135440]

Xiaod112 · May 12, 2022, 6:26am

Yes, it does fail even without the modified files. I have this problem when running files in parallel.

Xiaod112 · May 13, 2022, 9:59am

Do you have any other suggestions? I’ve tried many things, but nothing works. Thanks.

jeff.dietiker · May 13, 2022, 11:44am

I could not reproduce the issue. The run eventually reached DT_MIN but didn’t trigger a floating point exception.
Maybe you can try MFiX 22.1 to see whether you have the same issue.

cgw · May 13, 2022, 12:14pm

Please copy and paste the whole stack trace as text; this screenshot is too small to show the whole stack.

The code at kintheory_mod.f:288 looks like this:

      RE = D_p(IJK,M)*RVEL*ROP_G(IJK)/(MU_G(IJK) + SMALL_NUMBER)
      IF(RE .LE. 1000.d0)THEN
         C_d = (24.d0/(Re+SMALL_NUMBER)) * &
            (ONE + 0.15d0 * Re**0.687D0)

It would be helpful to know the value of Re at this point.
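
To illustrate why the value matters, here is a small Python sketch that mirrors the two expressions quoted above. It is a guess at one possible failure mode, not a diagnosis. If Re became negative (for example through a negative ROP_G in a diverging step), raising it to the power 0.687 has no real result, and a Fortran build with floating-point trapping enabled would be expected to stop there. The SMALL_NUMBER value and all input numbers are assumptions chosen for illustration, and the function names are made up.

```python
import math

SMALL_NUMBER = 1.0e-15  # assumed value, chosen only for this illustration


def reynolds(d_p, rvel, rop_g, mu_g):
    """Mirror of the RE expression quoted above."""
    return d_p * rvel * rop_g / (mu_g + SMALL_NUMBER)


def drag_coefficient(re):
    """Mirror of the C_d expression in the Re <= 1000 branch quoted above."""
    # math.pow raises ValueError for a negative base with a fractional exponent
    return (24.0 / (re + SMALL_NUMBER)) * (1.0 + 0.15 * math.pow(re, 0.687))


def classify_re(re):
    """Label a Reynolds number as 'ok', 'negative', or 'non-finite'."""
    if not math.isfinite(re):
        return "non-finite"
    if re < 0.0:
        return "negative"
    return "ok"


# Hypothetical cell values, not taken from the failing case
good = reynolds(d_p=1.0e-4, rvel=0.5, rop_g=1.2, mu_g=1.8e-5)
bad = reynolds(d_p=1.0e-4, rvel=0.5, rop_g=-1.2, mu_g=1.8e-5)
for re in (good, bad, float("nan")):
    print(re, classify_re(re))
```

Calling `drag_coefficient(bad)` raises `ValueError` in Python, which plays the role of the invalid-operation exception here. In the solver, printing Re (together with D_p, RVEL, ROP_G and MU_G) for the failing cell just before line 288 would show which input goes wrong.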

I’ll try to take a look at this today, if I can reproduce the crash. But we may need you to help debug this. Also, please try with the latest code as Jeff suggested.

– Charles