Hello,
I’m running MFiX-21.1.4 on a cluster, and my simulations are occasionally halted by a “Floating point exception” error. I resume the simulation either in serial or by changing the domain decomposition while keeping the total number of nodes constant. However, the same problem comes back at a later time. I suspect the problem lies with MPI, because running the case in serial never produces this error.
The cluster runs Linux CentOS 7, and I use openmpi/4.1.7.
3dtfm_2022-05-10T203519.341493.zip (1.1 MB)
Does it fail without your modified drag_gs.f file?
I haven’t tried it yet, but my simulations require the modified file. Are there any other good solutions? Thank you.
Are you getting a traceback that shows where the FPE (floating-point exception) is happening?
If not, is there a file named core (possibly with a numeric suffix) in the run directory? If it is present, you can use it for debugging. Do not upload the file here: it will be large, and debugging requires access to the same environment where the core file was produced.
If the file is present, follow the procedure below to get more information.
This is a case where I deliberately introduced a zero-division error in usr_rates.f.
Use the file command to find the name of the actual program that ran MFiX; there are several ‘wrappers’ involved, and the true program name may vary on different platforms. Whatever program the output of file names, use gdb on that executable.
$ cd /tmp/silane_pyrolysis_tfm_2d # project directory
$ ls -l core
-rw------- 1 cgw cgw 1333723136 May 10 09:32 core
$ file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from '/usr/lib/python-exec/python3.9/python -m mfixgui.pymfix -s -f /tmp/silane_pyrol', real uid: 103, effective uid: 103, real gid: 1000, effective gid: 1000, execfn: '/usr/lib/python-exec/python3.9/python', platform: 'x86_64
$ gdb /usr/lib/python-exec/python3.9/python core
[lots of output]
Program terminated with signal SIGFPE, Arithmetic exception.
#0 0x00007f68491241cb in usr_rates (ijk=158, rates=...)
at /tmp/silane_pyrolysis_tfm_2d/usr_rates.f:101
101 c_SiH4 = RO_g(IJK) * X_g(IJK,SiH4) / (0*Mw_g(SiH4))
(gdb) where
[full stack trace]
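If no core file shows up after a crash, core dumps may simply be disabled for the session. Below is a minimal sketch for checking and raising the limit, assuming a Linux system and a POSIX-style shell. Under a batch scheduler these lines would presumably have to go into the job script before the solver is launched, which is an assumption about your setup.

```shell
# Current core-file size limit for this shell; 0 means no core file is written.
core_limit=$(ulimit -c)
echo "core file size limit: $core_limit"

# Try to lift the limit for this shell and the processes it starts.
# This can fail if the hard limit set by the administrator is lower.
ulimit -c unlimited 2>/dev/null || echo "could not raise the core limit"

# On Linux, this pattern decides the name and location of core files.
# A leading '|' means cores are piped to a handler instead of the run directory.
if [ -r /proc/sys/kernel/core_pattern ]; then
    cat /proc/sys/kernel/core_pattern
fi
```

If the limit was 0, rerun the case from the same shell after raising it, then look for the core file again.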
Thank you. I’ll try your approach. However, I do not find a file named core (possibly with a numeric suffix) in the run directory, and I do not get a traceback that shows where the FPE is happening. This is the example that just stopped.
This did not happen when I ran with SMP; it happens only with DMP. But why does the example run slower with SMP than with DMP?
I still can’t figure it out. That’s where the example comes from. What should I do? Thank you.
Yes, it does fail even without the modified files. I have this problem when running files in parallel.
Do you have any other suggestions? I’ve tried so many things, but nothing works. Thanks.
I could not reproduce the issue. It eventually reached DT_MIN but didn’t trigger a floating point exception.
Maybe you can try with MFiX 22.1 to see if you have the same issue.
Please copy and paste the whole stack trace (as text); this screenshot is too small to see the whole stack.
The code at kintheory_mod.f:288 looks like this:
RE = D_p(IJK,M)*RVEL*ROP_G(IJK)/(MU_G(IJK) + SMALL_NUMBER)
IF(RE .LE. 1000.d0)THEN
C_d = (24.d0/(Re+SMALL_NUMBER)) * &
(ONE + 0.15d0 * Re**0.687D0)
It would be helpful to know what the value of Re is here.
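One way to get at the value of Re without rerunning is to ask gdb for the local variables of the crashing frame. This is a rough sketch, assuming a core file exists and the solver was built with debug symbols. `MFIX_EXE` and `MFIX_CORE` are hypothetical placeholders for the executable reported by `file core` and for the core file itself, and frame 0 is only a guess at the frame inside kintheory_mod.f.

```shell
# Hypothetical placeholders: set these to the executable named by `file core`
# and to the core file in the run directory.
exe="${MFIX_EXE:-./mfixsolver}"
core="${MFIX_CORE:-core}"

if command -v gdb >/dev/null 2>&1 && [ -f "$exe" ] && [ -f "$core" ]; then
    # Non-interactive gdb: print the stack, select the innermost frame,
    # then list its local variables (Re should be among them if frame 0
    # is the kintheory frame; otherwise pick the frame number from the stack).
    gdb -batch -ex "bt" -ex "frame 0" -ex "info locals" "$exe" "$core"
    inspected=yes
else
    echo "gdb, executable, or core file not found; nothing to inspect"
    inspected=no
fi
```

If the innermost frame turns out to be a library routine rather than kintheory_mod.f, change `frame 0` to the matching frame number shown by `bt`.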
I’ll try to take a look at this today, if I can reproduce the crash. But we may need you to help debug this. Also, please try with the latest code as Jeff suggested.
– Charles