Hello,
I’m running MFiX-21.1.4 on a cluster, and my simulations are occasionally halted by a “Floating point exception” error. I resume the simulation either in serial or with a different domain decomposition combination (keeping the total number of nodes constant), but I run into the same problem again later. I suspect the problem lies with MPI, because executing the case in serial never results in this error.
On this cluster, I use gcc/9.3.0 and openmpi/4.0.3.
I should mention that I’m having the same issue with various versions of the code.
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0  0x2b6803773c5d in ???
#1  0x2b6803772e95 in ???
#2  0x2b6803bfa97f in ???
#3  0x7363bb in __calc_e_mod_MOD_calc_e_n
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/calc_e.f:156
#4  0x5faef1 in v_m_star
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/solve_vel_star.f:447
#5  0x5ff32c in __solve_vel_star_mod_MOD_solve_vel_star
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/solve_vel_star.f:136
#6  0x52d645 in _iterate_MOD_do_iteration
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/iterate.f:255
#7  0x46b735 in run_fluid
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:188
#8  0x46b735 in run_mfix
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:142
#9  0x46bf8e in mfix
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:298
#10 0x402950 in main
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:269
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 56 with PID 141519 on node cdr2174 exited on signal 8 (Floating point exception).
Let’s take a look at the piece of code in question (from calc_e.f)
149 !!$omp parallel do private(IJK)
150       DO IJK = ijkstart3, ijkend3
151          IF (SIP_AT_N(IJK) .OR. MFLOW_AT_N(IJK)) THEN
152             E_N(IJK) = ZERO
153          ELSE
154             IF ((-A_M(IJK,0,MCP)) /= ZERO) THEN
155 ! calculating the correction coefficient
156                E_N(IJK) = AXZ(IJK)/(-A_M(IJK,0,MCP))
157             ELSE
158                E_N(IJK) = LARGE_NUMBER
159             ENDIF
160          ENDIF
161       ENDDO
162
Line 156 is where the division happens; I suspect the denominator is 0. However, there is a check at line 154 which should have prevented this (it’s slightly odd that the author tests -A_M for non-zero instead of just A_M, but that’s not the problem).
I wonder if there is some kind of race condition between different processors - does another CPU change the value of A_M after the check but before the division? Also odd is the !!$omp parallel at line 149: OpenMP directives start with the !$omp sentinel, and the second ! turns the whole line into an ordinary comment, so the directive has no effect here. I’m not sure why it was added. You could try changing line 149 to start with !$omp instead of !!$omp, though I’m not sure it will make a difference.
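To show concretely what the extra ! does, here is a minimal standalone sketch (not MFiX code - the array and loops are just placeholders) that you can compile with and without gfortran -fopenmp:

! Minimal standalone sketch (not MFiX code): "!$omp" is an OpenMP sentinel
! that the compiler honors when OpenMP is enabled (gfortran -fopenmp);
! adding a second "!" turns the whole line into an ordinary comment,
! so that loop always runs serially.
program omp_sentinel_demo
   implicit none
   integer, parameter :: n = 8
   real :: a(n)
   integer :: i

   ! Active directive: with -fopenmp the iterations may be split across
   ! threads; without -fopenmp the line is simply ignored.
!$omp parallel do private(i)
   do i = 1, n
      a(i) = real(i)**2
   end do
!$omp end parallel do

   ! Disabled directive: the extra "!" makes this a plain comment in all
   ! cases, so this loop always runs on a single thread.
!!$omp parallel do private(i)
   do i = 1, n
      a(i) = a(i) + 1.0
   end do

   print *, 'a =', a
end program omp_sentinel_demo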
Hi Charles,
I just changed the domain decomposition combination and resumed the simulation. Now I get an error pointing to different lines of the subroutines:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0  0x2ad0664dac5d in ???
#1  0x2ad0664d9e95 in ???
#2  0x2ad06696197f in ???
#3  0x7366bb in __calc_e_mod_MOD_calc_e_e
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/calc_e.f:77
#4  0x5fe736 in u_m_star
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/solve_vel_star.f:291
#5  0x5fe736 in __solve_vel_star_mod_MOD_solve_vel_star
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/solve_vel_star.f:134
#6  0x52d645 in _iterate_MOD_do_iteration
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/iterate.f:255
#7  0x46b735 in run_fluid
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:188
#8  0x46b735 in run_mfix
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:142
#9  0x46bf8e in mfix
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:298
#10 0x402950 in main
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:269
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 50 with PID 227662 on node cdr2195 exited on signal 8 (Floating point exception).
Hi @jeff.dietiker ,
Attached are the setup files, along with the two subroutines I have modified. FYI, I resumed the simulation on a single CPU and it has not been interrupted yet. I am not sure whether this is relevant to the issue or not.
Thanks,
Mohsen

Attachments: calc_grbdry.f (79.9 KB), source_granular_energy.f (31.1 KB), sri.mfx (10.5 KB)
Hi. Did you read my reply above? This looks like a race condition to me.
When you get a stack trace, you can examine the code that caused the problem.
(“Use the source, Luke”)
After you changed the domain decomposition, the error moved from line 156 to 77 of calc_e.f but the code there is almost identical:
IF (A_M(IJK,0,MCP) /= ZERO) THEN
! calculating the correction coefficient
E_E(IJK) = AYZ(IJK)/(-A_M(IJK,0,MCP))
ELSE
We shouldn’t be getting a zero-division at 77 because of the check at 75. But perhaps another process is changing the value of A_M(IJK,0,MCP) between the check and the division.
Also note the commented-out !!$omp parallel directive right above that section.
– Charles
FWIW, I’ve been running this case on 8 cores on my local machine, without the modifications. It has run for 12 hours of wall-clock time and reached a simulation time of 1.2 s without any warnings.
Hi @mohsenclick . We’ve just been discussing this issue.
Neither Jeff nor I have been able to reproduce the crash, but it’s possible we didn’t run for long enough. Some of these problems with long-running DMP jobs are hard to reproduce.
I was mistaken about the nature of the !!$omp parallel do private(IJK) directives - these govern SMP behavior, not DMP. So I was barking up the wrong tree there. Jeff also thinks it’s extremely unlikely to be a race condition, since each node has a private range of IJK indices and its own copy of the A_M array.
So our current guess is that what’s happening at line 156 (or 77) is not a divide-by-zero but an overflow - if the denominator is small enough (but nonzero), the result of the division is too large to represent, and that also raises a floating-point exception.
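To illustrate the overflow idea, here is a toy example (not MFiX code; the -ffpe-trap flag is just one way to reproduce the behavior - your executable evidently traps floating-point exceptions anyway, since you see SIGFPE rather than silent Inf values):

! Toy demonstration (not MFiX code) of how a tiny but nonzero denominator
! raises SIGFPE through IEEE overflow rather than divide-by-zero.
! Compile with, e.g., "gfortran -ffpe-trap=overflow overflow_demo.f90".
program overflow_demo
   implicit none
   double precision :: numerator, denominator, quotient

   numerator   = 1.0d0
   denominator = 1.0d-300            ! nonzero, so a "/= ZERO" test passes

   quotient = numerator / denominator
   print *, 'still representable:', quotient      ! 1d300 is below HUGE(1d0)

   denominator = denominator * 1.0d-20             ! now ~1d-320, still nonzero
   quotient = numerator / denominator               ! exceeds HUGE(1d0) ~ 1.8d308
   print *, 'not reached when overflow is trapped:', quotient
end program overflow_demo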
We’d like to change the tests from “denominator nonzero” to “denominator magnitude greater than a small epsilon”. Attached is a copy of calc_e.f modified in this way:
- IF (A_M(IJK,0,MCP) /= ZERO) THEN
+ IF (DABS(A_M(IJK,0,MCP)) .ge. SMALL_NUMBER) THEN
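For context, with that change the loop from your first backtrace would read roughly as below (a sketch reconstructed from the snippet quoted earlier, not a verbatim copy of the attached file; ZERO, SMALL_NUMBER and LARGE_NUMBER are, as far as I recall, parameters from the param1 module):

! Sketch of the patched loop (reconstructed, not copied from the attached
! calc_e.f): the "/= ZERO" test becomes a magnitude test, so very small
! denominators take the LARGE_NUMBER branch instead of overflowing.
      DO IJK = ijkstart3, ijkend3
         IF (SIP_AT_N(IJK) .OR. MFLOW_AT_N(IJK)) THEN
            E_N(IJK) = ZERO
         ELSE
            IF (DABS(A_M(IJK,0,MCP)) .GE. SMALL_NUMBER) THEN
! calculating the correction coefficient
               E_N(IJK) = AXZ(IJK)/(-A_M(IJK,0,MCP))
            ELSE
               E_N(IJK) = LARGE_NUMBER
            ENDIF
         ENDIF
      ENDDO

If SMALL_NUMBER is on the order of 1d-15 (as I recall), then with AXZ of order one the quotient stays around 1d15, far below HUGE(1d0), so the overflow path is closed off.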
Can you replace calc_e.f with the attached file and see if you can provoke the error? This would be very helpful to us.
Hi Charles and Jeff,
I will run my case with the modified calc_e.f and let you know. I should mention that my simulations are interrupted frequently by this error when I run them on the ComputeCanada clusters. Also, running the case on a single CPU reduces the frequency of the interruptions.
Thanks,
Mohsen
Of course, the simulation proceeds a lot faster on a cluster than on a single core, so the problem will show up sooner there. From your logs, it looks like the crash happened at about t=20s, which takes a long time to reach on a single core. Also, it’s possible that this is a DMP-only error…
Hi Dong. Please don’t post on topics that have already been marked “solved” - starting a new topic instead makes it easier to keep track of issues on the forum.
There is not enough information in this screenshot to say what’s going on. Please create a new topic, and attach the complete log files so we can help you.