Hello,
I’m running MFiX-21.1.4 on a cluster, and my simulations are occasionally halted by a “Floating point exception” error. I resume the simulation either in serial or with a different domain decomposition combination (keeping the total number of nodes constant), but I run into the same problem again later. I suspect the problem lies with MPI, because executing the case in serial never results in this error.
On this cluster, I use gcc/9.3.0 and openmpi/4.0.3.
I should mention that I’m having the same issue with various versions of the code.
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0  0x2b6803773c5d in ???
#1  0x2b6803772e95 in ???
#2  0x2b6803bfa97f in ???
#3  0x7363bb in __calc_e_mod_MOD_calc_e_n
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/calc_e.f:156
#4  0x5faef1 in v_m_star
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/solve_vel_star.f:447
#5  0x5ff32c in __solve_vel_star_mod_MOD_solve_vel_star
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/solve_vel_star.f:136
#6  0x52d645 in _iterate_MOD_do_iteration
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/iterate.f:255
#7  0x46b735 in run_fluid
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:188
#8  0x46b735 in run_mfix
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:142
#9  0x46bf8e in mfix
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:298
#10 0x402950 in main
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:269
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 56 with PID 141519 on node cdr2174 exited on signal 8 (Floating point exception).
Let’s take a look at the piece of code in question (from calc_e.f)
149 !!$omp parallel do private(IJK)
150       DO IJK = ijkstart3, ijkend3
151          IF (SIP_AT_N(IJK) .OR. MFLOW_AT_N(IJK)) THEN
152             E_N(IJK) = ZERO
153          ELSE
154             IF ((-A_M(IJK,0,MCP)) /= ZERO) THEN
155 ! calculating the correction coefficient
156                E_N(IJK) = AXZ(IJK)/(-A_M(IJK,0,MCP))
157             ELSE
158                E_N(IJK) = LARGE_NUMBER
159             ENDIF
160          ENDIF
161       ENDDO
162
Line 156 is where the division happens; I suspect the denominator is 0. However, there is a check at line 154 which should have prevented this (it’s slightly odd that the author tests -A_M for non-zero instead of just A_M, but that’s not the problem).
I wonder if there is some kind of race condition between different processors - does another CPU change the value of A_M after the check but before the division? Also odd is the !!$omp parallel at line 149: OpenMP directives start with the !$omp sentinel, and the second ! turns the whole line into an ordinary comment, so the directive has no effect here. I’m not sure why it was added. You could try changing line 149 to start with !$omp instead of !!$omp, though I’m not sure it will make a difference.
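To show concretely what the extra ! does, here is a minimal standalone sketch (not MFiX code - the array and loops are just placeholders) that you can compile with and without gfortran -fopenmp:

! Minimal standalone sketch (not MFiX code): "!$omp" is an OpenMP sentinel
! that the compiler honors when OpenMP is enabled (gfortran -fopenmp);
! adding a second "!" turns the whole line into an ordinary comment,
! so that loop always runs serially.
program omp_sentinel_demo
   implicit none
   integer, parameter :: n = 8
   real :: a(n)
   integer :: i

   ! Active directive: with -fopenmp the iterations may be split across
   ! threads; without -fopenmp the line is simply ignored.
!$omp parallel do private(i)
   do i = 1, n
      a(i) = real(i)**2
   end do
!$omp end parallel do

   ! Disabled directive: the extra "!" makes this a plain comment in all
   ! cases, so this loop always runs on a single thread.
!!$omp parallel do private(i)
   do i = 1, n
      a(i) = a(i) + 1.0
   end do

   print *, 'a =', a
end program omp_sentinel_demo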
Hi Charles,
I just changed the domain decomposition combination and resumed the simulation. Now I get an error pointing to different lines of the subroutines:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0  0x2ad0664dac5d in ???
#1  0x2ad0664d9e95 in ???
#2  0x2ad06696197f in ???
#3  0x7366bb in __calc_e_mod_MOD_calc_e_e
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/calc_e.f:77
#4  0x5fe736 in u_m_star
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/solve_vel_star.f:291
#5  0x5fe736 in __solve_vel_star_mod_MOD_solve_vel_star
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/solve_vel_star.f:134
#6  0x52d645 in _iterate_MOD_do_iteration
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/iterate.f:255
#7  0x46b735 in run_fluid
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:188
#8  0x46b735 in run_mfix
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:142
#9  0x46bf8e in mfix
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:298
#10 0x402950 in main
    at /home/moz455/projects/def-spiteri/moz455/mfix-21.1.4/model/mfix.f:269
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 50 with PID 227662 on node cdr2195 exited on signal 8 (Floating point exception).
Hi @jeff.dietiker ,
Attached are the setup files, along with the two subroutines I have modified. FYI, I resumed the simulation on a single CPU and it has not been interrupted yet. I am not sure whether this is relevant to the issue or not.
Thanks,
Mohsen

Attachments: calc_grbdry.f (79.9 KB), source_granular_energy.f (31.1 KB), sri.mfx (10.5 KB)
Hi. Did you read my reply above? This looks like a race condition to me.
When you get a stack trace, you can examine the code that caused the problem.
(“Use the source, Luke”)
After you changed the domain decomposition, the error moved from line 156 to 77 of calc_e.f but the code there is almost identical:
IF (A_M(IJK,0,MCP) /= ZERO) THEN
! calculating the correction coefficient
E_E(IJK) = AYZ(IJK)/(-A_M(IJK,0,MCP))
ELSE
We shouldn’t be getting a zero-division at 77 because of the check at 75. But perhaps another process is changing the value of A_M(IJK,0,MCP) between the check and the division.
Also note the commented-out !!$omp parallel directive right above that section.
– Charles
FWIW, I’ve been running this case on 8 cores on my local machine, without the modifications. It has run for 12 hours of wall-clock time and reached a simulation time of 1.2 s without any warnings.
Hi @mohsenclick . We’ve just been discussing this issue.
Neither Jeff nor I have been able to reproduce the crash, but it’s possible we didn’t run for long enough. Some of these problems with long-running DMP jobs are hard to reproduce.
I was mistaken about the nature of the !!$omp parallel do private(IJK) directives - these govern SMP behavior, not DMP. So I was barking up the wrong tree there. Jeff also thinks it’s extremely unlikely to be a race condition, since each node has a private range of IJK indices and its own copy of the A_M array.
So our current guess is that what’s happening at line 156 (or 77) is not a divide-by-zero but an overflow - if the denominator is small enough (but nonzero), the result of the division is too large to represent, and that also raises a floating-point exception.
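To illustrate the overflow idea, here is a toy example (not MFiX code; the -ffpe-trap flag is just one way to reproduce the behavior - your executable evidently traps floating-point exceptions anyway, since you see SIGFPE rather than silent Inf values):

! Toy demonstration (not MFiX code) of how a tiny but nonzero denominator
! raises SIGFPE through IEEE overflow rather than divide-by-zero.
! Compile with, e.g., "gfortran -ffpe-trap=overflow overflow_demo.f90".
program overflow_demo
   implicit none
   double precision :: numerator, denominator, quotient

   numerator   = 1.0d0
   denominator = 1.0d-300            ! nonzero, so a "/= ZERO" test passes

   quotient = numerator / denominator
   print *, 'still representable:', quotient      ! 1d300 is below HUGE(1d0)

   denominator = denominator * 1.0d-20             ! now ~1d-320, still nonzero
   quotient = numerator / denominator               ! exceeds HUGE(1d0) ~ 1.8d308
   print *, 'not reached when overflow is trapped:', quotient
end program overflow_demo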
We’d like to change the tests from “denominator nonzero” to “denominator magnitude greater than a small epsilon”. Attached is a copy of calc_e.f modified in this way:
- IF (A_M(IJK,0,MCP) /= ZERO) THEN
+ IF (DABS(A_M(IJK,0,MCP)) .ge. SMALL_NUMBER) THEN
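For context, with that change the loop from your first backtrace would read roughly as below (a sketch reconstructed from the snippet quoted earlier, not a verbatim copy of the attached file; ZERO, SMALL_NUMBER and LARGE_NUMBER are, as far as I recall, parameters from the param1 module):

! Sketch of the patched loop (reconstructed, not copied from the attached
! calc_e.f): the "/= ZERO" test becomes a magnitude test, so very small
! denominators take the LARGE_NUMBER branch instead of overflowing.
      DO IJK = ijkstart3, ijkend3
         IF (SIP_AT_N(IJK) .OR. MFLOW_AT_N(IJK)) THEN
            E_N(IJK) = ZERO
         ELSE
            IF (DABS(A_M(IJK,0,MCP)) .GE. SMALL_NUMBER) THEN
! calculating the correction coefficient
               E_N(IJK) = AXZ(IJK)/(-A_M(IJK,0,MCP))
            ELSE
               E_N(IJK) = LARGE_NUMBER
            ENDIF
         ENDIF
      ENDDO

If SMALL_NUMBER is on the order of 1d-15 (as I recall), then with AXZ of order one the quotient stays around 1d15, far below HUGE(1d0), so the overflow path is closed off.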
Can you replace calc_e.f with the attached file and see if you can provoke the error? This would be very helpful to us.
Hi Charles and Jeff,
I will run my case with the modified calc_e.f and let you know. I should mention that my simulations are interrupted frequently by this error when I run them on the ComputeCanada clusters. Also, running the case on a single CPU reduces the frequency of the interruptions.
Thanks,
Mohsen
Of course, the simulation proceeds a lot faster on a cluster than on a single core, so the problem will show up sooner there. From your logs, it looks like the crash happened at about t=20s, which takes a long time to reach on a single core. Also, it’s possible that this is a DMP-only error…
Hi Dong. Please don’t post on topics that have already been marked “solved” - starting a new topic instead makes it easier to keep track of issues on the forum.
There is not enough information in this screenshot to say what’s going on. Please create a new topic, and attach the complete log files so we can help you.