SMP-only Floating-point exception after modifying wall collision

Hi MFIX community,

I’m running a bubbling fluidized bed case with MFIX-25.3 (DEM) in WSL2 Ubuntu22.04. To account for a distribution of adhesive forces in real particles, I modified the particle contact logic and linked the particle “roughness asperity height” to des_usr_var(1), so that each particle has its own asperity height.

MFIX bugreport.zip (1.4 MB)

Issue description:

At simulation time t = 0.003954 s, the run stops with: “Floating-point exception - erroneous arithmetic operation”, The error points to my modified code:calc_collision_wall_mod.f:269.

What confuses me is that I changed the asperity height from a constant 0.1 μm to a distributed range 13 nm – 0.76 μm. From the calculation logic, this range should not introduce invalid operations (e.g., division by zero), so I suspect something else is happening.

Questions

  1. If it’s right way to introduce asperity distribution in MFIX-DEM code.

  2. What would be the recommended way to avoid or debug this floating-point exception ?

Thanks a lot for any suggestions

Strong dependency on SMP / hardware

I found this is highly dependent on the machine and SMP settings:

  • The case errors only on one specific server when using SMP parallelization with 32 cores.

  • Running the same case on that server with 1 core does not crash.

  • Running the same case on other computers does not crash. However, when I change operating conditions (e.g., gas velocity), I may still encounter a similar Floating-point exception later, even on other machines.

  • Run with the default (unmodified) solver/code, the simulation does not crash

So I currently suspect my modification may have some conflict with the SMP parallel strategy.

Additional observation

During the simulation, I notice the number of particles gradually decreases. As shown in the animation, some particles appear to stick to the wall and their velocity becomes zero. Does this imply those particles are leaving the computational domain), or is it simply that they are immobilized but still in the domain?

Supplyments

  1. The case origins from The effect of cohesive forces on the fluidization of aeratable powders - Galvin - 2014 - AIChE Journal - Wiley Online Library
  2. error report
  3. Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

    Backtrace for this error:
    #0 0x7fdc8bab251f in ???
    #1 0x55a2bc8b186a in __calc_collision_wall_MOD_calc_dem_force_with_wall_stl._omp_fn.0
    at /home/cfd/mfix/elec/Asperties_distribution/calc_collision_wall_mod.f:269
    #2 0x7fdc8c269a65 in GOMP_parallel
    at ../../../libgomp/parallel.c:178
    #3 0x55a2bc8b6c12 in __calc_collision_wall_MOD_calc_dem_force_with_wall_stl
    at /home/cfd/mfix/elec/Asperties_distribution/calc_collision_wall_mod.f:184
    #4 0x55a2bc8bbc4b in __calc_force_dem_mod_MOD_calc_force_dem
    at /home/cfd/mfix/elec/Asperties_distribution/calc_force_dem.f:116
    #5 0x55a2bc647c83 in _des_time_march_MOD_des_time_step
    at /home/cfd/miniforge3/envs/mfix-25.3/share/mfix/src/model/des/des_time_march.f:352

    #6 0x55a2bc781918 in run_dem
    at /home/cfd/miniforge3/envs/mfix-25.3/share/mfix/src/model/mfix.f:214

    #7 0x55a2bc781918 in run_mfix
    at /home/cfd/miniforge3/envs/mfix-25.3/share/mfix/src/model/mfix.f:149
    #8 0x55a2bc781dd2 in mfix
    at /home/cfd/miniforge3/envs/mfix-25.3/share/mfix/src/model/mfix.f:326
    #9 0x55a2bc43e772 in main
    at /home/cfd/miniforge3/envs/mfix-25.3/share/mfix/src/model/mfix.f:296
    Floating point exception (core dumped)

Here’s your line 269

                    DistApart = (SQRT(DISTSQ)-R_LM)
                    FORCE_COH = WALL_HAMAKER_CONSTANT*DES_RADIUS(LL) /&
                       (6.0d0*DistApart**2)*(asp_local/(asp_local +  &
                       DES_RADIUS(LL)) + ONE/(ONE+asp_local/         &
                       DistApart)**2)

I always advise breaking apart complex calculations like this FORCE_COH Use some intermediate expressions and check for 0 before dividing. Also note that dividing by an extremely small value can cause an overflow. write(*, *) statements are your friend

It is odd however that this only happens in SMP. Check the !$omp(parallel) directives… they look OK to me on first glance but maybe something is being shared that shouldn’t be….

Finally, are you getting a core dump? if so

$ gdb python core
>>> up
>>> info locals
>>> print DistApart
# You may have to do
>>> info threads
## then select the thread where the error occured
>>> thread N

(you may have to do a debug build to get debugger symbols. But when you are having issues a debug build is a good idea anyhow)

Good luck, please let us know if you get this sorted out!

– Charles

1 Like

If you run the most recent MFiX version (25.3) the Dashboard will show you total # of particles and # of particles removed from the domain. You should be getting some warning about “Rogue particles” if they are being removed.

Charles, thank you for the suggestion. I was able to capture the crash from core dump. The error is:

  • “Thread 1 received SIGFPE”

  • At the crash location: LL = 76, distsq = 3.07e-09

  • asp_local = -1.5152478981397783e+213

  • while discretelement::des_usr_var(1,LL) = 1.045518667e-07

In the code I assign asp_local = des_usr_var(1,LL), so these values should match, but they clearly do not at the time.

I also noticed a key difference in OpenMP scoping: in calc_collision_wall_mod the des_usr_var listed in the !$omp private(...) variables, while in calc_force_dem it is treated as shared(...). This makes me suspect that, inside the OpenMP parallel region, each thread has its own private copy of des_usr_var and that private copy is not initialized (so it contains undefined values), which would explain why asp_local becomes an absurd number even though the module variable looks correct.

Does this diagnosis sound correct to you, or is there another potential issue I should consider?

This seems consistent with the OpenMP specification wording:

  • “The initial value of the new list item is undefined. The initial status of a private pointer is undefined.” —List Item Privatization

My next step is to remove des_usr_var from the private(...) list and keep it shared in calc_collision_wall_mod, then rerun to see if the SIGFPE disappears.

1 Like

Great, it looks like you are making good progress with the debugger. It’s encouraging that you are able to get useful information from GDB in the WSL environment. We don’t have a good debugging environment in the native Windows build, so perhaps we should encourage our Windows users to use WSL. (We have not spent much time experimenting with WSL).

I think you are probably right about the uninitialized value. Please try adjusting the
!$omp declarations and see what happens. Please let us know if there is something in the MFiX code we need to fix! SMP programming is notoriously error-prone…

Finally, I still advise breaking apart a complex calculation like

FORCE_COH = WALL_HAMAKER_CONSTANT*DES_RADIUS(LL) /  &
      (6.0d0*DistApart**2)*(asp_local/(asp_local +  &
      DES_RADIUS(LL)) + ONE/(ONE+asp_local/         &
      DistApart)**2)

into several smaller calculations (using intermediate variables).

Please keep us updated on your progress.