Float invalid operation in lround

wangjinjing · July 24, 2024, 8:59am

Hello, everyone!
When running the following case, I encountered a solver crash at about 3.8 seconds. The error reports a float invalid operation in lround (as shown). I could not find a file named lround, so I don’t know what it refers to or how to modify the case so that it runs smoothly. Can anyone help me?
Thank you very much!

b0718.mfx (29.8 KB)
usr_rates.f (4.9 KB)
usr_rates_des.f (5.9 KB)

jeff.dietiker · July 24, 2024, 3:20pm

Please try building the solver with debug flags to see whether it points to where the crash occurs.

cgw · July 24, 2024, 9:56pm

I haven’t reproduced the error yet (the simulation runs pretty slowly) but I did notice the following WARNING:

Warning: Could not find BC.
Check input file to make sure domain boundaries are defined.
DES_POS_NEW:     -0.01940       -0.07859        0.00033
I,J,K:    1   2   9
CLOSEST_PT:     -0.02000       -0.07859        0.00033
NORM_FACE:      1.00000        0.00000        0.00000
Suppressing further warnings.

cgw · July 25, 2024, 3:07pm

A few comments:
Building the solver with debug flags set adds extra checks for array bounds, which helps track down out-of-bounds array accesses. But it probably won’t help track down floating-point exceptions (FPE). And it makes the solver run much more slowly.

The real issue here is that on Windows, the FPE code is only able to print out the innermost stack frame. On Linux, we get a full stack trace. This is a longstanding issue with the Windows platform, which I would like to resolve, but at this time we don’t have a good solution.

I ran the case on Linux, but it runs extremely slowly: it has been running for about 18 hours and has only reached a simulation time of 1.5 s. According to your report, the crash happens at 3.8 s. How long did you have to run the simulation before this happened?

wangjinjing · July 26, 2024, 4:39am

Thank you for your reply. I think it had been running for more than two days when it crashed. Is there any way to make it run faster?

cgw · July 26, 2024, 5:25pm

I ran this case on both Linux and Windows. It runs a little faster on Linux.

Linux:

Windows:

This is running on identical hardware (Lenovo X280).

I have not been able to reproduce the “lround” crash. On Linux, the simulation terminated after about 36 hours, without an error message. Unfortunately, I ran into a known bug in the timekeeping code:

MFiX running, simulation time: 0:00:3.326 elapsed time: 34:29:00
MFiX running, simulation time: 0:00:3.327 elapsed time: 34:30:00
MFiX running, simulation time: 0:00:3.328 elapsed time: 2562082:17:51
MFiX running, simulation time: 0:00:3.329 elapsed time: 2562082:18:00

Due to a numerical overflow, the elapsed time jumped from the correct value (34 hours, 30 minutes) to an absurdly high value (about 300 years!). I believe this triggered the batch time limit, but I’m not 100% sure about this. In any case, I reached a simulation time of 3.439 seconds without seeing any FPE errors.
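A side observation on the bogus value, offered as an assumption rather than a diagnosis: it is close to the last correct elapsed time plus 2**63 nanoseconds, which is the kind of offset a mishandled signed 64-bit nanosecond counter could produce. The sketch below only does that arithmetic. It uses no MFiX code, and the program and variable names are made up for illustration.

```fortran
program elapsed_overflow_check
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp), parameter :: ns_per_hour = 3600.0d0 * 1.0d9
   real(dp) :: wrap_hours, true_hours

   ! Assumption: the timer is a signed 64-bit count of nanoseconds,
   ! so a sign-bit error would shift the value by 2**63 ns.
   wrap_hours = 2.0d0**63 / ns_per_hour

   ! Last correct elapsed time in the log: 34:30:00
   true_hours = 34.5d0

   print '(A, F12.2)', '2**63 ns in hours:      ', wrap_hours
   print '(A, F12.2)', 'plus true elapsed time: ', wrap_hours + true_hours
end program elapsed_overflow_check
```

The second figure comes out near 2,562,082 hours, which lines up with the hour field of the first bad log line. That is consistent with a 64-bit overflow but does not by itself show where in the timekeeping code it happens.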

On Windows, the job is still running; it has reached a simulation time of 3.313 s after 42 hours of run time.

Since the lround crash is fairly hard to reproduce, I think the most useful things for me to work on at this point are:

Fix the timekeeping overflow bug
Improve the FPE trap on Windows, to print a full stack trace instead of just the innermost stack frame (this is a bit difficult but maybe I can get it to work)

To make your case run faster, the standard advice is:

Use a coarser mesh
Increase the maximum time step
Increase tolerances for convergence
Experiment with different discretization and preconditioning schemes (Numerics pane)
Use more CPUs (this works better on Linux)

I also advise you to look at the WARNINGS printed during the run and take them seriously.

Warning: Could not find BC.
Check input file to make sure domain boundaries are defined.
DES_POS_NEW:     -0.01940       -0.07859        0.00033
I,J,K:    1   2   9
CLOSEST_PT:     -0.02000       -0.07859        0.00033
NORM_FACE:      1.00000        0.00000        0.00000

Hope this helps,
– Charles

wangjinjing · July 27, 2024, 3:41pm

I don’t quite understand how to fix the numerical overflow problem. I also double-checked my boundary conditions, and they seem to be fine. Could you please check them for me? Thanks.

cgw · July 28, 2024, 1:59pm

To be clear, I do not expect you to fix the overflow problem - that’s on my TODO list. But you need to figure out why you are getting that “Could not find BC” warning.

wangjinjing · July 28, 2024, 2:36pm

But isn’t it the numerical overflow that causes the time to reach an absurd value, which then makes the case run very slowly?

cgw · July 28, 2024, 3:41pm

No, the case runs very slowly before the time counter overflow occurs.
There are 4 separate problems here:

1. The “BC not found” warning, which you should investigate and fix
2. The slow running speed, which I gave you suggestions for
3. The FPE in lround, which I have not yet been able to reproduce
4. The timer overflow, which is annoying but does not cause any of the above 3 issues.

I want you to work on #1 and #2; I will address #3 and #4.

jeff.dietiker · August 5, 2024, 7:37pm

I was able to trigger an FPE at line 102 of usr_rates.f:

      RATES(R5) = 6.7d12 * exp(-2.4358d4/T_g(IJK)) * &
      EP_g(IJK) * (c_O2**1.3) * (c_CH4**0.2)

My guess is that c_O2 or c_CH4 becomes negative, and taking a non-integer power of a negative number triggers the FPE. It is better to compute the RATES only if the concentrations are positive, or above a small threshold, to avoid the FPE.
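A minimal standalone sketch of that guard, for illustration only: the function name rate_r5, the threshold c_min and its value, and the sample inputs are all hypothetical and are not taken from the case files. In usr_rates.f the same IF test would wrap the existing RATES(R5) assignment.

```fortran
program guarded_rate_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! Hypothetical cutoff; choose a value suited to the units of the case
   real(dp), parameter :: c_min = 1.0d-12

   ! Made-up sample values: temperature, gas volume fraction, concentrations
   print *, rate_r5(1500.0d0, 0.9d0, 1.0d-3, 2.0d-3)    ! both positive
   print *, rate_r5(1500.0d0, 0.9d0, -1.0d-9, 2.0d-3)   ! negative O2, rate set to 0

contains

   pure function rate_r5(T_g, EP_g, c_O2, c_CH4) result(rate)
      real(dp), intent(in) :: T_g, EP_g, c_O2, c_CH4
      real(dp) :: rate

      if (c_O2 > c_min .and. c_CH4 > c_min) then
         ! Same expression as line 102 of usr_rates.f, evaluated for one cell
         rate = 6.7d12 * exp(-2.4358d4/T_g) * &
                EP_g * (c_O2**1.3) * (c_CH4**0.2)
      else
         ! Skip the fractional powers entirely when a concentration is
         ! zero, negative, or below the cutoff
         rate = 0.0d0
      end if
   end function rate_r5

end program guarded_rate_demo
```

Setting the rate to zero below the cutoff is one possible choice. Clamping the concentrations with max(c, 0.0d0) before taking the power would also avoid the invalid operation.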

cgw · August 13, 2024, 3:05pm

@jeff.dietiker How long did you run for before getting the FPE?

Your guess is good, but it’s odd that the exception occurred in the lround function rather than pow, which is what I’d expect from a non-integer power of a negative value.

jeff.dietiker · August 13, 2024, 4:45pm

It failed after running for one hour, at t=0.33s. I increased dt_max to 1e-2 to get there faster. The FPE I got was in xpow.c, not lround, so it is not the same as initially reported.

cgw · August 13, 2024, 8:07pm

Thanks! Note that the original bug report was from Windows which uses a different math library. It’s not exactly the same FPE, but probably related.

cgw · August 14, 2024, 12:26pm

There’s also this warning which hasn’t been addressed yet:

Warning: Could not find BC.
Check input file to make sure domain boundaries are defined.
DES_POS_NEW:     -0.01940       -0.07859        0.00033
I,J,K:    1   2   9
CLOSEST_PT:     -0.02000       -0.07859        0.00033
NORM_FACE:      1.00000        0.00000        0.00000
Suppressing further warnings.

It seems like perhaps this should be treated as a fatal error?