I encountered a similar error recently when I tried to continue a task in “RESTART_1” mode. According to the slurm .out file, the program ran normally for a period of time and then failed with the following error:
Ju - this is a different error than was originally reported, so it’s OK to create a new topic, especially since the original problem is reported as “solved”.
The exception at 982 indicates that RESID_IJK(IJK) must be NaN, otherwise a simple comparison would not trigger an invalid floating-point error.
We need to find out how the NaN value got in there, but in the meantime you can try changing the code in calc_resid.f as follows:
    IJK_RESID = 1
    MAX_RESID = RESID_IJK( IJK_RESID )
    DO IJK = ijkstart3, ijkend3
        IF(.NOT.IS_ON_myPE_wobnd(I_OF(IJK),J_OF(IJK), K_OF(IJK))) CYCLE
        ! Skip NaN entries so the comparison below is never evaluated on a NaN
        IF (ISNAN(RESID_IJK(IJK))) CYCLE
        IF (RESID_IJK( IJK ) > MAX_RESID) THEN
            IJK_RESID = IJK
            MAX_RESID = RESID_IJK( IJK_RESID )
        ENDIF
    ENDDO
This is simply checking for the NaN value and skipping the test if it occurs. That should get you past this crash, but you might run into other problems further down the line, due to the NaN. Try this out and let us know how it goes.
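To help track down where the NaN comes from, here is a minimal standalone sketch of the same skip-the-NaN search that also counts the NaN entries and records the first index where one appears. It is not MFiX code: the program name, the array `resid` (a stand-in for RESID_IJK), its size `n`, and the sample values are all made up for illustration. It also uses the standard `IEEE_IS_NAN` from the intrinsic `ieee_arithmetic` module in place of the `ISNAN` extension, and it avoids seeding the maximum from an entry that might itself be NaN.

```fortran
program nan_safe_max
  ! Hypothetical demo, not part of MFiX.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none

  integer, parameter :: n = 6
  double precision :: resid(n)     ! stand-in for RESID_IJK
  double precision :: max_resid
  integer :: ijk, ijk_resid, n_nan, first_nan

  resid = [0.1d0, 0.7d0, 0.0d0, 0.3d0, 0.9d0, 0.2d0]
  ! Inject a NaN for the demo (build without invalid-operation trapping).
  resid(3) = ieee_value(0.0d0, ieee_quiet_nan)

  ijk_resid = 0        ! 0 means "no valid entry seen yet"
  max_resid = 0.0d0
  n_nan     = 0
  first_nan = 0

  do ijk = 1, n
     if (ieee_is_nan(resid(ijk))) then
        ! Record the NaN and skip it, so it is never compared.
        n_nan = n_nan + 1
        if (first_nan == 0) first_nan = ijk
        cycle
     end if
     if (ijk_resid == 0) then
        ! First valid entry seeds the maximum.
        ijk_resid = ijk
        max_resid = resid(ijk)
     else if (resid(ijk) > max_resid) then
        ijk_resid = ijk
        max_resid = resid(ijk)
     end if
  end do

  print '(a,i0,a,es10.3)', 'max residual at index ', ijk_resid, ' = ', max_resid
  print '(a,i0,a,i0)', 'NaN entries: ', n_nan, ', first at index ', first_nan
end program nan_safe_max
```

With the sample data above, this should report index 5 as the maximum and one NaN at index 3. In the real code, printing the first NaN index (and the time step at which it first appears) would be one way to narrow down which cell goes bad first.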
Hi Charles, this time I ran a new simulation using the same project that caused the problems. The following similar “floating point” error occurred again, this time at around 0.09 s instead of 0.02 s: