Parallel computing: Runtime crash

Dear developer,
When I run a parallel computation on Linux, MFiX-21.4 always reports the error below after a few seconds. I don't know whether the calculation is diverging or something else is wrong. Could you guide me? Thank you very much!
In addition, I would like to ask whether the solids outflow velocity at the mass outlet affects the particle outflow per unit time.

corrupted size vs. prev_size

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
NITs/SEC = 26.93
Timestep walltime, DEM solver: 2.488 s
t= 2.000235 Wrote SPx: 1
#0 0x7f064dc433ff in ???
#1 0x7f064dc4337f in ???
#2 0x7f064dc2ddb4 in ???
#3 0x7f064dc864e6 in ???
#4 0x7f064dc8d5eb in ???
#5 0x7f064dc8de45 in ???
#6 0x7f064dc8f27a in ???
#7 0x7f05ff0b65b3 in __write_res1_des_MOD_write_res_parray_1i
at /home/tong/anaconda3/envs/mfix-21.4/share/mfix/src/model/des/write_res1_des_mod.f:464
#8 0x7f05ff2c5936 in __write_res0_des_mod_MOD_write_res0_des
at /home/tong/anaconda3/envs/mfix-21.4/share/mfix/src/model/des/write_res0_des.f:54
#9 0x7f05ff110f54 in _output_man_MOD_output_manager
at /home/tong/anaconda3/envs/mfix-21.4/share/mfix/src/model/output_manager.f:162
#10 0x7f05ff0fde46 in run_mfix
at /home/tong/anaconda3/envs/mfix-21.4/share/mfix/src/model/mfix.f:166
#11 0x7f05fef34f32 in __main_MOD_run_mfix0
at /home/tong/mfix/continuous-moving-bed/build/pymfix/main.f90:81
#12 0x7f05fef29056 in f2py_rout_mfixsolver_main_run_mfix0
at /home/tong/mfix/continuous-moving-bed/build/f2pywrappers/mfixsolvermodule.c:1353
#13 0x55e6944ef540 in _PyObject_MakeTpCall
at /usr/local/src/conda/python-3.8.15/Objects/call.c:159
#14 0x55e6944eb373 in _PyObject_Vectorcall
at /usr/local/src/conda/python-3.8.15/Include/cpython/abstract.h:125
#15 0x55e6944eb373 in _PyObject_Vectorcall
at /usr/local/src/conda/python-3.8.15/Include/cpython/abstract.h:115
#16 0x55e6944eb373 in call_function
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:4963
#17 0x55e6944eb373 in _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:3469
#18 0x55e6944f71d5 in PyEval_EvalFrameEx
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:741
#19 0x55e6944f71d5 in function_code_fastcall
at /usr/local/src/conda/python-3.8.15/Objects/call.c:284
#20 0x55e6944f71d5 in _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.8.15/Objects/call.c:411
#21 0x55e6944e6c84 in _PyObject_Vectorcall
at /usr/local/src/conda/python-3.8.15/Include/cpython/abstract.h:127
#22 0x55e6944e6c84 in call_function
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:4963
#23 0x55e6944e6c84 in _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:3486
#24 0x55e6944f71d5 in PyEval_EvalFrameEx
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:741
#25 0x55e6944f71d5 in function_code_fastcall
at /usr/local/src/conda/python-3.8.15/Objects/call.c:284
#26 0x55e6944f71d5 in _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.8.15/Objects/call.c:411
#27 0x55e6944e6c84 in _PyObject_Vectorcall
at /usr/local/src/conda/python-3.8.15/Include/cpython/abstract.h:127
#28 0x55e6944e6c84 in call_function
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:4963
#29 0x55e6944e6c84 in _PyEval_EvalFrameDefault
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:3486
#30 0x55e6944f71d5 in PyEval_EvalFrameEx
at /usr/local/src/conda/python-3.8.15/Python/ceval.c:741
#31 0x55e6944f71d5 in function_code_fastcall
at /usr/local/src/conda/python-3.8.15/Objects/call.c:284
#32 0x55e6944f71d5 in _PyFunction_Vectorcall
at /usr/local/src/conda/python-3.8.15/Objects/call.c:411
#33 0x55e69450695a in _PyObject_Vectorcall
at /usr/local/src/conda/python-3.8.15/Include/cpython/abstract.h:127
#34 0x55e69450695a in method_vectorcall
at /usr/local/src/conda/python-3.8.15/Objects/classobject.c:67
#35 0x55e6945090a1 in PyVectorcall_Call
at /usr/local/src/conda/python-3.8.15/Objects/call.c:200
#36 0x55e6945090a1 in PyObject_Call
at /usr/local/src/conda/python-3.8.15/Objects/call.c:228
#37 0x55e6945dcc38 in t_bootstrap
at /usr/local/src/conda/python-3.8.15/Modules/_threadmodule.c:1002
#38 0x55e6945dcb83 in pythread_wrapper
at /usr/local/src/conda/python-3.8.15/Python/thread_pthread.h:232
#39 0x7f064e96b179 in ???
#40 0x7f064dd08dc2 in ???
#41 0xffffffffffffffff in ???
/home/tong/mfix/continuous-moving-bed/mfixsolver: line 3: 3017370 Aborted (core dumped) env LD_PRELOAD=/usr/lib64/openmpi/lib/ LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}"" PYTHONPATH="/home/tong/mfix/continuous-moving-bed":"":${PYTHONPATH:+:$PYTHONPATH} /home/tong/anaconda3/envs/mfix-21.4/bin/python3.8 -m mfixgui.pymfix "$@"

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[10150,1],10]
Exit code: 134

Previous MFiX run is resumable. Reset job to edit model
MFiX process has stopped (1.2 MB)

Hello Tong -

Thanks for the bug report. We are looking into this failure now. I haven't reproduced the crash yet: I noticed that you were running on 20 cores and that the crash happened at t=2.000235, and my own run hasn't reached that point in the simulation. While it runs, here are a few comments:

  1. When DMP jobs fail, it is generally a good idea to run the same job in serial (no DMP) to see if the failure is DMP-only.

  2. You are running a version of MFiX that is a year old. It is possible that this bug has already been fixed; upgrading MFiX is also a good idea whenever you are having trouble.

  3. A Google search for "corrupted size vs. prev_size" shows that this is an error message from the system C library, glibc, indicating that something wrote outside a valid memory area (not quite as severe as a segmentation fault).

  4. The fact that this happened at t=2.000235 is suspicious: very close to 2 s. You have two SPX outputs (reaction rates and turbulence quantities) set to write at a frequency of 1 s (simulation time), and it seems the error is associated with one of those outputs.

  5. For what it's worth, this is the code where the error happens:

   413  !``````````````````````````````````````````````````````````````````````!
   414  ! Subroutine: WRITE_RES_PARRAY_1I                                      !
   415  !                                                                      !
   416  ! Purpose: Write scalar integers to RES file.                          !
   417  !``````````````````````````````````````````````````````````````````````!
   421        INTEGER, INTENT(IN) :: INPUT_I(:)
   424        LOGICAL :: lLOC2GLB
   425  ! Loop counters
   426        INTEGER :: LC1, LC2
   428        lLOC2GLB = .FALSE.
   429        IF(present(pLOC2GLB)) lLOC2GLB = pLOC2GLB
   431        allocate(iPROCBUF(pPROCCNT))
   432        allocate(iROOTBUF(pROOTCNT))
   434        iDISPLS = pDISPLS
   435        iGath_SendCnt = pSEND
   436        iGatherCnts   = pGATHER
   438        IF(bDIST_IO) THEN
   439           LC1 = 1
   441           IF(lLOC2GLB) THEN
   442              DO LC2 = 1, MAX_PIP
   443                 IF(LC1 > PIP) EXIT
   444                 IF(IS_NONEXISTENT(LC1)) CYCLE
   445                 iProcBuf(LC1) = iGLOBAL_ID(INPUT_I(LC2))
   446                 LC1 = LC1 + 1
   447              ENDDO
   448           ELSE
   449              DO LC2 = 1, MAX_PIP
   450                 IF(LC1 > PIP) EXIT
   451                 IF(IS_NONEXISTENT(LC1)) CYCLE
   452                 iProcBuf(LC1) = INPUT_I(LC2)
   453                 LC1 = LC1 + 1
   454              ENDDO
   455           ENDIF
   456           CALL OUT_BIN_512i(RDES_UNIT, iProcBuf, pROOTCNT, lNEXT_REC)
   458        ELSE
   459           CALL DES_GATHER(INPUT_I, lLOC2GLB)
   460           IF(myPE == PE_IO) &
   461              CALL OUT_BIN_512i(RDES_UNIT,iROOTBUF, pROOTCNT, lNEXT_REC)
   462        ENDIF
   464        deallocate(iPROCBUF)
   465        deallocate(iROOTBUF)
   467        RETURN

This is the code that writes RES files, not SPX as I conjectured above. (It might be used by both; I'm not sure.)

The error is at line 464, where the array iPROCBUF is deallocated. Something clobbered the allocator's control area (the bookkeeping data stored next to the allocation) between the time the array was allocated and the time it was deallocated.

This error has not been reported before, to the best of my knowledge. We will try to replicate and fix it. If you have any other idea about how to trigger this error, it would be helpful.

It would also be helpful to see the output of

gfortran --version


mpirun --version


One more comment: you have a usr_rates_des.f in your project directory. It's possible that the memory corruption is happening in your own code; if it writes outside an array, that could produce exactly the error seen above. Try disabling usr_rates_des.f and see if that makes the crash go away.

I was able to reproduce this with the current code running on 8 cores, after an hour and a half of wall-clock time:

MFiX running: elapsed time 1:26:44  
Error: Solver crash!
double free or corruption (!prev)
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 write_res1_des_MOD_write_res_parray_1d
        at des/write_res1_des_mod.f:508

This was at simulation time t=0.999966

I am using gfortran 12.2.1 and mpirun 4.1.4, which are the most current versions available, so the problem is not due to an old gfortran/OpenMPI.

Dear cgw,
Thank you; we greatly appreciate your patient analysis. I have tried a serial run: it has now reached 8.2 s of simulation time and is running stably. I suspect that parallel computing caused the crash, but I don't know how to fix it. Parallel computing is still very important for us.

@cgw, what else might have caused this error?

We have not tracked down the cause of this error. It's a bit of a mystery at the moment; we'll let you know if we find out more.

When I run your case in debug mode, it fails very early with an error that typically occurs when particles are too soft. Can you try increasing the spring stiffness to see if that helps?

You also have a mass outlet BC. Can you change this to a pressure outlet and see if this makes a difference?


OK, thank you very much!

OK, thank you so much.