Thanks Charles! I will continue as you suggested. I hope it can be solved soon, and I will update you if anything new happens.
@cgw
Charles,
As an update, I rebuilt my case starting from the MFiX 3D fluidized-bed tutorial and modified that simulation step by step to see what causes the error. I can now say definitively that it occurs when the Srivastava frictional model is used.
I changed the model to the Schaeffer model without changing the ep_star value, and it has been running for 14 hours without any issues.
I also checked the grid size effect and I think that the error is not related to the grid size.
Hope this helps.
@cgw
Attached is the case that I ran for several days on DMP using the SCHAEFFER model without any issue. Changing the model to SRIVASTAVA triggers the segmentation fault.
fluid_bed_tfm_3d.mfx (11.4 KB)
Hi Charles and Mohsen, I ran into the same problem.
dem_3d_LKTJH_Muller_Gidaspow_2021_4_1_vsmpi.zip (26.2 MB)
Whether I run it from the GUI or from the command line with 10 processes, an error is reported after about 0.017 s of simulation time.
The error appears both on the command line and in the GUI.
With 1 process, however, everything is fine; I have run 10 seconds of simulation time.
I hope this case is helpful for you.
Best regards
xjwuyuhao
2021.04.01
Yuhao -
I tried running the LKTJH_Muller_Gidaspow_vsmpi model you attached above - I recompiled it to run on my machine and I got a different error:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
#3 0x7f6f358a17d3 in write_gran_temp_
at /tmp/dem_3d_LKTJH_Muller_Gidaspow_2021_4_1_vsmpi/usr0_des.f:233
#4 0x7f6f358a18af in usr0_des_
at /tmp/dem_3d_LKTJH_Muller_Gidaspow_2021_4_1_vsmpi/usr0_des.f:27
#5 0x7f6f35e8f315 in __des_time_march_MOD_des_time_init
The line is:
N_Fd = (-VOLPARTICLE * P_Grad_diff + DgA * VRELA) / &
(3.0d0 * pi * Mug * DPM * ep_g(IJK) * VRELA)
It's probably a simple zero-division error.
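As an aside, here is a minimal standalone sketch (not code from the attached case) of how that kind of zero division plays out, assuming the solver was built with floating-point exception trapping (e.g. gfortran's -ffpe-trap=zero,invalid): a single zero factor such as VRELA zeroes the whole denominator and the divide raises SIGFPE.

! Minimal sketch, not from the attached case: with FPE trapping enabled,
! a zero VRELA (or DPM) zeroes the whole denominator and the division
! raises SIGFPE, much like the backtrace above.
program div_demo
   implicit none
   double precision, parameter :: pi = 3.141592653589793d0
   double precision :: DgA, VRELA, Mug, DPM, epg, NUM, DENOM, N_Fd
   DgA   = 1.0d-3
   VRELA = 0.0d0                              ! relative velocity happens to be zero
   Mug   = 1.85d-5
   DPM   = 1.2d-3
   epg   = 0.5d0
   NUM   = DgA * VRELA                        ! numerator collapses to zero too
   DENOM = 3.0d0 * pi * Mug * DPM * epg * VRELA
   N_Fd  = NUM / DENOM                        ! 0/0: NaN, or SIGFPE when trapped
   print *, "N_Fd =", N_Fd
end program div_demo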
Hi Mohsen, I'm looking at this issue again; so far I have not been able to reproduce it on my machine (with the Srivastava friction model). How long did you have to let it run before you saw the error?
Hi, it happened in less than 30 minutes. Judging from your earlier posts, I think you hit this error too. I can re-run it if needed.
I got the error with the "laverman" file but not (yet) with "fluid_bed_tfm_3d".
Let me run it again.
I re-ran it on my machine a few minutes ago and am waiting for the error. Also, I changed the frictional model from SCHAEFFER to SRIVASTAVA in the same "fluid_bed_tfm_3d" case that has been running for more than 2 weeks on a cluster, and it reproduced the error in less than a minute.
I reproduced the error again using "fluid_bed_tfm_3d" at Time = 0.02474:
Timestep walltime, fluid solver: 21.853 s
Time = 0.24742E-01 Dt = 0.50000E-03
Nit HYDRO THETA Max res
1 0.4 3.4E-02 P1
At line 43 of file fun_avg.inc
Fortran runtime error: Index '987654321' of dimension 1 of array 'fx' above upper bound of 123
Error termination. Backtrace:
#0 0x7f7f4da8cd01 in ???
#1 0x7f7f4da8d849 in ???
#2 0x7f7f4da8dec6 in ???
#3 0x5598f4e91066 in __fun_avg_MOD_avg_x_e
at /home/mzarepour/Documents/MFiX/build/model/fun_avg.inc:43
#4 0x5598f525a716 in __calc_s_ddot_mod_MOD_get_neighbor_vel_at_wall
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_s_ddot_s.f:630
#5 0x5598f527d59c in __calc_s_ddot_mod_MOD_calc_s_ddot_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_s_ddot_s.f:20
#6 0x5598f5224ce6 in __calc_gr_boundary_MOD_calc_grbdry
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_grbdry.f:148
#7 0x5598f504236f in __source_u_s_mod_MOD_jj_bc_u_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:1072
#8 0x5598f504cedf in __source_u_s_mod_MOD_source_u_s_bc
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:792
#9 0x5598f505d842 in __source_u_s_mod_MOD_source_u_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:466
#10 0x5598f4ffb257 in u_m_star
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/solve_vel_star.f:267
#11 0x5598f4ff6e20 in __solve_vel_star_mod_MOD_solve_vel_star
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/solve_vel_star.f:134
#12 0x5598f4e9ad68 in _iterate_MOD_do_iteration
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/iterate.f:255
#13 0x5598f4d82f39 in run_fluid
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:188
#14 0x5598f4d82c4b in run_mfix
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:142
#15 0x5598f4d83724 in mfix
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:315
#16 0x5598f4d83e4a in main
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:269
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[44185,1],8]
Exit code: 2
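For what it's worth, that "above upper bound" message looks like gfortran's run-time bounds checking (-fcheck=bounds) catching an out-of-range cell index before it corrupts memory. A minimal sketch (not MFiX code) that triggers the same style of message:

! Minimal sketch, not MFiX code: compiled with gfortran -fcheck=bounds,
! an out-of-range index aborts with the same kind of runtime message
! as the fun_avg.inc backtrace above.
program bounds_demo
   implicit none
   double precision :: fx(123)
   integer :: IJK
   fx  = 1.0d0
   IJK = 987654321        ! bogus index, e.g. from an uninitialized neighbor cell
   print *, fx(IJK)       ! "Index '987654321' ... above upper bound of 123"
end program bounds_demo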
Hi cgw, thanks for your help. However, before the line you pointed to, I had already imposed some restrictions:
Pi and DPM are not zero. Mug equals Mu_g0, which is set in the GUI and is also nonzero. VRELA is not zero either. In the formula for N_Fd, ep_g(IJK) is required to be .GT. zero. So all the factors in the denominator should be greater than zero, and I don't think it is a zero-division error.
What's more, the case runs fine with 1 process. If it were a zero-division error, I would expect it to fail with 1 process as well.
Best regards
xjwuyuhao
2021.04.15
Yuhao - sorry for the delayed response - I've been traveling.
When in doubt, print more numbers out. I modified your code to compute the numerator and denominator of the fraction separately. It's sometimes helpful to break up computations like this, so you can test whether the denominator is zero before attempting to divide. I think your checks for zero are not sufficient: if either DPM or VRELA is 0, you get a 0 in the denominator. Instead of trying to check every factor, just compute the denominator and check it, like so:
if((ep_s(IJK,1) .GT. ZERO) .AND. (Mug .GT. ZERO) &
   .AND. (ep_g(IJK) .GT. ZERO)) then
   !if (DEAD_CELL_AT(I,J,K)) cycle
   N_Power = -VOLPARTICLE * P_Grad_diff * VPARTICLE + &
             DgA * VRELA * VPARTICLE
   write(*,*) "Mug=", Mug
   write(*,*) "DPM=", DPM
   write(*,*) "ep_g(", IJK, ")=", ep_g(IJK)
   write(*,*) "VRELA=", VRELA
   NUM = (-VOLPARTICLE * P_Grad_diff + DgA * VRELA)
   write(*,*) "NUM=", NUM
   DENOM = (3.0d0 * pi * Mug * DPM * ep_g(IJK) * VRELA)
   write(*,*) "DENOM=", DENOM
   if (DENOM .EQ. 0) then
      write(*,*) "WE HAVE A PROBLEM HERE!"
   else
      write(*,*) "FRACTION=", NUM/DENOM
      N_Fd = NUM/DENOM
   end if
Running this I get:
Mug= 1.8499999999999999E-005
DPM= 1.1999999999999999E-003
ep_g( 815 )= 0.55539130995879105
VRELA= 1.2600606582530112
NUM= 1.8629709923230549E-004
Mug= 1.8499999999999999E-005
DPM= 0.0000000000000000
ep_g( 0 )= 1.1931191281420263E-319
VRELA= 0.0000000000000000
NUM= 0.0000000000000000
DENOM= 0.0000000000000000
WE HAVE A PROBLEM HERE!
VRELA is a square root of a sum of squares, so it's non-negative, but I don't see why you say it's nonzero. And there's no nonzero test for DPM anywhere that I can see. You still can't divide 0 by 0, and I don't think the checks you had were sufficient to prevent this.
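One possible refinement, sketched below (not part of Charles's suggestion or the MFiX source; the 1.0d-15 threshold is an assumed value): rather than testing DENOM for exact equality with zero, guard it with a small tolerance, since a denominator that merely underflows to a tiny nonzero value still yields a uselessly huge fraction.

! Sketch of a tolerance-based guard; the threshold value is an assumption.
program guard_demo
   implicit none
   double precision, parameter :: small_number = 1.0d-15
   double precision :: NUM, DENOM, N_Fd
   NUM   = 1.8d-4
   DENOM = 0.0d0                      ! e.g. VRELA or DPM was zero in this cell
   if (abs(DENOM) > small_number) then
      N_Fd = NUM / DENOM
   else
      N_Fd = 0.0d0                    ! skip the drag contribution for this cell
   end if
   print *, "N_Fd =", N_Fd
end program guard_demo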
I hope this helps, please let me know!
- Charles
That output is a little confusing since the different processes are all writing to the same output and it's getting mixed together. Here's an improved version, where we only write data on processing element (PE) number 1:
if((ep_s(IJK,1) .GT. ZERO) .AND. (Mug .GT. ZERO) &
   .AND. (ep_g(IJK) .GT. ZERO)) then
   !if (DEAD_CELL_AT(I,J,K)) cycle
   N_Power = -VOLPARTICLE * P_Grad_diff * VPARTICLE + &
             DgA * VRELA * VPARTICLE
   NUM = (-VOLPARTICLE * P_Grad_diff + DgA * VRELA)
   DENOM = (3.0d0 * pi * Mug * DPM * ep_g(IJK) * VRELA)
   if (PE .eq. 1) then
      write(*,*) "Mug=", Mug
      write(*,*) "DPM=", DPM
      write(*,*) "ep_g(", IJK, ")=", ep_g(IJK)
      write(*,*) "VRELA=", VRELA
      write(*,*) "NUM=", NUM
      write(*,*) "DENOM=", DENOM
   end if
   if (DENOM .EQ. 0) then
      if (PE .eq. 1) write(*,*) "WE HAVE A PROBLEM HERE!"
   else
      if (PE .eq. 1) write(*,*) "FRACTION=", NUM/DENOM
      N_Fd = NUM/DENOM
   end if
Output:
DEM NITs: 2 Total PIP: 9240
Mug= 1.8499999999999999E-005
DPM= 0.0000000000000000
ep_g( 0 )= 1.1931191281420263E-319
VRELA= 0.0000000000000000
NUM= 0.0000000000000000
DENOM= 0.0000000000000000
WE HAVE A PROBLEM HERE!
Mug= 1.8499999999999999E-005
DPM= 0.0000000000000000
ep_g( 0 )= 1.1931191281420263E-319
VRELA= 0.0000000000000000
NUM= 0.0000000000000000
DENOM= 0.0000000000000000
WE HAVE A PROBLEM HERE!
Mug= 1.8499999999999999E-005
DPM= 1.1999999999999999E-003
ep_g( 815 )= 0.55539130995879105
VRELA= 1.2602140033530917
NUM= 1.8624048638473306E-004
DENOM= 1.4644261762803337E-007
FRACTION= 1271.7642541584919
Mohsen - I see that Jeff Dietiker just fixed this today. We will be releasing an updated 21.1.1 version soon, containing this bug fix. Thanks for your patience.
- Charles
@mohsenclick: If you don't want to wait for 21.1.1 you can try applying this patch (and help us test):
8c3c6dea9f2c0e43deb79f03bd10ad63d9cc1c42.diff.txt (735 Bytes)
- Charles
@mohsenclick Please test the above fix if you get a chance. I am not sure why it would run for a few minutes before crashing in your case. It was failing immediately for me.
@jeff.dietiker @cgw Thank you so much! I will apply the patch and let you know if anything happens again.
@mohsenclick - FYI, the 21.1.1 version released today adds --use-hwthread-cpus to the mpirun command line; thanks for the suggestion.