Thanks Charles! I will continue as you suggested. I hope it can be solved soon, and I will update you if anything new happens.
@cgw
Charles,
As an update, I rebuilt my case starting from the MFiX 3D fluidized-bed tutorial and modified that simulation step by step to see what causes the error. I can now say definitively that it occurs when the Srivastava frictional model is used.
I changed the model to the Schaeffer model without changing the ep_star value, and it has been running for 14 hours without any issues.
I also checked the grid size effect and I think that the error is not related to the grid size.
Hope this helps.
@cgw
Attached is the case that I ran for several days on DMP using the SCHAEFFER model without any issue. Changing the model to SRIVASTAVA triggers the segmentation fault.
fluid_bed_tfm_3d.mfx (11.4 KB)
Hi Charles and Mohsen, I ran into the same problem.
dem_3d_LKTJH_Muller_Gidaspow_2021_4_1_vsmpi.zip (26.2 MB)
Whether I run it from the GUI or from the command line with 10 processes, an error is reported after about 0.017 s of simulation time.
The error appears both on the command line and in the GUI.
With 1 process, however, everything is fine; I have run 10 seconds of simulation time.
I hope this case is helpful for you.
Best regards
xjwuyuhao
2021.04.01
Yuhao -
I tried running the LKTJH_Muller_Gidaspow_vsmpi model you attached above - I recompiled it to run on my machine and I got a different error:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
#3 0x7f6f358a17d3 in write_gran_temp_
at /tmp/dem_3d_LKTJH_Muller_Gidaspow_2021_4_1_vsmpi/usr0_des.f:233
#4 0x7f6f358a18af in usr0_des_
at /tmp/dem_3d_LKTJH_Muller_Gidaspow_2021_4_1_vsmpi/usr0_des.f:27
#5 0x7f6f35e8f315 in __des_time_march_MOD_des_time_init
The line is:
N_Fd = (-VOLPARTICLE * P_Grad_diff + DgA * VRELA) / &
(3.0d0 * pi * Mug * DPM * ep_g(IJK) * VRELA)
It's probably a simple zero-division error.
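As an aside, here is a minimal standalone sketch (not code from the attached case) of how that kind of zero division plays out, assuming the solver was built with floating-point exception trapping (e.g. gfortran's -ffpe-trap=zero,invalid): a single zero factor such as VRELA zeroes the whole denominator and the divide raises SIGFPE.

! Minimal sketch, not from the attached case: with FPE trapping enabled,
! a zero VRELA (or DPM) zeroes the whole denominator and the division
! raises SIGFPE, much like the backtrace above.
program div_demo
   implicit none
   double precision, parameter :: pi = 3.141592653589793d0
   double precision :: DgA, VRELA, Mug, DPM, epg, NUM, DENOM, N_Fd
   DgA   = 1.0d-3
   VRELA = 0.0d0                              ! relative velocity happens to be zero
   Mug   = 1.85d-5
   DPM   = 1.2d-3
   epg   = 0.5d0
   NUM   = DgA * VRELA                        ! numerator collapses to zero too
   DENOM = 3.0d0 * pi * Mug * DPM * epg * VRELA
   N_Fd  = NUM / DENOM                        ! 0/0: NaN, or SIGFPE when trapped
   print *, "N_Fd =", N_Fd
end program div_demo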
Hi Mohsen, I'm looking at this issue again; so far I have not been able to reproduce it on my machine (with the Srivastava friction model). How long did you have to let it run before you saw the error?
Hi, it happened in less than 30 minutes. Judging from your earlier posts, I think you hit this error too. I can re-run it if needed.
I got the error with the "laverman" file but not (yet) with "fluid_bed_tfm_3d".
Let me run it again.
I re-ran it on my machine a few minutes ago and am waiting for the error. Also, I changed the frictional model from SCHAEFFER to SRIVASTAVA in the same "fluid_bed_tfm_3d" case that has been running for more than 2 weeks on a cluster, and it reproduced the error in less than a minute.
I reproduced the error again using "fluid_bed_tfm_3d" at Time = 0.02474:
Timestep walltime, fluid solver: 21.853 s
Time = 0.24742E-01 Dt = 0.50000E-03
Nit HYDRO THETA Max res
1 0.4 3.4E-02 P1
At line 43 of file fun_avg.inc
Fortran runtime error: Index '987654321' of dimension 1 of array 'fx' above upper bound of 123
Error termination. Backtrace:
#0 0x7f7f4da8cd01 in ???
#1 0x7f7f4da8d849 in ???
#2 0x7f7f4da8dec6 in ???
#3 0x5598f4e91066 in __fun_avg_MOD_avg_x_e
at /home/mzarepour/Documents/MFiX/build/model/fun_avg.inc:43
#4 0x5598f525a716 in __calc_s_ddot_mod_MOD_get_neighbor_vel_at_wall
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_s_ddot_s.f:630
#5 0x5598f527d59c in __calc_s_ddot_mod_MOD_calc_s_ddot_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_s_ddot_s.f:20
#6 0x5598f5224ce6 in __calc_gr_boundary_MOD_calc_grbdry
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_grbdry.f:148
#7 0x5598f504236f in __source_u_s_mod_MOD_jj_bc_u_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:1072
#8 0x5598f504cedf in __source_u_s_mod_MOD_source_u_s_bc
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:792
#9 0x5598f505d842 in __source_u_s_mod_MOD_source_u_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:466
#10 0x5598f4ffb257 in u_m_star
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/solve_vel_star.f:267
#11 0x5598f4ff6e20 in __solve_vel_star_mod_MOD_solve_vel_star
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/solve_vel_star.f:134
#12 0x5598f4e9ad68 in _iterate_MOD_do_iteration
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/iterate.f:255
#13 0x5598f4d82f39 in run_fluid
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:188
#14 0x5598f4d82c4b in run_mfix
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:142
#15 0x5598f4d83724 in mfix
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:315
#16 0x5598f4d83e4a in main
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:269
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[44185,1],8]
Exit code: 2
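For what it's worth, that "above upper bound" message looks like gfortran's run-time bounds checking (-fcheck=bounds) catching an out-of-range cell index before it corrupts memory. A minimal sketch (not MFiX code) that triggers the same style of message:

! Minimal sketch, not MFiX code: compiled with gfortran -fcheck=bounds,
! an out-of-range index aborts with the same kind of runtime message
! as the fun_avg.inc backtrace above.
program bounds_demo
   implicit none
   double precision :: fx(123)
   integer :: IJK
   fx  = 1.0d0
   IJK = 987654321        ! bogus index, e.g. from an uninitialized neighbor cell
   print *, fx(IJK)       ! "Index '987654321' ... above upper bound of 123"
end program bounds_demo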
Hi cgw, thanks for your help. However, before the line you pointed to, I had already imposed some restrictions:
Pi and DPM are not zero. Mug equals Mu_g0, which is set in the GUI and is also nonzero. VRELA is not zero either. In the formula for N_Fd, ep_g(IJK) is required to be .GT. zero. So all the factors in the denominator should be greater than zero, and I don't think it is a zero-division error.
What's more, the case runs fine with 1 process. If it were a zero-division error, I would expect it to fail with 1 process as well.
Best regards
xjwuyuhao
2021.04.15
Yuhao - sorry for the delayed response - I've been traveling.
When in doubt, print more numbers out. I modified your code to compute the numerator and denominator of the fraction separately. It's sometimes helpful to break up computations like this, so you can test whether the denominator is zero before attempting to divide. I think your checks for zero are not sufficient: if either DPM or VRELA is 0, you get a 0 in the denominator. Instead of trying to check every factor, just compute the denominator and check it, like so:
if((ep_s(IJK,1) .GT. ZERO) .AND. (Mug .GT. ZERO) &
   .AND. (ep_g(IJK) .GT. ZERO)) then
   !if (DEAD_CELL_AT(I,J,K)) cycle
   N_Power = -VOLPARTICLE * P_Grad_diff * VPARTICLE + &
             DgA * VRELA * VPARTICLE
   write(*,*) "Mug=", Mug
   write(*,*) "DPM=", DPM
   write(*,*) "ep_g(", IJK, ")=", ep_g(IJK)
   write(*,*) "VRELA=", VRELA
   NUM = (-VOLPARTICLE * P_Grad_diff + DgA * VRELA)
   write(*,*) "NUM=", NUM
   DENOM = (3.0d0 * pi * Mug * DPM * ep_g(IJK) * VRELA)
   write(*,*) "DENOM=", DENOM
   if (DENOM .EQ. 0) then
      write(*,*) "WE HAVE A PROBLEM HERE!"
   else
      write(*,*) "FRACTION=", NUM/DENOM
      N_Fd = NUM/DENOM
   end if
Running this I get:
Mug= 1.8499999999999999E-005
DPM= 1.1999999999999999E-003
ep_g( 815 )= 0.55539130995879105
VRELA= 1.2600606582530112
NUM= 1.8629709923230549E-004
Mug= 1.8499999999999999E-005
DPM= 0.0000000000000000
ep_g( 0 )= 1.1931191281420263E-319
VRELA= 0.0000000000000000
NUM= 0.0000000000000000
DENOM= 0.0000000000000000
WE HAVE A PROBLEM HERE!
VRELA is a square root of a sum of squares, so it's non-negative, but I don't see why you say it's nonzero. And there's no nonzero test for DPM anywhere that I can see. You still can't divide 0 by 0, and I don't think the checks you had were sufficient to prevent this.
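One possible refinement, sketched below (not part of Charles's suggestion or the MFiX source; the 1.0d-15 threshold is an assumed value): rather than testing DENOM for exact equality with zero, guard it with a small tolerance, since a denominator that merely underflows to a tiny nonzero value still yields a uselessly huge fraction.

! Sketch of a tolerance-based guard; the threshold value is an assumption.
program guard_demo
   implicit none
   double precision, parameter :: small_number = 1.0d-15
   double precision :: NUM, DENOM, N_Fd
   NUM   = 1.8d-4
   DENOM = 0.0d0                      ! e.g. VRELA or DPM was zero in this cell
   if (abs(DENOM) > small_number) then
      N_Fd = NUM / DENOM
   else
      N_Fd = 0.0d0                    ! skip the drag contribution for this cell
   end if
   print *, "N_Fd =", N_Fd
end program guard_demo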
I hope this helps, please let me know!
- Charles
That output is a little confusing since the different processes are all writing to the same output and it's getting mixed together. Here's an improved version, where we only write data on processing element (PE) number 1:
if((ep_s(IJK,1) .GT. ZERO) .AND. (Mug .GT. ZERO) &
   .AND. (ep_g(IJK) .GT. ZERO)) then
   !if (DEAD_CELL_AT(I,J,K)) cycle
   N_Power = -VOLPARTICLE * P_Grad_diff * VPARTICLE + &
             DgA * VRELA * VPARTICLE
   NUM = (-VOLPARTICLE * P_Grad_diff + DgA * VRELA)
   DENOM = (3.0d0 * pi * Mug * DPM * ep_g(IJK) * VRELA)
   if (PE .eq. 1) then
      write(*,*) "Mug=", Mug
      write(*,*) "DPM=", DPM
      write(*,*) "ep_g(", IJK, ")=", ep_g(IJK)
      write(*,*) "VRELA=", VRELA
      write(*,*) "NUM=", NUM
      write(*,*) "DENOM=", DENOM
   end if
   if (DENOM .EQ. 0) then
      if (PE .eq. 1) write(*,*) "WE HAVE A PROBLEM HERE!"
   else
      if (PE .eq. 1) write(*,*) "FRACTION=", NUM/DENOM
      N_Fd = NUM/DENOM
   end if
Output:
DEM NITs: 2 Total PIP: 9240
Mug= 1.8499999999999999E-005
DPM= 0.0000000000000000
ep_g( 0 )= 1.1931191281420263E-319
VRELA= 0.0000000000000000
NUM= 0.0000000000000000
DENOM= 0.0000000000000000
WE HAVE A PROBLEM HERE!
Mug= 1.8499999999999999E-005
DPM= 0.0000000000000000
ep_g( 0 )= 1.1931191281420263E-319
VRELA= 0.0000000000000000
NUM= 0.0000000000000000
DENOM= 0.0000000000000000
WE HAVE A PROBLEM HERE!
Mug= 1.8499999999999999E-005
DPM= 1.1999999999999999E-003
ep_g( 815 )= 0.55539130995879105
VRELA= 1.2602140033530917
NUM= 1.8624048638473306E-004
DENOM= 1.4644261762803337E-007
FRACTION= 1271.7642541584919
Mohsen - I see that Jeff Dietiker just fixed this today. We will be releasing an updated 21.1.1 version soon, containing this bug fix. Thanks for your patience.
- Charles
@mohsenclick: If you don't want to wait for 21.1.1 you can try applying this patch (and help us test):
8c3c6dea9f2c0e43deb79f03bd10ad63d9cc1c42.diff.txt (735 Bytes)
- Charles
@mohsenclick Please test the above fix if you get a chance. I am not sure why it would run for a few minutes before crashing in your case. It was failing immediately for me.
@jeff.dietiker @cgw Thank you so much! I will apply the patch and let you know if anything happens again.
@mohsenclick - FYI, the 21.1.1 version released today adds --use-hwthread-cpus to the mpirun command line; thanks for the suggestion.