Segmentation fault - invalid memory reference in TFM simulation

I was simulating a pseudo-2D fluidized bed with the TFM (Two-Fluid Model) in MFiX 20.2.1. I got this error:
Time = 0.72876E-01 Dt = 0.10000E-02
Nit P0 U0 V0 P1 U1 V1 G1 Max res
1 2.3E-09 1.2E-05 1.1E-03 2.2E-05 3.1E-05 2.6E-03 1.5E-02 G1

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7fab124bcd01 in ???
#1 0x7fab124bbed5 in ???
#2 0x7fab1218420f in ???
#3 0x5555f311c137 in __fun_avg_MOD_avg_x_e
at /home/mzarepour/Documents/MFiX Simulations/Laverman-Based on Reza Article/Laverman-2D-1/build/model/fun_avg.inc:43
#4 0x5555f32ba06f in __calc_s_ddot_mod_MOD_get_neighbor_vel_at_wall
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_s_ddot_s.f:630
#5 0x5555f32c4b03 in __calc_s_ddot_mod_MOD_calc_s_ddot_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_s_ddot_s.f:150
#6 0x5555f32a7f4c in __calc_gr_boundary_MOD_calc_grbdry
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_grbdry.f:148
#7 0x5555f31d77db in __source_u_s_mod_MOD_jj_bc_u_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:1072
#8 0x5555f31dac03 in __source_u_s_mod_MOD_source_u_s_bc
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:792
#9 0x5555f31dffed in __source_u_s_mod_MOD_source_u_s
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/source_u_s.f:466
#10 0x5555f31c141f in u_m_star
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/solve_vel_star.f:268
#11 0x5555f31beffe in __solve_vel_star_mod_MOD_solve_vel_star
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/solve_vel_star.f:134
#12 0x5555f3120c0a in _iterate_MOD_do_iteration
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/iterate.f:255
#13 0x5555f30acdb3 in run_fluid
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:188
#14 0x5555f30acb8c in run_mfix
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:142
#15 0x5555f30ad4d8 in mfix
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:315
#16 0x5555f30adb71 in main
at /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/mfix.f:269

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 6 with PID 0 on node simlab05 exited on signal 11 (Segmentation fault).

What caused this error?

Interesting. Can you provide the job files? Clicking “Submit bug report” in the main MFiX menu will create a zip file you can upload.

Generally segmentation faults mean that some array is being accessed beyond its allocated limits, often at a wall or boundary. We can probably tell you more if you provide the project files.
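One thing that often helps narrow this down is building a debug solver with runtime bounds checking, so the crash reports which array and index went out of range instead of a bare segfault. Something along these lines (a generic recipe - adjust it to your own setup):

build_mfixsolver --batch --dmp -j -DCMAKE_BUILD_TYPE=Debug -DCMAKE_Fortran_FLAGS="-g -fcheck=all"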

  • Charles

Hi Charles,
Attached is the bug report: laverman-2d-1_2021-03-15T154735.630975.zip (61.5 KB)

Changing the domain decomposition, or the total number of processes in use, only postpones the error for a while (a few minutes). Also, I have used the mpirun command with only 1 process, and it ran for hours without any issue.

So, I think it is related to the domain decomposition.

BTW, I am running the case on Ubuntu 20.04.1 LTS with gnu/9.3.0 and openmpi/4.0.4.

@cgw @jmusser I faced the same error after hours using only one process! So what do you think the problem is?

Hi - I’m running your case now, I’ll let you know when I have some answers.

  • Charles

Mohsen - I haven’t reproduced the seg fault yet, but I did notice one thing: your job will take over 500 days to run, at least with one process. Are you planning to run this on a large cluster?

Actually, I was just trying to see if I faced the same problem with one process, and I did. I ran the case on the cluster and it stopped with the same error! I used 14 and 21 processes on the cluster and it took less than 30 minutes to hit the error.
Also, since I run it from the terminal without the GUI, I use the following build and run commands:

build_mfixsolver -DCMAKE_BUILD_TYPE=Debug --batch --dmp -j -DMPI_Fortran_COMPILER=mpifort -DCMAKE_Fortarn_FLAGS="-o3 -g -fcheck=all" mpifort

mpirun --use-hwthread-cpus -np 14 ./mfixsolver -f Laverman-2D-1.mfx

looks like a typo - should be Fortran, not Fortarn - is that the actual command you used?

Oops! That is what I used! Do you think it could cause the issue? The build was successful.

No, I don’t think that’s the root cause, but it shows you are not building the solver quite the way you think you are. The build_mfixsolver script should be a little smarter and complain about invalid flags, but I see that it doesn’t … we will try to address this in a future release.
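For reference, the corrected build command would look something like this (note also the capital O in -O3 - a lower-case -o3 is gfortran’s output-file option, not an optimization level):

build_mfixsolver -DCMAKE_BUILD_TYPE=Debug --batch --dmp -j -DMPI_Fortran_COMPILER=mpifort -DCMAKE_Fortran_FLAGS="-O3 -g -fcheck=all"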


I have corrected the typo. However, I still hit the error, now with additional explanation in the output:

Now we’re getting somewhere, the -fcheck=all is doing its job! And that number - 987654321 is awfully suspicious -
In model/param1_mod.f

      INTEGER, PARAMETER :: UNDEFINED_I = 987654321

this is a special value used to represent “Undefined” or “None”
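Just to illustrate the failure mode we suspect (a stand-alone sketch, not the actual MFiX code path): if a neighbor index is never assigned and stays at UNDEFINED_I, the first array access that uses it runs far past the array bounds - exactly what -fcheck=all flags, and what an optimized build turns into a segfault:

      PROGRAM UNDEFINED_INDEX_DEMO
         IMPLICIT NONE
         ! Same sentinel value as in model/param1_mod.f
         INTEGER, PARAMETER :: UNDEFINED_I = 987654321
         DOUBLE PRECISION :: FIELD(100)
         INTEGER :: NEIGHBOR

         FIELD = 0.0D0
         ! The index is never set to a real cell, so it keeps the sentinel value
         NEIGHBOR = UNDEFINED_I
         ! With -fcheck=all this line aborts with an "above upper bound" error;
         ! without bounds checking it typically shows up as a SIGSEGV
         PRINT *, FIELD(NEIGHBOR)
      END PROGRAM UNDEFINED_INDEX_DEMO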

We will follow up on this and get back to you ASAP.

Also - as one of the main developers of the MFiX GUI, I’d like to know why you are not using it - the GUI was designed to make things easier for our users, including building the solver - if it is not easier for you, or there’s some obstacle to using it, we’d like to know why!

  • Charles

Thanks Charles,
I am looking forward to hearing from you.
I usually do not use the GUI since I need to use "--use-hwthread-cpus" in the compile command on my own machine to employ all of the available cores. In the GUI, I was not able to do that. Also, the university clusters and Compute Canada do not support graphical interfaces.

Mohsen -
--use-hwthread-cpus is a runtime flag for mpirun, not a compile flag. But you are right, it’s needed to access hyperthreads. We should add support for this flag in the GUI.
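In other words, the flag goes on the run line, not the build line - roughly (using your case file as the example):

build_mfixsolver --batch --dmp -j
mpirun --use-hwthread-cpus -np 14 ./mfixsolver -f Laverman-2D-1.mfx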

It’s easy to reproduce the crash - it happens whenever I run this case in DMP mode. It runs fine if I compile it and run it without openmpi. Several other things are unusual when I run in DMP mode, e.g. the “time remaining” estimates seem wrong. I don’t have a fix yet, but it looks like something is not getting initialized correctly when we run in DMP mode.

Thanks for following up on this case. There are some other minor issues with the GUI and I will list them for you soon. I hope you can find the cause of that error soon.

Charles -
Now I am trying to run my case using SMP. As far as I know, SMP runs parts of the program, such as the do loops, in parallel, which seems to perform worse than DMP with domain decomposition, and in many parts of the code the OpenMP sections have been commented out.

Why should we use SMP at all, when we are able to use DMP on a single CPU with multiple threads?

I mean, do you have a comparison: if we run the same simulation with the same number of threads on a single CPU, once with SMP and once with DMP, how do the run times change?
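For context, this is how I understand the two modes are built and launched (the OMP_NUM_THREADS part for SMP is just my reading of the documentation, so correct me if it is wrong):

SMP: build_mfixsolver --batch --smp, then
env OMP_NUM_THREADS=18 ./mfixsolver -f Laverman-2D-1.mfx

DMP: build_mfixsolver --batch --dmp, then
mpirun -np 18 ./mfixsolver -f Laverman-2D-1.mfx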

@cgw Charles, as I mentioned before, I ran the simulation using SMP (18 threads) and I got the same error after 2 hours of running time, at 0.4 s of simulation time. So maybe the problem is not related to DMP.

@cgw @onlyjus I changed the grid size to imax = 80, jmax = 186, kmax = 4 and used nodesi=1, nodesj=18 and nodesk=1, and again I got that error, in a slightly different form. Something is still being accessed beyond its allocated limits:
At line 617 of file /home/mzarepour/anaconda3/envs/mfix-20.2.1/share/mfix/src/model/calc_s_ddot_s.f
Fortran runtime error: Index ‘84’ of dimension 1 of array ‘im1’ above upper bound of 82

Error termination. Backtrace:

(If I understand the indexing, the upper bound of 82 is just imax = 80 plus two ghost cells, so an access at index 84 reaches two cells past the padded boundary.)

Do you have any idea how I can continue my simulations despite this problem?
Thank you!

Hi Mohsen -

In case it is not clear, you have found a bug in MFiX. We appreciate bug reports from our users.

I can reproduce the problem you reported when running your job in DMP mode. We’re looking into the problem, but we are not always able to fix bugs immediately. You may be able to get your jobs to run by continuing to adjust the grid size, etc., but there’s no guarantee that this will work - there’s an underlying problem with the code that needs to be fixed. I will also try SMP mode, since you mentioned you had an issue there too.

I’ll let you know when we have fixed this, it should not be too long. Sorry for the inconvenience.

– Charles