DEM simulation fails without error message

I'm not quite sure what's going wrong with this cylindrical settling bed case (gas phase turned off); it consistently fails at about "Time: 0.22537E-02" with no error message. Things I have tried playing with:

  • Mesh settings. The aspect ratios and sizes don’t seem to be out of the ordinary. Also, STL normals are pointing inward. Increasing the mesh size has no effect on the failure time.
  • DEM settings. Changing collision parameters (friction coefficient, dtsolid_fac, etc.) doesn’t seem to help.

As far as I can tell, the problem has something to do with a handful of particles at the bottom of the simulation, but I can’t figure out why. I’m not sure what else to try and am open to any suggestions!
input_files.zip (1.3 MB)

Hi Julia.

A few questions/comments -

  1. What environment are you running this in? Typically these "fails without error message" problems are due to limitations of the programming/debugging environment on Windows. The current version of MFiX prints a full stack trace for floating-point exceptions and segfaults on Linux platforms but does not do so reliably on Windows. This turns out to be a limitation of GFortran, so we're planning to switch from the GNU to the Intel Fortran compiler for Windows. I can offer some additional tips on debugging on Windows if that is the case.

Running on Linux I can reproduce your error and I see the following segmentation fault:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
[...]
#3  0x7f28ff434760 in __desgrid_MOD_desgrid_neigh_build
        at /home/cgw/Work/NETL/mfix/model/des/desgrid_mod.f:1060
#4  0x7f28ff460beb in __neighbour_mod_MOD_neighbour
        at /home/cgw/Work/NETL/mfix/model/des/neighbour.f:56
#5  0x7f28ff30eab4 in __des_time_march_MOD_des_time_step
        at /tmp/HARTIG/des_time_march.f:221
#6  0x7f28ff4c6a29 in run_dem
        at /home/cgw/Work/NETL/mfix/model/mfix.f:211
  2. In addition to your usr_* files, I see that you have modified versions of calc_force_dem.f, des_time_march.f, and drag_gs.f from the MFiX sources. I don't understand the motivation for your changes, but if I move your calc_force_dem.f out of the way and rebuild, the model runs without errors (see the commands sketched after this list).

  3. You are using MFiX 20.4; you should upgrade to 21.2 whenever convenient.
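
Re item 2: if you want to reproduce that test yourself, something like this from the run directory should work - move the file somewhere build_mfixsolver won't pick it up, then rebuild:

% mv calc_force_dem.f ../calc_force_dem.f.bak
% build_mfixsolver -j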

– Charles

Charles,

  1. I'm running this on Windows 10 in the MFiX GUI. I usually do run these on a Linux HPC with MPI, but I wasn't seeing segfaults there; I just got errors like:

" ************************************************************************
From: DESMPI_UNPACK_PARCROSS:
Error 1000: Unable to match particles crossing processor boundaries.
Source Proc: 10 —> Destination Proc: 11
Global Particle ID: 52666



(PE 11) : A fatal error occurred in des routines
*.LOG file may contain other error messages
*************************************************************"
Maybe this was a result of the original segfault error? I switched to Windows simply because I didn't want to keep waiting in the HPC queue for each test; it's faster for me to make changes and rerun in the Windows GUI than in a Linux shell when I'm just debugging code. Maybe that is bad practice given the gfortran issues you mentioned.

  2. I can delete those modified subroutines; they're left over from prior simulations where I wanted to assign user-defined variables and monitor things like particle-particle coordination number.
  3. I've been having a lot of issues with the 21.2 Windows GUI (which is where I set up simulations before sending them to the HPC), which is why I've been sticking to 20.4.2, but I will try to sort those out soon.

Here’s a little more debug info (from gdb on Linux): With your modified calc_force_dem.f the index lijk is going negative:

1060	               lneigh = dg_pic(lijk)%p(lpicloc)

(gdb) p lpicloc
$1 = 1
(gdb) p lijk
$2 = -732

What issues are you having with the 21.2 GUI? We would like to know about any problems our users are having with the new release. Thanks.

– Charles

I think there's something wrong with the Qt plugin for me. I don't see any errors when I open the 21.2 GUI, except this message in the Anaconda Command Prompt:
“mfixgui.tools.qt - WARNING - Style from settings not available, ignoring: windowsvista”

If I try to build the solver, for example, I see some garbled text (see the example below). I installed via Anaconda (not pip or the tarball), if that helps.

I'm currently trying to learn how to use gdb so I can retrace the steps you used to diagnose this issue. If I compile with GNU Fortran on Linux (using "build_mfixsolver FCFLAGS='-g -O0' -j") and then try jumping into gdb with "gdb mfixsolver", gdb starts up successfully but gives the error:
'./mfixsolver: not in executable format: File format not recognized'.
Is there an easy fix for this? Am I doing something completely wrong?

Hi Julia - two-part reply:

(1) Regarding gdb - you’re not doing anything wrong - you’re on the right track. There’s just some extra complexity involved.

The MFiX solver can be built in two different ways. In the "legacy" (or batch) mode, build_mfixsolver produces a normal Fortran binary executable.

When we run MFiX in the GUI there are some additional features - the ability to pause and modify settings and resume, and monitoring of job status - that are implemented in a Python process that uses HTTP to communicate with the GUI. In this mode, the Fortran MFiX code gets turned into a library which is “wrapped” in Python. So the mfixsolver you tried to run in GDB is actually a Python script (“pymfix”) - or more correctly a shell script that starts the Python script which imports the wrapped Fortran code. Let’s take a peek behind the curtains:

The file command on Linux, if you are not familiar with it, is very useful: it attempts to deduce the type of any file you give it. So when gdb says "not in executable format", you can see what file says:

% file mfixsolver
mfixsolver: symbolic link to /tmp/Absorption_column_2d/mfixsolver.sh

% file /tmp/Absorption_column_2d/mfixsolver.sh
mfixsolver.sh: POSIX shell script, ASCII text executable

% cat mfixsolver.sh
env   PYTHONPATH="/tmp/Absorption_column_2d":"/home/cgw/Work/NETL/mfix":${PYTHONPATH:+:$PYTHONPATH} /usr/lib/python-exec/python3.9/python -m mfixgui.pymfix "$@"

You can debug this module by starting Python itself in gdb (with the correct PYTHONPATH), then importing and running pymfix - at the GDB prompt you type run -m mfixgui.pymfix -f /path/to/PROJECT.mfx
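
Concretely, the session looks roughly like this (the paths are placeholders - substitute the real ones from your mfixsolver.sh, and use the same Python it invokes):

% export PYTHONPATH=/path/to/project:/path/to/mfix
% gdb python3
(gdb) run -m mfixgui.pymfix -f /path/to/PROJECT.mfx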

but this is the slightly more difficult way of doing it - the easier way is to build the batch-mode solver that doesn’t have the Python extensions.

To do this, add --server=none to your build_mfixsolver command line, or, if you are using the GUI, go into "Settings" on the main menu and click "Enable developer mode"; the build popup will then show a new selector for "Interactive support", which you should set to "None (batch)".
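
From the command line the full build might look like this (the same debug flags you used earlier, plus the new option):

% build_mfixsolver FCFLAGS="-g -O0" --server=none -j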

Either way, when you build you will get a directly debuggable executable:

% file mfixsolver
mfixsolver: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped

You can start this in GDB, but you have to specify the name of the mfix input file - if no name is specified, the solver will look for the “traditional” name of mfix.dat. So either call your input file mfix.dat, or, when you start the process in gdb, specify the path to the input file with the -f flag like so:

%  gdb mfixsolver
[...]
(gdb)  run

 **********************************************************************
 From: mfix.f
 Error 1000: Input data file does not exist: mfix.dat
Aborting.
 **********************************************************************

(gdb)  run  -f Absorption_column_2d.mfx 
[MFiX simulation starts running]

Another issue you will likely run into on repeated debugging runs is that MFiX will exit if the output files are already present. In that case you get an error that looks something like

ERROR open_files.f:206
File exists but RUN_TYPE=NEW
Cannot create file: ABSORPTION_COLUMN_2D.RES

and you simply have to delete the output files and try again (this is usually handled for you by the GUI).
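
In a shell that just means removing the leftover output files before re-running, e.g. (the .RES file named in the error, plus whatever other outputs the previous run wrote):

% rm ABSORPTION_COLUMN_2D.RES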

Note that the batch-mode solver will not work correctly if you try to launch it from the GUI - the necessary communication between GUI and solver is not enabled, so the front end won’t be able to stop/pause the solver or track its progress. So, for normal use I do not recommend the batch solver. We have a goal of making debugging in the GUI easier (integrated debugger) and removing this batch/interactive distinction but that is still a long way off.

=============================================================

(2) The garbled text in the build output is really just the space character rendering as an “á” - this is a slightly embarrassing Unicode vs ASCII mixup which is happening on some platforms - we caught this right after the 21.2 release went out. It does not affect the compilation of the solver in any way. I have a fix for it, which will be present in 21.3 (Sep/Oct) or 21.2.1 if we decide to do a bugfix release before then. I think the “Style not available” message can be ignored. Please let me know if you see any other irregularities in the 21.2 GUI - we depend on feedback from users like you. Thanks!

– Charles

Charles,

(1) Thank you so much for your (always) incredibly detailed responses! Your explanation of the alternative methods is very helpful. I suspected from reading other forum threads that the mfixsolver script could be a bash/Python script instead of a binary executable, but I didn't know what to do with that information.

I added the --server=none flag and was able to get an executable mfixsolver file, but when I try to run it in gdb with "run -f fileName.mfx", I get a library error:

error while loading shared libraries: libgfortran.so.5: cannot open shared object file: no such file or directory

I used the batch build line you suggested ("build_mfixsolver FCFLAGS='-g -O0' --server=none -j") with gcc/10.2.0 and gdb/10.1, if that makes a difference. It is worth noting that my "file mfixsolver" output is a bit different from yours; I don't see a "with debug_info" entry:

% file mfixsolver
mfixsolver: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not stripped

(2) I see. The solver does seem to compile fine from the GUI, and the UI is responsive, so otherwise 21.2 seems to be working for me!

Does the batch-mode mfixsolver run outside of GDB?

What does
ldd ./mfixsolver
say?

Charles,

As far as I can tell, yes - if I build the solver in legacy mode with --server=none and no debugging FCFLAGS, I can run ./mfixsolver on a compute node and the simulation runs with no errors.

See below for the ldd ./mfixsolver output.

The line
libgfortran.so.5 => not found
is consistent with what GDB is saying - the Fortran library can’t be found. If ldd can’t find the dependent libs, then the process really should not be able to start. Are you running gdb on a compute node? Is it the same node where you ran build_mfixsolver? This looks like an environment problem, libraries in non-standard places, or something like that.

– Charles

Charles,

Interesting, okay. I usually run build_mfixsolver on what our HPC calls a “compile node” and then sbatch the job, which runs on a separate compute node. Interactive nodes on our HPC (where sinteractive can be run with gdb) are in a different partition than regular compute nodes. Maybe I should try running ldd on different nodes and see if the results are different? This sounds a bit out of my wheelhouse so I’ll try contacting our research computing staff and see if they can help. Thank you for the pointers!

You should be able to just run gdb on the compile node. If that's not possible, you can set LD_LIBRARY_PATH to the right location - this may already be happening as part of your batch submission. Generally everything will be easier for you if you can compile and debug on the same node. Starting the batch mfixsolver from inside gdb works on Windows too (conda install -c conda-forge gdb) - there are no core dumps or post-mortem debugging available on Windows, but you can start the solver in the debugger and that should work.
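
For example, assuming the gfortran you built with (the gcc/10.2.0 module) is on your PATH, something along these lines should locate the runtime library and make it visible; treat the resulting paths as system-specific:

% gfortran -print-file-name=libgfortran.so.5
% export LD_LIBRARY_PATH=$(dirname $(gfortran -print-file-name=libgfortran.so.5)):$LD_LIBRARY_PATH
% ldd ./mfixsolver | grep libgfortran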

– Charles

There must be something weird going on, because if I build on a compile node, it builds successfully, but then if I run ldd ./mfixsolver on that same compile node, I get the libgfortran.so.5 error again. Maybe only our compute nodes have access to libgfortran, to prevent students from running jobs on a compile node?

Edit: I was able to run gdb successfully on a compile node, although I don't understand how/why it's able to run successfully with a missing library link.

Before digging deep into debugging, can you please confirm you want to run DEM on 50-micron-ish particles? This is going to be pretty slow.

I would recommend first removing all UDFs, turning off cohesion, and keeping the default DES time step (set dtsolid_factor = 50), then seeing if the error still occurs. The error you see usually occurs when particles are too soft and pass through the walls. Increasing the spring stiffness sometimes helps.
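
For reference, those settings correspond to keywords along these lines in your .mfx file (keyword names per the MFiX DEM documentation; the stiffness values are only placeholders, so check what is reasonable for your material):

dtsolid_factor = 50.0
use_cohesion   = .False.
kn   = 1.0d4    ! particle-particle normal spring stiffness (N/m)
kn_w = 1.0d4    ! particle-wall normal spring stiffness (N/m)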

Jeff,

Yes, unfortunately for our application, 50 microns is about the largest particle size of interest. Most of our powders are much finer (e.g. 1-10 microns). I'm hoping coarse graining will open up some new possibilities for speedup, but I haven't tested those simulations yet.

I did try the DES time step and cohesion suggestions you mentioned; unfortunately, they didn't fix the problem. As Charles identified, my modified calc_force_dem.f subroutine turned out to be the problem, due to an index (lijk) going negative. We've since removed it and the simulation now runs to completion just fine. But I thought this would be a good opportunity to learn the deeper debugging tools like gdb, since we now know what went wrong.

Charles,

Alright, I got it! I finally made it to the segfault at line 1060 in desgrid_mod that you mentioned above, so I think I have a functional workflow now. Thanks a lot for your help! I'll have to go through some gdb tutorials to learn the basics.