Run error on HPC resources

jschirck · September 5, 2022, 8:36pm

BUG REPORT

Type of issue
The solver crashes with error: “There are no standard fluid cells in the computation.”

Description
Screenshots of the error are in the attached zip file. The simulations are run from the command line on HPC resources. The error occurs with both TFM and DEM, and with both the tutorials and the provided .mfx file.

Attempts to fix the issue
The first and most important thing to note is that this simulation has run on one HPC cluster but does not run on a different (the current) HPC cluster. This makes me believe the problem lies in the modules used or in the compiling process. However, there are no errors during compilation. I have tried MFiX versions 22.2.1 and 21.4 (the same error occurs for both). For modules, I load openmpi, gcc, conda, and cmake. I have also tried to run 2 different tutorial simulations (DEM: 3D hopper and drum), both of which give the same error, so I have concluded that the .mfx file is not the issue. This .mfx file also calls for UDFs and a modified solids conductivity, which requires source code modifications. I removed both the UDFs and the modified source code files to make sure they are not the problem; still no luck. As I said, I think it has to do with the modules/compiling, but I do not know what to try next. Thank you in advance for your time.

Attach project files
Silica_Error.zip (50.3 KB)
I hope I have attached the correct files. When running through the command line, I do not know how to create a bug report, so I think I attached everything. I would be happy to provide more if necessary.

Jason Schirck

onlyjus · September 6, 2022, 4:28pm

Have you tried to run it in serial on the cluster that does not work?

Looks like you are using a spack built gcc-8.4:
/nopt/nrel/apps/base/2020-05-12/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/gcc-8.4.0
and openmpi 4.1.0:
/nopt/nrel/apps/openmpi/4.1.0-gcc-8.4.0-j15/bin/mpif90

When I use spack, the openmpi build is also built with spack. Paths just look a little goofy to me.
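One way to check whether the loaded gcc and openmpi actually belong together is to ask the shell which executables come first on PATH and what the MPI wrapper calls underneath. This is only a sketch, assuming a POSIX shell; `report_tool` is a helper name made up for this example. `--showme` is the Open MPI wrapper option, and MPICH-style wrappers use `-show` instead.

```shell
#!/bin/sh
# Print where a tool resolves on PATH, or say that it is missing.
# report_tool is a hypothetical helper, not part of MFiX or any module system.
report_tool() {
    tool="$1"
    if found=$(command -v "$tool" 2>/dev/null); then
        printf '%s -> %s\n' "$tool" "$found"
    else
        printf '%s -> not found\n' "$tool"
    fi
}

# Tools that matter for an MPI build with the GNU compilers.
for tool in gcc gfortran mpif90 mpirun; do
    report_tool "$tool"
done

# Show the underlying compiler command behind the wrapper, if one is loaded.
# Open MPI understands --showme; MPICH and Intel MPI wrappers understand -show.
if command -v mpif90 >/dev/null 2>&1; then
    mpif90 --showme 2>/dev/null || mpif90 -show 2>/dev/null || true
fi
```

If the gfortran path printed by the wrapper differs from the gfortran found on PATH, the two modules may have been built against different compilers.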

jschirck · September 6, 2022, 8:08pm

I have not run in serial before, but I typed into the command line: ./mfixsolver -f silica.mfx and got a very long error message beginning with the image shown below.

Sorry I do not know much about the modules (or modules in general). I just know which ones I normally have to load by using: module load openmpi, etc. Maybe it was the order I loaded them in? Or maybe I just need to load openmpi and gcc gets lumped in?

Thank you for responding!
Jason Schirck

cgw · September 7, 2022, 12:27pm

You got a core file, can you get a stack trace?

$ gdb ./mfixsolver core
(gdb) where

then copy/paste the result here (as text, not screenshot).

Optionally, use the “logging” feature in gdb to avoid having to copy/paste -

(gdb) set logging file /tmp/x.txt
(gdb) set logging on
(gdb) where

then attach the x.txt file.

Thanks!
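The same stack trace can also be captured without an interactive session by using gdb's batch mode. The sketch below assumes the core file is named `core` and sits next to the solver; `save_backtrace` is a made-up helper name, and the output path is only an example.

```shell
#!/bin/sh
# Write a backtrace from a core file to a text file using gdb batch mode.
# save_backtrace is a hypothetical helper for this example.
save_backtrace() {
    exe="$1"
    core="$2"
    out="$3"
    if [ ! -f "$core" ]; then
        echo "no core file at $core" >&2
        return 1
    fi
    # -batch exits after running the commands; -ex runs one gdb command.
    gdb -batch -ex where "$exe" "$core" > "$out" 2>&1
}

# Example call; the file names here are assumptions.
save_backtrace ./mfixsolver core /tmp/x.txt || echo "backtrace not written"
```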

jschirck · September 7, 2022, 2:14pm

Hi Charles,

There appears to be no stack.
gdb_log.txt (10 Bytes)

I was puzzled at first when I saw this. I tried deleting all of the cmake files and solver files and recompiling and re-running. I got the same “No fluid cell error”, and I got no stack again as well.

Thanks for your help!
Jason

cgw · September 7, 2022, 4:34pm

From the screenshot, it looks like we’re never getting past mpi_init - this is failing at startup, before any real MFiX-specific code is executing - so it looks like potentially a problem with your mpi setup.

I would try to build and run a trivial MPI “Hello world” program - for example

tools/mpi/hello_world.f90

hello_world.f90 (1.1 KB)
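For readers without the attachment, a test of this kind might look like the sketch below. The Fortran source here is a minimal stand-in written for this example, not the attached hello_world.f90, and the file names are assumptions. It needs `mpif90` and `mpirun` on PATH, so the MPI module must be loaded first.

```shell
#!/bin/sh
# Build and run a tiny MPI program to test the MPI installation on its own.
workdir=$(mktemp -d)
src="$workdir/hello_mpi.f90"

# Minimal stand-in source (not the file attached above).
cat > "$src" <<'EOF'
program hello_mpi
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  print '(a,i0,a,i0)', 'Hello from rank ', rank, ' of ', nprocs
  call MPI_Finalize(ierr)
end program hello_mpi
EOF

if command -v mpif90 >/dev/null 2>&1 && command -v mpirun >/dev/null 2>&1; then
    if mpif90 -o "$workdir/hello_mpi" "$src"; then
        # Two ranks are enough to show that startup and communication setup work.
        mpirun -np 2 "$workdir/hello_mpi" || echo "MPI run failed"
    else
        echo "MPI compile failed"
    fi
else
    echo "mpif90 or mpirun not on PATH; load the MPI module first"
fi
```

If this program runs but the solver still fails inside MPI startup, the MPI installation itself is probably fine and the build of the solver is the more likely place to look.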

jschirck · September 7, 2022, 10:32pm

Update on the serial run: I was able to get the simulation running normally in serial with the help of a colleague, but the parallel runs have the same error.

Update on the mpi test: It looks like mpi is working alright.

Here is the compile command line I am using. Do you see a problem with it?
cmake ~/MFiX_Source/mfix-22.2.1 -DCMAKE_BUILD_TYPE=Release -DENABLE_MPI=1 -DCMAKE_Fortran_COMPILER=mpif90 -DCMAKE_Fortran_FLAGS="-O2"
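To see what a configure line like this actually produced, one option is to read the values back from CMakeCache.txt in the build directory and to check which MPI libraries the solver links against. This is a sketch under assumptions: `cache_value` is a made-up helper, the build directory defaults to the current directory, and `ldd` is assumed to be available (as on most Linux clusters).

```shell
#!/bin/sh
# Print the value of one entry from a CMakeCache.txt file.
# Cache lines have the form NAME:TYPE=VALUE.
# cache_value is a hypothetical helper for this example.
cache_value() {
    sed -n "s/^$1:[^=]*=//p" "$2"
}

# Build directory to inspect; defaults to the current directory.
build_dir="${1:-.}"

if [ -f "$build_dir/CMakeCache.txt" ]; then
    for name in CMAKE_Fortran_COMPILER CMAKE_Fortran_FLAGS CMAKE_BUILD_TYPE ENABLE_MPI; do
        printf '%s = %s\n' "$name" "$(cache_value "$name" "$build_dir/CMakeCache.txt")"
    done
fi

# List the MPI shared libraries the solver resolves at run time.
if [ -x "$build_dir/mfixsolver" ]; then
    ldd "$build_dir/mfixsolver" | grep -i mpi || echo "no MPI libraries linked"
fi
```

Comparing the library paths from `ldd` with the MPI module that is loaded at run time shows whether the solver was built against one MPI and is being launched with another.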

jschirck · September 8, 2022, 9:37pm

Update: MFiX is now working in parallel if I use Intel modules. Specifically, I loaded in comp-intel/2020.1.217, intel-mpi/2020.1.217, and cmake/3.18.2. With these modules, I had to modify the compile line to: cmake ~/MFiX_Source/mfix-22.2.1 -DCMAKE_BUILD_TYPE=Release -DENABLE_MPI=1

Comment: I believe it is an issue with the version of gcc or openmpi. On the previous cluster I was using gcc/6.3.0 and openmpi/2.1.6, which ran the simulations fine. On the current cluster, the versions available are gcc/8.4.0 and openmpi/4.1.0, which do not seem to work. A colleague of mine tested gcc/7.5 and said it works, so maybe versions higher than 7.5 have issues?

Thank you, everyone, for your help. I found a workaround, but maybe this will spark a conversation and someone will find another solution.