MFiX suddenly stops on HPC cluster

Hello,

I have been trying to run a relatively simple CGP case on my HPC cluster. However, after running normally for a while, it suddenly crashes. I have been using the same cluster for long TFM cases without any problem, but this case seems to fail every time (also with plain DEM). The case has a usr1.f that just varies the inlet velocity over time to measure Umf (it is not the source of the issue, because I have tried without it, but I attach it in any case).
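For context, the routine is essentially of the following form. This is only a simplified sketch; the boundary-condition index, starting velocity, and ramp rate are illustrative values, not the ones in the attached file.

      SUBROUTINE USR1

! Modules providing the boundary-condition arrays and the current time
      use bc,  only: BC_V_g
      use run, only: TIME

      IMPLICIT NONE

! Index of the gas-inlet boundary condition (illustrative value)
      INTEGER, PARAMETER :: BCV = 1
! Starting superficial velocity and ramp rate (illustrative values)
      DOUBLE PRECISION, PARAMETER :: V_START = 0.60d0   ! m/s
      DOUBLE PRECISION, PARAMETER :: RAMP    = -0.05d0  ! m/s per s

! Decrease the inlet gas velocity linearly in time so the pressure
! drop vs. velocity curve can be used to estimate Umf.
! (Requires call_usr = .True. in the project file.)
      BC_V_g(BCV) = V_START + RAMP*TIME

      RETURN
      END SUBROUTINE USR1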

Please find attached the usr1.f file, the .mfx file, and the Slurm error message. The .LOG file does not contain any error message, and the output file just says:

[1660042501.009526] [node9:413668:0] sock.c:344 UCX ERROR recv(fd=109) failed: Connection reset by peer
[1660042501.009861] [node9:413670:0] sock.c:344 UCX ERROR recv(fd=44) failed: Connection reset by peer

slurm.node9.1643.txt (2.1 KB)
umf_biomass.mfx (17.6 KB)
usr1.f (2.7 KB)

Could you please help me identify the source of my problem?

Thank you!

When does it fail (first iteration, after x seconds of simulation time)?
What is your parallel decomposition?
Do you see any abnormal flow behavior just before it crashes?

Hello Jeff,
thank you for your answer.
It fails after some simulation time, never at the same point, but always between 0.1 s and 0.2 s. I don't see any abnormal behavior in the simulation; it just suddenly stops, and the messages I sent you are written in the log file.
My parallel decomposition is currently 5x4x1. It has also failed with a higher number of nodes. In fact, when decomposing with 8x5x1 it fails while seeding the particles.
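In the .mfx file this is set with the standard DMP decomposition keywords, i.e. for the 5x4x1 layout:

nodesi = 5
nodesj = 4
nodesk = 1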
Thank you for your support!

Hello again,

Just to clarify a bit more: I am using DMP decomposition. I tried SMP, but it did not work at all. I always use DMP for TFM cases on the cluster without any problem.

Thanks!

I am not sure; it runs fine for me in both SMP and DMP, including the 8x5x1 partition. I use the

gnu/8.4.0
openmpi/4.0.3_gnu8.4
cmake/3.19.1

modules.

Hello Jeff,
thank you for your answer. Please see my compilation options below. Do you see anything anomalous? Do you think it is worth reinstalling some of the gnu or openmpi versions?
Thanks!

-- Setting build type to 'RelWithDebInfo' as none was specified.
-- MFIX build settings summary:
--   Build type      = RelWithDebInfo
--   CMake version   = 3.24.0
--   Fortran compiler =
--   Fortran flags    =
--   ENABLE_MPI       = 1
--   ENABLE_OpenMP    = OFF
--   ENABLE_CTEST     = OFF
--   ENABLE_COVERAGE  = OFF
-- The Fortran compiler identification is GNU 8.4.1
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /usr/bin/f95 - skipped
-- Performing Test ffpe_trap
-- Performing Test ffpe_trap - Success
-- Performing Test ffpe_summary
-- Performing Test ffpe_summary - Success
-- Found MPI_Fortran: /usr/local/lib/libmpi_usempif08.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Found Git: /home/edu/.conda/envs/mfix-22.2.1/bin/git (found version "2.37.2")

@edcanop -
This is almost certainly an issue with OpenMPI on your cluster. The "sock.c UCX ERROR" message comes from UCX, the communication layer used by OpenMPI.

https://www.google.com/search?q=sock.c+"UCX+ERROR"+"Connection+reset+by+peer"

I suggest you work with your systems people to see if openmpi can be upgraded.

Also note that gfortran 8 is 5 years old at this point. Although it works, I suggest upgrading the compiler as well (all gfortran versions up to the current release, 12, work with MFiX, and newer compiler versions produce better code).

– Charles

Hello,

Thank you for your help. I will take a look at the OpenMPI and gfortran versions.

Thank you!