SIGILL: Illegal instruction when running MFiX on HPC

Hi everyone!

Recently, I wanted to submit my case to the HPC at my school, so I first decided to run a simple case on the HPC as a test. It is a very simple case about combustion of CH4:

Reaction.mfx (17.2 KB)
usr_rates.f (4.6 KB)

On Ubuntu 24.04 on my own computer, the case runs well. However, when I submit it to the HPC, it reports an error like:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:

#1 0x5fa610 in __dgtsv_mod_MOD_dgtsv
   at /home/svu/e1454408/mfix-24.2.3/src/model/DGTSV.f:190

The whole log of err is attached:

err.txt (32.8 KB)

In detail, I built the solver from source on the HPC using the commands below:

cmake .. -DENABLE_MPI=1 -DMPI_Fortran_COMPILER=mpifort -DCMAKE_Fortran_FLAGS="-O2 -g"
make -j

and I built it with the modules cmake/3.20.6, gcc/7.3.0 and openmpi/4.0.0. The solver built successfully. However, when I submit the case, it fails with the error above. The HPC at our school uses the PBS system, and I am not sure whether my submit script is right. The HPC's sample MPI script recommends xe_2015, but I found that xe_2015 could not configure the solver. The sample script and my submit script are attached:

my_submit.txt (570 Bytes)
sample_script_of_school.txt (902 Bytes)

How should I solve this problem? Thank you very much!

Hi @cifer

  1. gcc 7.3.0 is pretty old - is this the most recent compiler available on your HPC?

  2. Did you build the solver on the same type of machine as you are running it on? That is, did you build on your own Ubuntu system or on one of the HPC nodes?

  3. Can you please repeat the second step of the build: first do make clean then

make -j VERBOSE=1 |& tee build.log

and upload the build.log here.

Thanks.

Thank you very much!

  1. I tried gcc 9.2.0 on the HPC node (it is the latest version on our school's HPC), and the configuration is successful, but when I use the command 'make -j', it reports an error:

/home/svu/e1454408/mfix-24.2.3/src/model/dmp_modules/compar_mod.f:18:12:

   18 |     USE mpi
      |            1
Fatal Error: Cannot read module file 'mpi.mod' opened at (1), because it was created by a different version of GNU Fortran
compilation terminated.

  2. Yes, in fact, I built the solver with cmake/3.20.6, gcc/7.3.0 and openmpi/4.0.0 on an HPC node, and the build was successful.

  3. I tried what you suggested and this is the log:

build.txt (560.1 KB)

By the way, the error still exists when I submit the case.

Thank you very much again!

  1. If you switch compiler versions you should do a "make clean" before building with the newer compiler.

  2. On the HPC nodes, what are the contents of cat /proc/cpuinfo (see the sketch below)? Is it possible that the node you built on is different from the worker nodes?
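If it helps, here is a rough sketch for comparing the login/build node with a worker node (the queue name is a placeholder; substitute one of your PBS queues):

# on the login/build node: CPU model and the vector-instruction flags it supports
grep -m1 'model name' /proc/cpuinfo
grep -m1 'flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'avx|sse4|fma' | sort -u

# the same check from a worker node, submitted as a tiny PBS job
echo 'hostname; grep -m1 "model name" /proc/cpuinfo' | qsub -q <your_queue>

A solver built with -march=native on a newer CPU will raise SIGILL as soon as it executes an instruction (e.g. AVX2 or FMA) that an older worker-node CPU does not implement.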

  1. Yes, in fact, I used gcc/9.2.0 to compile in a new directory; before I compiled, there were only the mfx file and usr_rates.f in the directory.

  2. This command lists information for all 40 processors; the full output is attached:

Cpuinfo.txt (51.8 KB)

At our school, we can only choose a hostname to connect to the HPC, so I am not very sure whether the node I compile on is the same as the worker nodes.

Thank you very much!

Ok, the mpi.mod incompatibility is due to the MPI module you have loaded on your HPC. Note that the MFiX Conda packages include mpi 5 and gfortran 14, so you don’t have to load any modules at all. Can you try building and running the solver without loading environment modules?
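Roughly, a minimal sketch of what I have in mind, assuming a conda/miniforge install with an MFiX environment (adjust the path and environment name to yours):

# start from a clean shell, with no HPC environment modules loaded
module purge
# activate the conda environment that provides MFiX
source ~/miniforge3/etc/profile.d/conda.sh
conda activate mfix-24.2.3
# sanity check: these should resolve inside the conda environment
which mpirun gfortran
# build the DMP solver in the project directory
build_mfixsolver --batch --dmp -j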

The “illegal instruction” looks like it’s coming from a difference between the build node and the run node… are you sure the host you are building on is the same type?

If all else fails, try this:

Edit the file $CONDA_PREFIX/share/mfix/src/model/CMakeLists.txt and remove (or comment out) this section:

if (APPLE)
  set (MARCH "")
else()
  if (DEFINED ENV{CONDA_BUILD})
     set (MARCH "-march=haswell")
  else()
     set (MARCH "-march=native")
  endif()
endif()
if (DEFINED ENV{MARCH})
  set (MARCH "-march=$ENV{MARCH}")
endif()
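Alternatively, since the last stanza above honors a MARCH environment variable, a less invasive option (a sketch, untested here) is to export a conservative target before building instead of editing the file; -march=x86-64 is gcc's generic 64-bit baseline:

# pick an instruction set that every node in the cluster supports
export MARCH=x86-64
build_mfixsolver --batch --dmp -j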

The HPC at our school will be undergoing maintenance over the next few days, so I will try next Monday.

In fact I am not very sure whether the build node and the run nodes are the same… Students just use a hostname to connect to the HPC, compile the solver, and submit the job script; the job script can specify queues, but not nodes.

Thank you for your reply!

Thank you for your attention to this issue!

This time, I did not attempt to build the solver directly from source, but built it in the conda environment without loading any modules. However, when submitting the script to run the solver, I had to load the openmpi module, otherwise the HPC cannot find the mpirun command.

Unfortunately, whether or not I edited the CMakeLists.txt as you suggested, after I submitted the script the HPC system always showed that my job was running, but no files appeared in the working directory. This continued until the system automatically killed the job when it reached the walltime.

I will also try to communicate with the staff responsible for HPC at the school about this issue. Thank you very much!

That’s surprising. There’s an MPI implementation included with the conda package, so if the mfix environment is active when you submit the job, the mpirun from the conda mfix package should be found… unless somehow your batch system is set up to not copy over the PATH environment variable that was in effect when the job was submitted.
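Two common ways to handle that in a PBS job script, as a sketch (the path and environment name are placeholders, adjust to your install):

# option 1: ask PBS to export the submission-time environment into the job
#PBS -V

# option 2: activate the conda environment explicitly inside the script
source /path/to/miniforge3/etc/profile.d/conda.sh   # placeholder path
conda activate mfix-24.2.3
which mpirun    # should point into the conda environment, not a system MPI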

I tried activating the mfix environment in the script submitted to the HPC, and it seems to work. However, it reports a new error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x2aec39be42ef in ???
#1 0x55608fc172e7 in __stiff_chem_tfm_MOD_mapmfixtoode_tfm
   at /home/svu/e1454408/miniforge3/envs/mfix-24.2.3/share/mfix/src/model/chem/stiff_chem_tfm.f90:384

(the same message and backtrace are printed by each MPI rank, with the lines interleaved)

The whole error log is attached below:
err1.txt (2.9 KB)

This case works well on my own Ubuntu machine, so this looks strange. Do you have any suggestions?

Thank you very much!

This is interesting. Can you upload the updated project files so I can take a look? What is the domain decomposition you are using? The "Reaction.mfx" at the beginning of the thread has nodesi=nodesj=nodesk=1.

OK, in fact, the mfx file is the same as before:

Reaction.mfx (16.8 KB)

but the script I submit to the HPC is below:

#!/bin/bash
#PBS -N MFiX_test
#PBS -q parallel24
#PBS -l select=1:ncpus=24:mpiprocs=24:mem=128GB
#PBS -l walltime=23:00:00
#PBS -o /home/svu/e1454408/MFiX_work/Test_CH4/out.log
#PBS -e /home/svu/e1454408/MFiX_work/Test_CH4/err.log

cd $PBS_O_WORKDIR; ## this line is needed, do not delete and change.
np=$( cat ${PBS_NODEFILE} |wc -l ); ### get number of CPUs, do not change

cd /home/svu/e1454408/MFiX_work/Test_CH4
source /home/svu/e1454408/miniforge3/etc/profile.d/conda.sh
conda activate mfix-24.2.3
mpirun -np 24 ./mfixsolver_dmp -f Reaction.mfx NODESI=6 NODESJ=4 NODESK=1

I used NODESI=6 NODESJ=4 NODESK=1 in the script; maybe I need to set them in the mfx file instead?

In addition, if I set nodesi, nodesj and nodesk in the mfx file, the error still occurs.
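For reference, what I mean by setting the decomposition in the mfx file is just the keyword lines below (a sketch following the same keyword = value style as the rest of the file); as I understand it, the product nodesi*nodesj*nodesk has to equal the number of MPI ranks, so 6*4*1 = 24 matches mpirun -np 24 in the script:

nodesi = 6
nodesj = 4
nodesk = 1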

I ran this on our HPC cluster with the 6x4x1 domain decomposition and did not get any errors. How did you build the solver module?

As you mentioned, I built the solver without loading any modules: I just activated the mfix environment and built the solver with the command 'build_mfixsolver --batch --dmp -j' on the command line.

In fact, the solver successfully ran for some time steps and reported the error at t = 0.00216 s. By the way, I tried another (paid) HPC and everything went well, so I think the key issue is our school's HPC. I will discuss this problem with the staff responsible for the HPC at our school.

Thank you very much!