SIGILL: Illegal instruction when running MFiX on HPC

Hi everyone!

Recently, I wanted to submit my case to the HPC at my school, so I first decided to run a simple case on the HPC as a test. It is a very simple case about combustion of CH4:

Reaction.mfx (17.2 KB)
usr_rates.f (4.6 KB)

On Ubuntu 24.04 on my own computer, the case runs well. However, when I submit it to the HPC, it reports an error like:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:

#1 0x5fa610 in __dgtsv_mod_MOD_dgtsv
   at /home/svu/e1454408/mfix-24.2.3/src/model/DGTSV.f:190

The whole log of err is attached:

err.txt (32.8 KB)

In detail, I built the solver from source on the HPC using the commands below:

cmake .. -DENABLE_MPI=1 -DMPI_Fortran_COMPILER=mpifort -DCMAKE_Fortran_FLAGS="-O2 -g"
make -j

and I built it with the modules cmake/3.20.6, gcc/7.3.0 and openmpi/4.0.0. The solver built successfully. However, when I submit the case, it fails with the error above. The HPC at our school uses the PBS system, and I am not sure whether my submit script is right. The HPC's sample MPI script recommends xe_2015, but I found that xe_2015 could not configure the solver. The sample script and my submit script are attached:

my_submit.txt (570 Bytes)
sample_script_of_school.txt (902 Bytes)

How should I solve this problem? Thank you very much!

Hi @cifer

  1. gcc 7.3.0 is pretty old - is this the most recent compiler available on your HPC?

  2. Did you build the solver on the same type of machine as you are running it on? That is, did you build on your own Ubuntu system or on one of the HPC nodes?

  3. Can you please repeat the second step of the build: first do make clean then

make -j VERBOSE=1 |& tee build.log

and upload the build.log here.

Thanks.

Thank you very much!

  1. I tried gcc 9.2.0 on the HPC node (it is the latest version on our school's HPC), and the configuration is successful, but when I use the command 'make -j', it reports an error:

/home/svu/e1454408/mfix-24.2.3/src/model/dmp_modules/compar_mod.f:18:12:

   18 |     USE mpi
      |            1
Fatal Error: Cannot read module file 'mpi.mod' opened at (1), because it was created by a different version of GNU Fortran
compilation terminated.

  2. Yes, in fact, I built the solver with cmake/3.20.6, gcc/7.3.0 and openmpi/4.0.0 on an HPC node, and the build was successful.

  3. I tried what you suggested and this is the log:

build.txt (560.1 KB)

By the way, the error still exists when I submit the case.

Thank you very much again!

  1. If you switch compiler versions you should do a "make clean" before building with the newer compiler.

  2. On the HPC nodes, what are the contents of cat /proc/cpuinfo (see the sketch below)? Is it possible that the node you built on is different from the worker nodes?
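If it helps, here is a rough sketch for comparing the login/build node with a worker node (the queue name is a placeholder; substitute one of your PBS queues):

# on the login/build node: CPU model and the vector-instruction flags it supports
grep -m1 'model name' /proc/cpuinfo
grep -m1 'flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'avx|sse4|fma' | sort -u

# the same check from a worker node, submitted as a tiny PBS job
echo 'hostname; grep -m1 "model name" /proc/cpuinfo' | qsub -q <your_queue>

A solver built with -march=native on a newer CPU will raise SIGILL as soon as it executes an instruction (e.g. AVX2 or FMA) that an older worker-node CPU does not implement.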

  1. Yes, in fact, I used gcc/9.2.0 to compile in a new directory; before I compiled, there were only the mfx file and usr_rates.f in the directory.

  2. This command lists information for all 40 processors; the full output is attached:

Cpuinfo.txt (51.8 KB)

At our school, we can only choose a hostname to connect to the HPC, so I am not very sure whether the node I compile on is the same as the worker nodes.

Thank you very much!

Ok, the mpi.mod incompatibility is due to the MPI module you have loaded on your HPC. Note that the MFiX Conda packages include mpi 5 and gfortran 14, so you don’t have to load any modules at all. Can you try building and running the solver without loading environment modules?
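Roughly, a minimal sketch of what I have in mind, assuming a conda/miniforge install with an MFiX environment (adjust the path and environment name to yours):

# start from a clean shell, with no HPC environment modules loaded
module purge
# activate the conda environment that provides MFiX
source ~/miniforge3/etc/profile.d/conda.sh
conda activate mfix-24.2.3
# sanity check: these should resolve inside the conda environment
which mpirun gfortran
# build the DMP solver in the project directory
build_mfixsolver --batch --dmp -j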

The “illegal instruction” looks like it’s coming from a difference between the build node and the run node… are you sure the host you are building on is the same type?

If all else fails, try this:

Edit the file $CONDA_PREFIX/share/mfix/src/model/CMakeLists.txt and remove (or comment out) this section:

if (APPLE)
  set (MARCH "")
else()
  if (DEFINED ENV{CONDA_BUILD})
     set (MARCH "-march=haswell")
  else()
     set (MARCH "-march=native")
  endif()
endif()
if (DEFINED ENV{MARCH})
  set (MARCH "-march=$ENV{MARCH}")
endif()
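Alternatively, since the last stanza above honors a MARCH environment variable, a less invasive option (a sketch, untested here) is to export a conservative target before building instead of editing the file; -march=x86-64 is gcc's generic 64-bit baseline:

# pick an instruction set that every node in the cluster supports
export MARCH=x86-64
build_mfixsolver --batch --dmp -j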

The HPC at our school will be undergoing maintenance over the next few days, so I will try next Monday.

In fact I am not very sure whether the build node and the run nodes are the same… Students just use a hostname to connect to the HPC, compile the solver, and submit the job script; the job script can specify queues, but not nodes.

Thank you for your reply!

Thank you for your attention to this issue!

This time, I did not attempt to build the solver directly from source, but built it in the conda environment without loading any modules. However, when submitting the script to run the solver, I had to load the openmpi module, otherwise the HPC cannot find the mpirun command.

Unfortunately, whether or not I edited the CMakeLists.txt as you suggested, after I submitted the script the HPC system always showed that my job was running, but no files appeared in the working directory. This continued until the system automatically killed the job when it reached the walltime.

I will also try to communicate with the staff responsible for HPC at the school about this issue. Thank you very much!

That’s surprising. There’s an MPI implementation included with the conda package, so if the mfix environment is active when you submit the job, the mpirun from the conda mfix package should be found… unless somehow your batch system is set up to not copy over the PATH environment variable that was in effect when the job was submitted.
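Two common ways to handle that in a PBS job script, as a sketch (the path and environment name are placeholders, adjust to your install):

# option 1: ask PBS to export the submission-time environment into the job
#PBS -V

# option 2: activate the conda environment explicitly inside the script
source /path/to/miniforge3/etc/profile.d/conda.sh   # placeholder path
conda activate mfix-24.2.3
which mpirun    # should point into the conda environment, not a system MPI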

I tried activating the mfix environment in the script submitted to the HPC, and it seems to work. However, it reports a new error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x2aec39be42ef in ???
#1 0x55608fc172e7 in __stiff_chem_tfm_MOD_mapmfixtoode_tfm
   at /home/svu/e1454408/miniforge3/envs/mfix-24.2.3/share/mfix/src/model/chem/stiff_chem_tfm.f90:384

(the same message and backtrace are printed by each MPI rank, with the lines interleaved)

The whole error log is attached below:
err1.txt (2.9 KB)

This case works well on my own Ubuntu machine, so this looks strange. Do you have any suggestions?

Thank you very much!

This is interesting. Can you upload the updated project files so I can take a look? What is the domain decomposition you are using? The "Reaction.mfx" at the beginning of the thread has nodesi=nodesj=nodesk=1.

OK, in fact, the mfx file is the same as before:

Reaction.mfx (16.8 KB)

but the script I submit to the HPC is below:

#!/bin/bash
#PBS -N MFiX_test
#PBS -q parallel24
#PBS -l select=1:ncpus=24:mpiprocs=24:mem=128GB
#PBS -l walltime=23:00:00
#PBS -o /home/svu/e1454408/MFiX_work/Test_CH4/out.log
#PBS -e /home/svu/e1454408/MFiX_work/Test_CH4/err.log

cd $PBS_O_WORKDIR; ## this line is needed, do not delete and change.
np=$( cat ${PBS_NODEFILE} |wc -l ); ### get number of CPUs, do not change

cd /home/svu/e1454408/MFiX_work/Test_CH4
source /home/svu/e1454408/miniforge3/etc/profile.d/conda.sh
conda activate mfix-24.2.3
mpirun -np 24 ./mfixsolver_dmp -f Reaction.mfx NODESI=6 NODESJ=4 NODESK=1

I used NODESI=6 NODESJ=4 NODESK=1 in the script; maybe I need to set them in the mfx file instead?

In addition, if I set nodesi, nodesj and nodesk in the mfx file, the error still occurs.
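For reference, what I mean by setting the decomposition in the mfx file is just the keyword lines below (a sketch following the same keyword = value style as the rest of the file); as I understand it, the product nodesi*nodesj*nodesk has to equal the number of MPI ranks, so 6*4*1 = 24 matches mpirun -np 24 in the script:

nodesi = 6
nodesj = 4
nodesk = 1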

I ran this on our HPC cluster with the 6x4x1 domain decomposition and did not get any errors. How did you build the solver module?

As you mentioned, I built the solver without loading any modules: I just activated the mfix environment and built the solver with the command 'build_mfixsolver --batch --dmp -j' on the command line.

In fact, the solver successfully ran for some time steps and reported the error at t = 0.00216 s. By the way, I tried another (paid) HPC and everything went well, so I think the key issue is our school's HPC. I will discuss this problem with the staff responsible for the HPC at our school.

Thank you very much!