Recently, I wanted to submit my case to the HPC at my school, so I first decided to run a simple test case on the HPC. It is a very simple case of CH4 combustion:
In detail, I built the solver from source on the HPC using the commands below:
    cmake … -DENABLE_MPI=1 -DMPI_Fortran_COMPILER=mpifort -DCMAKE_Fortran_FLAGS="-O2 -g"
    make -j
I built it with cmake/3.20.6, gcc/7.3.0, and openmpi/4.0.0. The solver build succeeded. However, when I submit the case, it returns an error. Our school's HPC uses the PBS batch system, and I am not sure whether my submit script even runs. The HPC's sample MPI script recommends xe_2015, but I found that xe_2015 could not configure the solver. The sample script and my submit script are attached.
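Roughly, my submit script follows the usual PBS pattern (the job name, queue, resource request, and module versions here are placeholders rather than the exact values):

    #!/bin/bash
    #PBS -N ch4_test                     # job name (placeholder)
    #PBS -q parallel                     # queue name is site-specific (placeholder)
    #PBS -l select=1:ncpus=8:mpiprocs=8  # PBS Pro syntax; Torque-style sites use -l nodes=1:ppn=8
    #PBS -l walltime=02:00:00

    cd "$PBS_O_WORKDIR"                  # run in the directory the job was submitted from

    # load the same toolchain the solver was built with (placeholders)
    module load gcc/7.3.0 openmpi/4.0.0

    mpirun -np 8 ./mfixsolver -f Reaction.mfx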
gcc 7.3.0 is pretty old - is this the most recent compiler available on your HPC?
Did you build the solver on the same type of machine as you are running it on? That is, did you build on your own Ubuntu system or on one of the HPC nodes?
Can you please repeat the second step of the build: first do make clean, then run make -j again?
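That is, something like this (keeping a copy of the output makes it easier to share the full error):

    make clean
    make -j 2>&1 | tee build.log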
I tried gcc 9.2.0 on the HPC node (it is the latest version on our school's HPC), and the configuration succeeded, but when I ran 'make -j', it reported this error:
        18 | USE mpi
           |     1
    Fatal Error: Cannot read module file 'mpi.mod' opened at (1), because it was created by a different version of GNU Fortran
    compilation terminated.
Yes, in fact, I built the solver with cmake/3.20.6, gcc/7.3.0, and openmpi/4.0.0 on a node of the HPC, and that build was successful.
Ok, the mpi.mod incompatibility is due to the MPI module you have loaded on your HPC. Note that the MFiX Conda packages include mpi 5 and gfortran 14, so you don’t have to load any modules at all. Can you try building and running the solver without loading environment modules?
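For example, something along these lines (the environment name is just an example; use whatever your MFiX conda environment is called):

    module purge                      # make sure no site compiler/MPI modules are loaded
    conda activate mfix-24.2.3        # example environment name
    which mpirun build_mfixsolver     # both should resolve inside the conda environment
    build_mfixsolver --batch --dmp -j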
The “illegal instruction” looks like it’s coming from a difference between the build node and the run node… are you sure the host you are building on is the same type?
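One quick way to compare them is to check the CPU model on the login/build node and inside a short test job (qsub can read a job script from stdin):

    # on the login/build node
    lscpu | grep 'Model name'

    # run the same command on a compute node via a short throwaway job
    echo 'lscpu | grep "Model name"' | qsub -l walltime=00:05:00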
If all else fails, try this:
Edit the file $CONDA_PREFIX/share/mfix/src/model/CMakeLists.txt and remove (or comment out) this section:
    if (APPLE)
        set (MARCH "")
    else()
        if (DEFINED ENV{CONDA_BUILD})
            set (MARCH "-march=haswell")
        else()
            set (MARCH "-march=native")
        endif()
    endif()

    if (DEFINED ENV{MARCH})
        set (MARCH "-march=$ENV{MARCH}")
    endif()
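Note that the last three lines honor a MARCH environment variable, so instead of editing the file you could also try overriding the architecture with a conservative target at build time, for example:

    # build for generic x86-64 instead of -march=native / -march=haswell
    MARCH=x86-64 build_mfixsolver --batch --dmp -j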
Our school's HPC will be undergoing maintenance over the next few days, so I will try next Monday.
In fact, I am not very sure whether the build node and the run nodes are the same… Students just use a hostname to connect to the HPC, compile the solver, and submit the job script; the job script can specify queues, but not nodes.
This time, I did not attempt to build the solver directly from the source code, but instead built it from the conda environment without loading any modules. However, when submitting the script to run the solver, I had to load the openmpi module, otherwise the HPC could not find the mpirun command.
Unfortunately, regardless of whether I edited CMakeLists.txt as you suggested, after I submitted the script to run the solver the HPC system always showed my task as running, but no files appeared in the working directory. This continued until the system automatically killed the task when it reached the walltime.
I will also try to communicate with the staff responsible for HPC at the school about this issue. Thank you very much!
That’s surprising. There’s an MPI implementation included with the conda package, so if the mfix environment is active when you submit the job, the mpirun from the conda mfix package should be found… unless somehow your batch system is set up not to copy over the PATH environment variable that was in effect when the job was submitted.
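If PATH is not carried over, you can activate the environment explicitly inside the job script, roughly like this (the miniforge path and environment name are assumptions; adjust them to your installation):

    # inside the PBS job script, before calling mpirun
    source "$HOME/miniforge3/etc/profile.d/conda.sh"   # assumed install location
    conda activate mfix-24.2.3                         # assumed environment name
    which mpirun    # should now point into the conda environment, not a system module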
I tried activating the mfix environment in the script submitted to the HPC, and it seems to work. However, it reports a new error:
    Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

    Backtrace for this error:

    Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

    Backtrace for this error:
    #0  0x2aec39be42ef in ???
    #0  0x2ac3efafc2ef in ???
    #1  0x55608fc172e7 in __stiff_chem_tfm_MOD_mapmfixtoode_tfm
        at /home/svu/e1454408/miniforge3/envs/mfix-24.2.3/share/mfix/src/model/chem/stiff_chem_tfm.f90:384
    #1  0x55f2cf9742e2 in __stiff_chem_tfm_MOD_mapmfixtoode_tfm
    ...
This is interesting. Can you upload the updated project files so I can take a look? What domain decomposition are you using? The "Reaction.mfx" at the beginning of the thread has nodesi=nodesj=nodesk=1.
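For reference, the DMP decomposition keywords and the number of MPI ranks have to be consistent; the 2x2x1 split below is only an illustration, not taken from your case:

    # if the project file sets nodesi=2, nodesj=2, nodesk=1 (illustrative values),
    # the number of MPI ranks must equal nodesi * nodesj * nodesk = 4:
    mpirun -np 4 ./mfixsolver -f Reaction.mfx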
As you mentioned, I built the solver without loading any modules: I just activated the mfix environment and built the solver with the command 'build_mfixsolver --batch --dmp -j' on the command line.
In fact, the solver successfully ran for some time steps and reported the error at t = 0.00216 s. By the way, I tried another (paid) HPC and everything went well, so I think the key issue is our school's HPC. I will communicate with the staff at our school about this problem.