MPI_ERR_TRUNCATE when re_indexing=.True.

hh1 · July 23, 2024, 10:54am

Hello,
on our cluster, we get the following error when enabling re_indexing=.True.

Total number of particles in the system: 257877
[base:3909464] *** An error occurred in MPI_Waitall
[base:3909464] *** reported by process [4279762945,3]
[base:3909464] *** on communicator MPI_COMM_WORLD
[base:3909464] *** MPI_ERR_TRUNCATE: message truncated
[base:3909464] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[base:3909464] *** and potentially your MPI job)
[base.xxx:3909457] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[base.xxx:3909457] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The error only appears when re_indexing is enabled; without it the simulation runs fine. However, we have 90% blocked cells, so we are hoping that re_indexing will speed up the simulation.
The error happens with MFIX 23.2 and 24.2, and with MPI versions 3.1.6, 4.1.0, 4.1.1, and 5.0.0 (5.0.0 is from conda; all others are self-compiled). The confusing thing is that the error does not appear on my student's computer in WSL Ubuntu, and it also works on my Debian desktop (although I get a float exception a bit later).

From searching the web, the common opinion is that MPI_ERR_TRUNCATE is caused by a bug in the program (a receive buffer that is allocated too small), but then why does it work on some computers?
Is there anything in the MPI or network config that could cause this error?
Any help is appreciated!
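
To make the "too small buffer" explanation concrete, here is a toy Python model of the rule as I understand it. It is not real MPI and not MFIX code; every name in it (post_recv, deliver, exchange, TruncateError) is hypothetical. A receive is posted with a fixed capacity, a shorter message is accepted, and a longer one fails. With non-blocking receives the failure is reported when the request is completed, which would fit the MPI_Waitall in the log above.

```python
# Toy model only (no MPI involved): all names here are hypothetical.

class TruncateError(Exception):
    """Stands in for MPI_ERR_TRUNCATE in this model."""


def post_recv(capacity):
    # The receiver fixes its buffer capacity before the message arrives.
    return {"capacity": capacity, "data": None}


def deliver(request, message):
    # A message shorter than the posted capacity is fine; a longer one is not.
    if len(message) > request["capacity"]:
        raise TruncateError(
            f"message of {len(message)} items exceeds "
            f"receive capacity {request['capacity']}"
        )
    request["data"] = list(message)
    return len(message)


def exchange(recv_capacity, n_items):
    # Same code path every time; only the sizes decide the outcome.
    request = post_recv(recv_capacity)
    try:
        return deliver(request, [0.0] * n_items)
    except TruncateError:
        return "truncated"


print(exchange(100, 80))   # message fits in the posted receive
print(exchange(100, 120))  # message is longer than the posted receive
```

In this model the same code either passes or fails depending only on how many items are sent relative to the posted capacity. If something similar happens in the re_indexing path (an assumption on my side, not verified), the outcome could depend on how the particles and cells are split between processes rather than on the machine itself.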

jeff.dietiker · August 6, 2024, 6:55pm

I am not sure. How many cores and what decomposition are you using?