MPI - bus error: nonexistent physical address

kjetilbmoe · November 22, 2021, 12:09pm

Hi,
Having compiled the code with mpiifort, I ran a project for about 300 CPU hours before it ran into some sort of problem. There were no other signs of divergence or small time steps before this. Do you have a clue as to what might have gone wrong? This was a 2D TFM distributed over 16 cores.

cgw · November 22, 2021, 3:34pm

@kjetilbmoe -

I’ve never seen this error before, and I expect that it’s going to be difficult to debug, since it took so long to happen.

Can you post your project files? (“Submit bug report” makes a .zip file you can post here.)

Thanks,

– Charles

cgw · November 22, 2021, 3:37pm

You could try decreasing MAX_DT, but that’s really just a shot in the dark.

cgw · November 22, 2021, 3:48pm

Since the error is happening in the “BiCGS” code, you could try using a different leq_method as a workaround.

cgw · November 22, 2021, 4:52pm

@kjetil - did you get a core dump? It would be interesting to know what the value of IJK was when the program failed.

kjetilbmoe · November 22, 2021, 5:49pm

Yes, I did get one of 5 GB. But I don’t really know how to handle these. I’m sure it can be provided somehow if you would like to have a look.

kjetilbmoe · November 22, 2021, 5:51pm

I am not able to post a bug report from the actual system, since I’m running this only from the command line. But I could do this from the GUI on a different system, if that’s better? Or I could try to zip some essential files if I’m told which to include.

cgw · November 22, 2021, 5:57pm

I can’t look at the core file for you. For one thing, debugging like this requires a matching setup (hardware, libraries, etc), and furthermore we don’t have the resources to provide support at this level.

But I can give you some tips on using gdb:

Log into the node where the core dump happened (if possible). Go to the directory where the core dump sits.
First, issue the command “file core”. This should print something like “Core file from executable </full/path/to/mfixsolver>”.

Then do:

gdb /full/path/to/mfixsolver core

After GDB starts up, try commands like “where” and “backtrace”.
“up” and “down” let you navigate stack frames. If you can get to the stack frame where the error happened, do “print ijk” (at the (gdb) prompt).

“info” and “info locals” are also good GDB commands to know.
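
The same inspection can also be run non-interactively, which is handy on a cluster node. A minimal sketch, assuming gdb is on the PATH and using its -batch and -ex options; the helper name gdb_backtrace_command is hypothetical, and the paths are the same placeholders as above:

```python
import shlex


def gdb_backtrace_command(executable, core, extra_commands=()):
    """Build an argv list that runs gdb in batch mode on a core file.

    -batch makes gdb exit once the commands have run; each -ex supplies
    one gdb command, executed in the order given.
    """
    argv = ["gdb", "-batch"]
    for command in ("backtrace", *extra_commands):
        argv += ["-ex", command]
    argv += [executable, core]
    return argv


# Placeholder paths: substitute the real solver path and core file name.
argv = gdb_backtrace_command(
    "/full/path/to/mfixsolver",
    "core",
    extra_commands=["info locals", "print ijk"],
)

# Print a copy-pasteable shell command. To run it directly instead,
# pass argv to subprocess.run(argv).
print(shlex.join(argv))
```

Note that “print ijk” only works if the frame gdb selects has a variable of that name; otherwise navigate with “up” and “down” interactively first.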

I’m sure you can find some gdb tutorials on the web if you want to go farther.

Let me know if you find anything interesting, or have any questions.

– Charles

kjetilbmoe · November 26, 2021, 11:55am

Thank you for the instructions. If I’m doing this right, gdb opens up at the frame where the actual error happened, so I don’t have to move up or down. When executing print ijk, the output is

251                             A_M(IJK,:) = A_M(IJK,:)*OAM
(gdb) print ijk
$1 = 0
(gdb) print i
$2 = 107
(gdb) print k
$3 = 1
(gdb) print j
$4 = 1
(gdb)

Edit: turns out the OAM has value 0.

cgw · November 29, 2021, 9:22pm

Hi Kjetil. Thanks for the debugging.

Multiplying by 0 (OAM) is fine; this should not generate an error. But the 0 value of ijk looks suspicious to me: as far as I understand, everything in MFiX is 1-based and 0 is never a valid index. It would be good to know if 107, 1, 1 are valid values for i, j, and k.
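
To illustrate why a value of 0 looks wrong, here is a simplified sketch of a 1-based linear index. This is not MFiX's actual mapping, which is more involved and not reproduced here; the function name linear_index and the grid sizes are made up for the example:

```python
def linear_index(i, j, k, imax, jmax, kmax):
    """Map 1-based (i, j, k) to a 1-based linear index, with i varying fastest.

    Hypothetical helper for illustration only; not taken from MFiX.
    """
    if not (1 <= i <= imax and 1 <= j <= jmax and 1 <= k <= kmax):
        raise ValueError(f"({i}, {j}, {k}) is outside the grid")
    return i + (j - 1) * imax + (k - 1) * imax * jmax


# Hypothetical grid sizes, chosen only so that i = 107 is in range.
imax, jmax, kmax = 200, 300, 1

ijk = linear_index(107, 1, 1, imax, jmax, kmax)
smallest = linear_index(1, 1, 1, imax, jmax, kmax)

print(ijk)       # index for the (i, j, k) values seen in the debugger
print(smallest)  # smallest index this mapping can ever produce
```

With a 1-based scheme like this, the smallest possible result is 1, so no valid (i, j, k) can yield 0.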

@jeff.dietiker can you comment on this?

– Charles

jeff.dietiker · November 29, 2021, 11:03pm

It is hard to tell what is wrong, definitely something I have never seen. What I can tell is I=107, J=1 and K=1 cannot translate to IJK=0, and if it did it should not occur after 300 hours (the mapping is computed at the beginning of the run).
The OAM should also never be zero since OAM=1/aijmax.
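
A small sketch of the row scaling on the line from the debugger output, assuming aijmax is the largest absolute coefficient in the row (an assumption; the surrounding solver code is not shown in this thread). It also shows one way a reciprocal can come out as exactly zero in floating point, namely an infinite denominator. Whether that is what happened here is not established:

```python
import math


def scale_row(row):
    """Scale a row of coefficients by the reciprocal of its largest magnitude.

    Hypothetical stand-in for A_M(IJK,:) = A_M(IJK,:)*OAM with OAM = 1/aijmax.
    """
    aijmax = max(abs(a) for a in row)
    oam = 1.0 / aijmax
    return [a * oam for a in row], oam


# Ordinary coefficients: OAM is a nonzero scale factor.
scaled, oam = scale_row([-4.0, 1.0, 1.0, 2.0])
print(scaled, oam)

# A row containing an infinite coefficient: the reciprocal is zero.
_, oam_bad = scale_row([-4.0, math.inf, 1.0])
print(oam_bad)
```

If OAM really was 0 at the time of the crash, checking the matrix coefficients for that cell for Inf or NaN values would be a natural next step.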

Did you do a restart at any point in time? What kind of geometry are you using?

If you can attach your .mfx file and any UDFs, we can see if we can reproduce it. If the .RES file is not too large, you can attach that too (zip all files together).

kjetilbmoe · December 8, 2021, 10:41pm

Sorry for the delayed response.
I did restart after 4 seconds, changing numerics. (Note that the 300 hours are CPU hours; the simulation stops after 33 seconds wall clock time.)

The geometry is a simple cylinder, acting as a bubbling fluidized bed with 4 solid phases.

I will prepare files to upload when I get to work in the morning.

kjetilbmoe · December 10, 2021, 1:15pm

Quick update, was able to restart it with different numerics (changed from SMART to Superbee).

The .res file is 66 MB, so perhaps too large to put here?

cgw · December 10, 2021, 5:02pm

Is the crash with discretize=3 (SMART) reproducible? If it is, we have a much better chance of fixing this.

Also I note that the doc for the discretize keyword says:

Valid values
• 0 First‑order upwinding.
• 1 First‑order upwinding (using down‑wind factors).
• 2 Superbee (recommended method).
• 3 Smart.
• 4 ULTRA‑QUICK.
• 5 QUICKEST (does not work).
• 6 MUSCL.
• 7 van Leer.
• 8 minmod.
• 9 Central (often unstable; useful for testing).

It’s noteworthy that Superbee is the recommended method, but the default value is 0 (first-order upwind).
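
For anyone scripting around project files, the values quoted above can be kept in a small lookup table. A sketch only: the names DISCRETIZE_SCHEMES and describe_discretize are hypothetical, and the entries are copied from the doc excerpt in this post:

```python
# Scheme codes for the discretize keyword, as listed in the quoted doc.
DISCRETIZE_SCHEMES = {
    0: "First-order upwinding",
    1: "First-order upwinding (using down-wind factors)",
    2: "Superbee (recommended method)",
    3: "Smart",
    4: "ULTRA-QUICK",
    5: "QUICKEST (does not work)",
    6: "MUSCL",
    7: "van Leer",
    8: "minmod",
    9: "Central (often unstable; useful for testing)",
}


def describe_discretize(value):
    """Return a readable label for a discretize code, or raise if unknown."""
    if value not in DISCRETIZE_SCHEMES:
        raise ValueError(f"unknown discretize value: {value}")
    return f"discretize={value}: {DISCRETIZE_SCHEMES[value]}"


for code in (0, 2, 3):
    print(describe_discretize(code))
```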

– Charles

kjetilbmoe · December 12, 2021, 10:38pm

I realized that I gave a seemingly positive response after switching from Smart to Superbee, but that was actually on a different project than the one I started this thread with (a similar, but 3D, project). So I had to test the actual one with the changed numerics, and fortunately it worked there as well.

I originally had a look at the FLD04 case in the validation manual (3.4. FLD04: Gresho vortex problem, MFiX Third Edition documentation), which inspired me to use the Smart numerics. I then had some luck using this from the very start on some simulations, instead of running an initial case with, say, first-order upwind. But after this recent experience I will probably avoid using it.

I haven’t put much extra effort into reproducing the fault, but to me it seems that changing the numerics is the one thing that made it continue as normal.