I am reaching out for some guidance regarding a simulation that has consistently stalled at a specific point in time. Based on the output, I can trace the issue to the Ascent output. I have attached the associated stderr file and the Ascent yaml.
Thanks, @wfullmer. This case is dying after multiple Ascent calls. Interestingly, it happens only with this case: a case with a different pic.pressure_coefficient ran without issues.
The CPU run died from a different Ascent error at 9.06. Interesting that it printed scene1 but not scene2, since scene2 is just a subset of scene1. And then when I try to restart to see if I get the same failure, I immediately run into "Erroneous arithmetic operation". That's annoying.
It’s always helpful to include backtraces when an FPE or segfault occurs. In this case the FPE is due to an initialization problem (NaN value) and a fix is on the way.
To clarify, we have identified the reason an FPE is raised on restart. It is linked to a recent change in variable initialization, which in turn exposed a previously unknown issue. We are currently evaluating the best approach for addressing this moving forward.
The separate issue involving the simulation failing during a call to Ascent is still under active investigation. At this time, we do not yet have a confirmed cause or solution.
Oops! I had inputs.txt and geometry.csv in the test dir but did not have ascent_actions.yaml. (If an actions file is named in the input file but is not present, this is not flagged as an error; perhaps it should be.)
I’m repeating my test with the file in place, will let you know what happens.
Sorry for the confusion.
So, I have to retract everything I said about Ascent libraries, etc. (wipes egg off face)
I’m getting a crash with this case, using both 26.03 and the current Git version of mfix (unreleased).
In both cases:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 39 with PID 3008810 on node n0419 exited on signal 9 (Killed).
--------------------------------------------------------------------------
and, unfortunately, there is no backtrace.
Happened at similar but not identical times:
Step 10944: from old_time 3.559774841 to new time 3.560129431 with dt = 0.0003545901963
Step 10888: from old_time 3.539728626 to new time 3.540076954 with dt = 0.000348327814
OK, I’ve made a bit more progress on this. My last failure was simply due to OOM (Ascent is very memory-hungry). Using fewer cores per node, I’m past that problem and have reproduced the crash Femi reported. Unfortunately it doesn’t happen on every run; it seems to happen about 50% of the time.
I rebuilt MFIX-Exa and all the libs with -DCMAKE_BUILD_TYPE=Debug and although the job runs extremely slowly, I captured two crashes. With all the assertions turned on, the FPE turns into an Abort. I think the FPE might have been due to reading junk data.
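For catching the FPE deterministically without a full Debug rebuild, AMReX-based codes like MFIX-Exa usually accept the standard AMReX runtime flags (assuming they are wired up in this build; a sketch of inputs-file settings, not a verified fix):

```
# In the inputs file (or appended on the command line):
amrex.fpe_trap_invalid  = 1   # abort on NaN-producing (invalid) operations
amrex.fpe_trap_zero     = 1   # abort on divide-by-zero
amrex.fpe_trap_overflow = 1   # abort on overflow
amrex.signal_handling   = 1   # let AMReX print a backtrace on abort
```

With these set, the signal handler should write Backtrace.* files when the trap fires, which would pin down where the junk data is read.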
There’s also something very odd about the png files Ascent is creating:
I’m seeing this message in the output log:
s2/p2 pseudocolor plot yielded no data, i.e., no cells remain
which is the same message Yupeng reported in a different case.
It seems to be an issue with the slice filter. With 24 cells in y and max_grid_size = 6, you get 4 box divisions in y. The domain is y = -0.05 to 0.05, so the box boundaries in y fall at -0.05, -0.025, 0.0, 0.025, 0.05. The slice at y=0 sits exactly on a box boundary — so some boxes include it, some don’t, and you get that patchy checkerboard.
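The face arithmetic above can be checked with a quick sketch (plain Python, illustrative only; exact fractions are used so the y=0 comparison isn’t muddied by floating point):

```python
from fractions import Fraction

# Reproduce the y-direction decomposition described above:
# 24 cells across y in [-0.05, 0.05] with max_grid_size = 6 -> 4 boxes.
ncells, max_grid_size = 24, 6
y_lo, y_hi = Fraction(-1, 20), Fraction(1, 20)   # -0.05 .. 0.05, kept exact
dy = (y_hi - y_lo) / ncells

nboxes = ncells // max_grid_size
faces = [y_lo + i * max_grid_size * dy for i in range(nboxes + 1)]
print([float(f) for f in faces])   # [-0.05, -0.025, 0.0, 0.025, 0.05]

# The slice at y = 0 coincides with the interior face between boxes 1 and 2,
# so no box contains the plane strictly in its interior -- which box "owns"
# the plane then depends on how the filter treats box boundaries.
y_slice = Fraction(0)
owners = [i for i in range(nboxes) if faces[i] < y_slice < faces[i + 1]]
print(owners)                      # [] -- the plane lies on a face, not inside a box
```

The empty owner list is the checkerboard in miniature: each rank makes its own call about a plane that sits exactly on its box face.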
This is an Ascent/Viskores issue with how the slice filter handles the AMReX box decomposition. A few things that could help:
Using a larger max_grid_size, so fewer, bigger boxes mean fewer chances for the plane to land on a box face
Offsetting the slice position slightly so it does not sit exactly on a box boundary
Using a volume render or a different filter instead of a planar slice
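If the slice has to stay near mid-domain, nudging it off the box face by a fraction of a cell may be enough. A minimal sketch of what that could look like in the actions file (the pipeline/filter names here are hypothetical; with dy = 0.1/24 ≈ 0.00417, y = 0.001 is well inside a cell):

```yaml
-
  action: "add_pipelines"
  pipelines:
    pl1:
      f1:
        type: "slice"
        params:
          point:
            x: 0.0
            y: 0.001   # ~1/4 cell off the y=0 box face
            z: 0.0
          normal:
            x: 0.0
            y: 1.0
            z: 0.0
```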
I’m not yet sure whether this is the cause of the crash, or a separate issue.
That’s not what my images look like (see above). There’s nothing inherently wrong with that error message; it just means there’s nothing to display for some plot. For example, you’re visualizing particles but start with no particles and they all flow in (as in one current post on here), or you’re doing contours of some turbulent iso-surface and have to wait for it to develop (e.g., Jordan/Jeff’s AMR post). I will say, though, that they all come out at the end (at least for a release build), so if you’re looking to catch that warning on the first print, no dice.