Ascent crash: Floating point exception

I am reaching out for some guidance regarding a simulation that has consistently stalled at a specific point in time. Based on the output, I was able to trace the issue to the Ascent output. I have attached the associated stderr file and Ascent YAML.

ascent_actions.yaml (3.3 KB)
25542.stderr.txt (2.2 KB)

Thank you in advance for your help!

So it’s not happening on the first ascent output? It’s running for a period of time going through multiple calls to ascent and then dying?

Can you share the inputs that go with this case? If not, can you simplify the inputs to a sharable state that reproduces the error and include those?

Thanks, @wfullmer. This case is dying after multiple Ascent calls. Interestingly, it happens only with this case. I had a case with a different pic.pressure_coefficient that ran without issues.

I have attached the case files below.
geometry.csg (679 Bytes)
inputs.txt (52.1 KB)

I’m guessing you got this from @YupengXu?

You are correct. This setup was adapted from @YupengXu's setup.

Disclaimer

This work was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.


At what step did it crash? It crashed immediately for me on one GPU but it’s still running fine for me at step > 8000 and time > 2.5s on 72 CPUs.

The crash happened around 9.6s. Do you know why the GPU crashes and the CPU does not? Could it be because the amrex arena parameters were not set?

ah, yep. running w/ managed memory turned on. CPU run is at 5s.
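For reference, the managed-memory behavior mentioned above is controlled through AMReX runtime parameters in the inputs file. A minimal sketch (standard AMReX arena parameters, not taken from the attached case; defaults can vary by build):

```
# AMReX arena settings (inputs file) -- sketch
amrex.the_arena_is_managed = 1   # allocate from CUDA managed (unified) memory
amrex.the_arena_init_size  = 0   # initial arena size in bytes (0 = library default)
```

With managed memory off, a GPU run can fail where the equivalent CPU run does not, since host-side accesses to device allocations are no longer transparently paged.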


The CPU run died from a different Ascent error at 9.06s. Interesting that it printed scene1 but not scene2 … because scene2 is just a subset of scene1. And then when I try to restart to see if I get the same failure, I immediately run into Erroneous arithmetic operation. That’s annoying.


It’s always helpful to include backtraces when an FPE or segfault occurs. In this case the FPE is due to an initialization problem (NaN value) and a fix is on the way.
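For anyone hitting this later: AMReX can trap FPEs and write backtrace files at run time. A sketch of the relevant inputs (these are standard AMReX runtime parameters; check your build for defaults):

```
# Trap floating point exceptions and emit Backtrace.<rank> files
amrex.fpe_trap_invalid  = 1   # trap invalid operations (NaN-producing)
amrex.fpe_trap_zero     = 1   # trap division by zero
amrex.fpe_trap_overflow = 1   # trap overflow
amrex.signal_handling   = 1   # install AMReX signal handlers for backtraces
```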


Thank you @wfullmer and @cgw

To clarify, we have identified the reason an FPE is raised on restart. It is linked to a recent change in variable initialization, which in turn exposed a previously unknown issue. We are currently evaluating the best approach for addressing this moving forward.

The separate issue involving the simulation failing during a call to Ascent is still under active investigation. At this time, we do not yet have a confirmed cause or solution.

@oyedejifemi

I ran this on Joule with 128 MPI processes (CPU) and ran to completion (10s) with no Ascent errors.

Note that I am using newer versions of all the dependent libraries (Ascent, vtk-m which is now Viskores, etc).

ascent      v0.9.5-20-gf67dc0b8 (git HEAD)
Catch2      v3.13.0-ccc49ba6 (git HEAD)
cgal        v6.2-058401a6088 (git HEAD)
conduit     v0.9.5-85-gd5169bbb (git HEAD)
gmp         v6.2.1
hypre       v3.1.0-12-g7e403d172 (git HEAD)
mpfr        v4.1.0
PEGTL       3.2.4-229-g54a2e32b (git HEAD)
viskores    v1.1.0-107-ga804ed143 (git HEAD)

You may get better results with a newer set of libraries. Bugs are constantly being fixed upstream.

Attached are the scripts I used to build all the deps and mfix-exa:

build_deps.sh (2.7 KB)
build_exa.sh (639 Bytes)

– Charles

Thanks, @cgw. I will update my libraries and test the case.

Ooops! I had inputs.txt and geometry.csg in the test dir but did not have ascent_actions.yaml. (If an actions file is named in the input file but is not present, this is not flagged as an error; perhaps it should be.)

I’m repeating my test with the file in place, will let you know what happens.
Sorry for the confusion.

– Charles


So, I have to retract everything I said about Ascent libraries, etc. (wipes egg off face)

I’m getting a crash with this case, using both 26.03 and the current (unreleased) Git version of mfix. In both cases:

a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 39 with PID 3008810 on node n0419 exited on signal 9 (Killed).
--------------------------------------------------------------------------

and, unfortunately, there is no backtrace.

Happened at similar but not identical times:

   Step 10944: from old_time 3.559774841 to new time 3.560129431 with dt = 0.0003545901963

   Step 10888: from old_time 3.539728626 to new time 3.540076954 with dt = 0.000348327814

The investigation continues…

OK, I’ve made a bit more progress on this. My last failure was simply due to OOM (Ascent is very memory-hungry). Using fewer cores/node I’m past that problem and have reproduced the crash Femi reported. Unfortunately it doesn’t happen on every run, it seems to happen about 50% of the time.

I rebuilt MFIX-Exa and all the libs with -DCMAKE_BUILD_TYPE=Debug and although the job runs extremely slowly, I captured two crashes. With all the assertions turned on, the FPE turns into an Abort. I think the FPE might have been due to reading junk data.

Here’s the failing assert:

Sources/viskores/viskores/internal/ArrayPortalBasic.h:73: viskores::internal::ArrayPortalBasicRead<T>::ValueType viskores::internal::ArrayPortalBasicRead<T>::Get(viskores::Id) const [with T = viskores::Vec<double, 3>; ValueType = viskores::Vec<double, 3>; viskores::Id = int]: 
Assertion `index < this->NumberOfValues' failed.
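To illustrate what that assertion guards, here is a hypothetical Python analog (not Viskores code): in a release build the unchecked out-of-range read returns whatever bytes sit past the end of the array, which can look like junk or NaN and trigger an FPE much later; in a debug build the bounds assertion fires immediately at the bad `Get`.

```python
# Hypothetical analog of a read-only array portal with a debug bounds check.
class ArrayPortalRead:
    def __init__(self, values):
        self._values = list(values)
        self.num_values = len(self._values)

    def get(self, index, debug=True):
        if debug:
            # Debug build: fail fast, like `index < this->NumberOfValues`
            assert index < self.num_values, "index < NumberOfValues failed"
        # In C++ an out-of-range read here silently returns junk memory;
        # Python raises IndexError instead, so this line only stands in for it.
        return self._values[index]

portal = ArrayPortalRead([1.0, 2.0, 3.0])
portal.get(2)       # in range: fine
# portal.get(3)     # out of range: AssertionError in "debug" mode
```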

And here’s the full error log:
err.txt (254.5 KB)

I opened a bug report on the Viskores GitHub.

I have some suspicions about what’s going on here. It might have to do with particles on a cell boundary, but I’m not sure. Continuing to test.


@oyedejifemi

There’s also something very odd about the png files Ascent is creating.

I’m seeing this message in the output log:

s2/p2 pseudocolor plot yielded no data, i.e., no cells remain

which is the same message Yupeng reported in a different case.

It seems to be an issue with the slice filter. With 24 cells in y and max_grid_size = 6, you get 4 box divisions in y. The domain is y = -0.05 to 0.05, so the box boundaries in y fall at -0.05, -0.025, 0.0, 0.025, 0.05. The slice at y=0 sits exactly on a box boundary — so some boxes include it, some don’t, and you get that patchy checkerboard.

This is an Ascent/Viskores issue with how the slice filter handles the AMReX box decomposition. A few things that could help:

  • Larger max_grid_size so fewer, bigger boxes means fewer misses
  • Making sure the slice position is centered on the domain rather than near a box boundary
  • Using a volume render or a different filter instead of a planar slice
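The box arithmetic above can be checked with a short sketch (numbers taken from this case; the decomposition logic is a simplification of what AMReX actually does):

```python
# y-direction decomposition: 24 cells, max_grid_size = 6 -> 4 boxes in y
ncells, max_grid_size = 24, 6
y_lo, y_hi = -0.05, 0.05

dy = (y_hi - y_lo) / ncells
nboxes = ncells // max_grid_size

# Physical y-coordinates of the box boundaries:
# -0.05, -0.025, 0.0, 0.025, 0.05 (up to floating-point rounding)
boundaries = [y_lo + i * max_grid_size * dy for i in range(nboxes + 1)]

# The y = 0 slice lands exactly on an interior box boundary
slice_y = 0.0
print(any(abs(b - slice_y) < 1e-12 for b in boundaries))   # True
```

Nudging the slice off the boundary (e.g. y = dy/2) or choosing max_grid_size so no interior boundary falls at the slice position avoids the ambiguity.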

I’m not yet sure whether this is the cause of the crash, or a separate issue.

That’s not what my images look like (see above). There’s nothing inherently wrong with that error message; it just means there’s nothing to display for some plot. For example, you’re visualizing particles but you start with no particles and they all flow in (as in one current post on here), or you’re doing contours of some turbulent iso-surface and you have to wait for it to develop (e.g., jordan/jeff’s amr post). I will say, though, that they all come out at the end (at least for a release build), so if you’re looking to catch that warning on the first print, no dice.