Ascent crash: Floating point exception

I am reaching out for some guidance regarding a simulation that has consistently stalled at a specific point in time. Based on the output, I was able to trace the issue to the Ascent output. I have attached the associated stderr file and Ascent YAML.

ascent_actions.yaml (3.3 KB)
25542.stderr.txt (2.2 KB)

Thank you in advance for your help!

So it’s not happening on the first ascent output? It’s running for a period of time going through multiple calls to ascent and then dying?

Can you share the inputs that go with this case? If not, can you simplify the inputs to a sharable state that reproduces the error and include those?

Thanks, @wfullmer. This case is dying after multiple Ascent calls. Interestingly, it happens only with this case. I had a case with a different pic.pressure_coefficient that ran without issues.

I have attached the case files below.
geometry.csg (679 Bytes)
inputs.txt (52.1 KB)

I’m guessing you got this from @YupengXu?

You are correct. This setup was adapted from @YupengXu's setup.

Disclaimer

This work was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.


At what step did it crash? It crashed immediately for me on one GPU but it’s still running fine for me at step > 8000 and time > 2.5s on 72 CPUs.

The crash happened around 9.6s. Do you know why the GPU crashes and the CPU does not? Could it be because the amrex arena parameters were not set?

ah, yep. running w/ managed memory turned on. CPU run is at 5s.
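For reference, the managed-memory behavior mentioned above is controlled through AMReX runtime parameters in the inputs file. A minimal sketch (standard AMReX arena parameters, not taken from the attached case; defaults can vary by build):

```
# AMReX arena settings (inputs file) -- sketch
amrex.the_arena_is_managed = 1   # allocate from CUDA managed (unified) memory
amrex.the_arena_init_size  = 0   # initial arena size in bytes (0 = library default)
```

With managed memory off, a GPU run can fail where the equivalent CPU run does not, since host-side accesses to device allocations are no longer transparently paged.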


The CPU run died from a different Ascent error at 9.06s. Interesting that it printed scene1 but not scene2 … because scene2 is just a subset of scene1. And then when I try to restart to see if I get the same failure, I immediately run into Erroneous arithmetic operation. That’s annoying.


It’s always helpful to include backtraces when an FPE or segfault occurs. In this case the FPE is due to an initialization problem (NaN value) and a fix is on the way.
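For anyone hitting this later: AMReX can trap FPEs and write backtrace files at run time. A sketch of the relevant inputs (these are standard AMReX runtime parameters; check your build for defaults):

```
# Trap floating point exceptions and emit Backtrace.<rank> files
amrex.fpe_trap_invalid  = 1   # trap invalid operations (NaN-producing)
amrex.fpe_trap_zero     = 1   # trap division by zero
amrex.fpe_trap_overflow = 1   # trap overflow
amrex.signal_handling   = 1   # install AMReX signal handlers for backtraces
```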


Thank you @wfullmer and @cgw

To clarify, we have identified the reason an FPE is raised on restart. It is linked to a recent change in variable initialization, which in turn exposed a previously unknown issue. We are currently evaluating the best approach for addressing this moving forward.

The separate issue involving the simulation failing during a call to Ascent is still under active investigation. At this time, we do not yet have a confirmed cause or solution.

@oyedejifemi

I ran this on Joule with 128 MPI processes (CPU) and ran to completion (10s) with no Ascent errors.

Note that I am using newer versions of all the dependent libraries (Ascent, vtk-m which is now Viskores, etc).

ascent      v0.9.5-20-gf67dc0b8 (git HEAD)
Catch2      v3.13.0-ccc49ba6 (git HEAD)
cgal        v6.2-058401a6088 (git HEAD)
conduit     v0.9.5-85-gd5169bbb (git HEAD)
gmp         v6.2.1
hypre       v3.1.0-12-g7e403d172 (git HEAD)
mpfr        v4.1.0
PEGTL       3.2.4-229-g54a2e32b (git HEAD)
viskores    v1.1.0-107-ga804ed143 (git HEAD)

You may get better results with a newer set of libraries. Bugs are constantly being fixed upstream.

Attached are the scripts I used to build all the deps and mfix-exa:

build_deps.sh (2.7 KB)
build_exa.sh (639 Bytes)

– Charles

Thanks, @cgw. I will update my libraries and test the case.

Ooops! I had inputs.txt and geometry.csg in the test dir but did not have ascent_actions.yaml. (If an actions file is named in the input file but is not present, this is not flagged as an error; perhaps it should be.)

I’m repeating my test with the file in place, will let you know what happens.
Sorry for the confusion.

– Charles


So, I have to retract everything I said about Ascent libraries, etc. (wipes egg off face)

I’m getting a crash with this case, using both 26.03 and the current (unreleased) Git version of mfix. In both cases:

a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 39 with PID 3008810 on node n0419 exited on signal 9 (Killed).
--------------------------------------------------------------------------

and, unfortunately, there is no backtrace.

Happened at similar but not identical times:

   Step 10944: from old_time 3.559774841 to new time 3.560129431 with dt = 0.0003545901963

   Step 10888: from old_time 3.539728626 to new time 3.540076954 with dt = 0.000348327814

The investigation continues…

OK, I’ve made a bit more progress on this. My last failure was simply due to OOM (Ascent is very memory-hungry). Using fewer cores/node I’m past that problem and have reproduced the crash Femi reported. Unfortunately it doesn’t happen on every run, it seems to happen about 50% of the time.

I rebuilt MFIX-Exa and all the libs with -DCMAKE_BUILD_TYPE=Debug and although the job runs extremely slowly, I captured two crashes. With all the assertions turned on, the FPE turns into an Abort. I think the FPE might have been due to reading junk data.

Here’s the failing assert:

Sources/viskores/viskores/internal/ArrayPortalBasic.h:73: viskores::internal::ArrayPortalBasicRead<T>::ValueType viskores::internal::ArrayPortalBasicRead<T>::Get(viskores::Id) const [with T = viskores::Vec<double, 3>; ValueType = viskores::Vec<double, 3>; viskores::Id = int]: 
Assertion `index < this->NumberOfValues' failed.
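To illustrate what that assertion guards, here is a hypothetical Python analog (not Viskores code): in a release build the unchecked out-of-range read returns whatever bytes sit past the end of the array, which can look like junk or NaN and trigger an FPE much later; in a debug build the bounds assertion fires immediately at the bad `Get`.

```python
# Hypothetical analog of a read-only array portal with a debug bounds check.
class ArrayPortalRead:
    def __init__(self, values):
        self._values = list(values)
        self.num_values = len(self._values)

    def get(self, index, debug=True):
        if debug:
            # Debug build: fail fast, like `index < this->NumberOfValues`
            assert index < self.num_values, "index < NumberOfValues failed"
        # In C++ an out-of-range read here silently returns junk memory;
        # Python raises IndexError instead, so this line only stands in for it.
        return self._values[index]

portal = ArrayPortalRead([1.0, 2.0, 3.0])
portal.get(2)       # in range: fine
# portal.get(3)     # out of range: AssertionError in "debug" mode
```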

And here’s the full error log:
err.txt (254.5 KB)

I opened a bug report on the Viskores GitHub.

I have some suspicions about what’s going on here. It might have to do with particles on a cell boundary, but I’m not sure. Continuing to test.


@oyedejifemi

There’s also something very odd about the png files Ascent is creating.

I’m seeing this message in the output log:

s2/p2 pseudocolor plot yielded no data, i.e., no cells remain

which is the same message Yupeng reported in a different case.

It seems to be an issue with the slice filter. With 24 cells in y and max_grid_size = 6, you get 4 box divisions in y. The domain is y = -0.05 to 0.05, so the box boundaries in y fall at -0.05, -0.025, 0.0, 0.025, 0.05. The slice at y=0 sits exactly on a box boundary — so some boxes include it, some don’t, and you get that patchy checkerboard.

This is an Ascent/Viskores issue with how the slice filter handles the AMReX box decomposition. A few things that could help:

  • Larger max_grid_size so fewer, bigger boxes means fewer misses
  • Making sure the slice position is centered on the domain rather than near a box boundary
  • Using a volume render or a different filter instead of a planar slice
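The box arithmetic above can be checked with a short sketch (numbers taken from this case; the decomposition logic is a simplification of what AMReX actually does):

```python
# y-direction decomposition: 24 cells, max_grid_size = 6 -> 4 boxes in y
ncells, max_grid_size = 24, 6
y_lo, y_hi = -0.05, 0.05

dy = (y_hi - y_lo) / ncells
nboxes = ncells // max_grid_size

# Physical y-coordinates of the box boundaries:
# -0.05, -0.025, 0.0, 0.025, 0.05 (up to floating-point rounding)
boundaries = [y_lo + i * max_grid_size * dy for i in range(nboxes + 1)]

# The y = 0 slice lands exactly on an interior box boundary
slice_y = 0.0
print(any(abs(b - slice_y) < 1e-12 for b in boundaries))   # True
```

Nudging the slice off the boundary (e.g. y = dy/2) or choosing max_grid_size so no interior boundary falls at the slice position avoids the ambiguity.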

I’m not yet sure whether this is the cause of the crash, or a separate issue.

That’s not what my images look like (see above). There’s nothing inherently wrong with that error message; it just means there’s nothing to display for some plot. For example, you’re visualizing particles but you start with no particles and they all flow in (as in one current post on here), or you’re doing contours of some turbulent iso-surface and you have to wait for it to develop (e.g., jordan/jeff’s amr post). I will say, though, that they all come out at the end (at least for a release build), so if you’re looking to catch that warning on the first print, no dice.