DEM case runs on CPU but fails on GPU

I am also running MFIX-Exa DEM cases. One runs without problems on CPU, but fails on GPU.
inputs.dat (62.2 KB)
Once the GPU case started, it got stuck at

I don’t see how the provided setup could run. For starters, the ‘side inlet’ in the geometry is open so it should be a mass inflow – not an ‘eb’ boundary condition – and therefore it cannot be used with DEM particles.

As I always suggest, start simple and add complexity to the setup in a systematic approach. You will never figure out what’s wrong if you throw everything into the initial setup and it doesn’t run.

Attached is your setup, simplified for non-reacting flow. I’ve tested this setup on both CPU and GPU.
dem-debug.tar (20 KB)

  • I made the domain a little taller (0.43 m → 0.48 m) so that the width, depth, and height are all nicely divisible into 24-cube grids. ‘Nice’ grids make it easier on the MLMG solver.
  • The side inlet – as defined in the original inputs with the provided CSG geometry – will not work for DEM. I papered over the inlet to get the case to run, but the geometry has to be modified if you want to bring DEM particles into the domain there.
  • You used a fairly stiff spring constant at 2.5K. I reduced it to 25, which works for as much as I’ve tested. You’ll have to live with long times-to-solution if you roll back to 2.5K, as the DEM time step is around 1.e-8 seconds.
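To see why the spring constant drives the time step, here is a rough estimate using the standard linear spring-dashpot scaling, dt ~ sqrt(m/k). The particle size, density, and safety factor below are illustrative assumptions, not values from this case:

```python
import math

def dem_collision_dt(mass_kg, k_n, safety=0.02):
    """Rough DEM time-step estimate for a linear spring-dashpot model:
    dt ~ safety * sqrt(m / k_n). The safety factor is illustrative."""
    return safety * math.sqrt(mass_kg / k_n)

# Hypothetical 100-micron sand particle, rho ~ 2650 kg/m^3
m = 2650.0 * (4.0 / 3.0) * math.pi * (50e-6) ** 3

dt_stiff = dem_collision_dt(m, 2500.0)  # original spring constant (2.5K)
dt_soft = dem_collision_dt(m, 25.0)     # reduced spring constant

# Softening the spring by 100x lengthens the allowable step by 10x.
print(f"{dt_soft / dt_stiff:.1f}")  # -> 10.0
```

With these made-up particle numbers, the stiff-spring estimate happens to land near 1.e-8 s, consistent with the time step quoted above; the point is the sqrt(1/k) scaling, not the exact values.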

The figure below shows a few different ‘grid’ decompositions. Note that “more” isn’t always “faster”, and again, a single grid on a single GPU is about 5x faster than the best performing CPU case.


Do DEM and PIC need different treatments for a side inlet? A closed-end (EB) tube for PIC, and an open tube for DEM? @jmusser

And one thing to double-check: the experimental setup is 0.43 m tall, and I need the correct height to match the residence time of both the particles and the gas. In that case, is there another way to make the geometry easier on the MLMG solver?

Correct.

  • PIC parcels can be added during a simulation through mass inflow (mi) and embedded boundary (eb) boundary conditions.
  • DEM particles can only be added to a simulation through an embedded boundary (EB) surface.

The reason is that PIC parcels are placed inside the domain, whereas DEM particles are placed outside the domain and pushed through the boundary over several DEM sub-steps. While entering, DEM particles are shuffled around, ignoring collision forces, until they have fully entered. This prevents excessive overlaps between existing and entering DEM particles.

For a little additional context,

  • mass inflow boundary conditions are defined on domain extents – the faces of the bounding box that defines the physical simulation space
  • embedded boundary inflows are areas of the system geometry’s surface where fluid and/or particle inflow is defined. We sometimes refer to this as flow-thru-eb to distinguish this from a mass inflow.
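For concreteness, here is a schematic inputs-style sketch of the two inflow styles. The region names and keys are illustrative assumptions, not copied from a working case — check the MFIX-Exa boundary-condition documentation for the exact syntax:

```
# Mass inflow (mi): defined on a face of the domain bounding box
bc.regions = bottom-inlet side-inlet
bc.bottom-inlet = mi
bc.bottom-inlet.fluid.velocity = 0.0  0.0  1.0

# Flow-thru-eb: inflow through a patch of the EB geometry surface
bc.side-inlet = eb
bc.side-inlet.fluid.velocity = 1.0  0.0  0.0
```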

If we were to place entering DEM particles outside the domain in the same manner that we do for the EB, they would simply get deleted because AMReX currently does not allow particles to exist outside of the domain.


One approach that is not necessarily intuitive is adding more “dead space” around the geometry: increasing the domain width and depth can provide additional flexibility in defining grids that work well with the MLMG solver.

You can also play around with the cell size. However, MFIX-Exa currently requires the cells to be uniform (dx == dy == dz). There are plans to remove this limitation, but that work has not yet started.
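One way to sanity-check candidate cell counts before running: a dimension that divides evenly into 24-cube grids and can be halved several times is friendlier to a multigrid (MLMG) solver. The 0.005 m cell size implied below is a made-up example, not this case's resolution:

```python
def coarsenings(n):
    """How many times a grid dimension can be coarsened by a factor of 2.
    Multigrid solvers generally work best when each dimension can be
    halved several times."""
    count = 0
    while n % 2 == 0 and n > 2:
        n //= 2
        count += 1
    return count

# At a hypothetical dx = 0.005 m: a 0.43 m height needs 86 cells,
# while 0.48 m gives 96, which splits evenly into 24-cube grids
# and coarsens much further.
for n in (86, 96):
    print(n, n % 24 == 0, coarsenings(n))
```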


@jmusser I am building up the GPU case from your input file by adding settings one step at a time. Here I have two cases:

  1. If I have only one species each for the gas (N2), sand (Sand), and biomass (Ash), it runs. → inputs-1
  2. But when I add all the species into the inputs, it can’t go through. → inputs
    BTW, I used a different CSG file for the DEM case here (also for the CPU runs I mentioned).
    Please help me check it.
    files.tar (70 KB)

What does “it can’t go through” mean?

@jmusser
The out.log:
[out.log screenshot]

The error log:
[error-log screenshot]

Jumping from one species to 30 is not an incremental change.

Does the number of species cause this?
Should I increase the number of species gradually to figure out the cause?
Any suggestions on the possible cause? Thanks!

This has the hallmarks of a GPU out-of-memory issue. One way to test is to add a few species to the fluid and see what happens. If your setup runs, add a few more and test again. If you’ve added all the fluid species and it still runs, start adding species to the particles. If it is an issue of running out of memory, you’ll eventually find the tipping point where the simulation crashes.

Now let’s take a look at your inputs file. Your file has the following entry:

amrex.the_arena_init_size=7516192768

This tells AMReX how much GPU memory to allocate per MPI task. Specifically, your setup will utilize ~7500MB of GPU memory. If you’re running on one of Joule3’s NVIDIA H100 GPUs, that’s less than 10% of the GPU memory available!
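To see why that entry is limiting, a quick check of the arithmetic (the 80 GB H100 capacity is an assumption about the hardware; verify yours with nvidia-smi):

```python
# The arena size is specified in bytes.
arena_bytes = 7516192768
print(arena_bytes / 2**30)  # -> 7.0 (exactly 7 GiB)

# Fraction of an 80 GB H100 this pre-allocates per MPI task:
h100_bytes = 80e9
print(round(100 * arena_bytes / h100_bytes, 1))  # -> 9.4 (percent)
```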

You can remove this setting from your inputs file and let AMReX use the entire GPU if you are running one MPI task per GPU --which is what we strongly recommend. The AMReX docs contain a lot of information on memory management.

The inputs amrex.abort_on_out_of_gpu_memory = 1 is also handy when debugging suspected GPU memory issues. As you may guess, AMReX will abort if you run out of GPU memory.

If you’ve tried all that and you still run out of GPU memory, then you need to split the problem up and use several GPUs.


Thanks for the suggestions. I did find the tipping point where the simulation crashes, with
amrex.the_arena_init_size=7516192768 commented out.

I tried running the case with 4 GPUs, with all the species included and non-reacting (I didn’t include mfix_usr_reactions_rates_K.H). The simulation could run, but only with a very small number of sand particles.
In this case, there are two solids: sand and biomass. Sand has only one species (Sand), while biomass has 30 species. In the code, do the sand particles also store the biomass species information?

All particles track all solids.species as defined in the inputs.
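A back-of-envelope sketch of why that matters for memory. The base_reals count below is a made-up placeholder, not MFIX-Exa's actual particle layout; only the scaling with species count is the point:

```python
def particle_mib(n_particles, n_species, base_reals=20, bytes_per_real=8):
    """Very rough per-particle memory estimate: a fixed set of base reals
    (position, velocity, radius, density, ...) plus one real per tracked
    species. base_reals=20 is a guess, not MFIX-Exa's actual layout."""
    return n_particles * (base_reals + n_species) * bytes_per_real / 2**20

# Every particle carries all solids.species, so 620,848 sand particles
# pay for 31 species (1 sand + 30 biomass) even though sand uses one.
print(round(particle_mib(620_848, 31), 1))  # MiB with all species
print(round(particle_mib(620_848, 1), 1))   # MiB with a single species
```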

Please define what constitutes a very small number of sand particles.

I’m guessing it’s how you are decomposing the domain that is preventing you from making efficient use of the hardware. For example, if you divide the domain into 8 24-cube grids and run on 8 GPUs, you may not see much if any benefit because the majority of the particles will be located in the bottom one or two grids.

Here is the decomposition:
amr.n_cell = 24 24 168
amr.max_grid_size_x = 12
amr.max_grid_size_y = 12
amr.max_grid_size_z = 24

I tried 3 cases: ic.init-bed2.sand.volfrac = 0.001, 0.01, 0.1,
which correspond to 8,849; 64,424; and 620,848 sand particles.
The first two cases ran but stopped after 565 and 352 iterations; the third case failed at the first iteration.

The initial particles are distributed evenly in the lower half of the bed across multiple GPUs.
[figure: initial particle distribution]


[figure: gas volume fraction (epg)]

Stop, take a moment, and think about what you are doing.

Are you running this configuration on 28 GPUs? If the answer is ‘no’ --which is what I highly suspect-- then why are you decomposing the domain into 28 grids?

(24/12) * (24/12) * (168/24) = 2 * 2 * 7 = 28
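The grid count follows mechanically from amr.n_cell and amr.max_grid_size. A small helper (mirroring, not taken from, AMReX's domain-chopping logic) makes this easy to check:

```python
import math

def grid_count(n_cell, max_grid_size):
    """Number of grids produced when each dimension of n_cell is
    chopped into chunks of at most max_grid_size cells."""
    return math.prod(math.ceil(n / m) for n, m in zip(n_cell, max_grid_size))

# The decomposition above yields 28 grids:
print(grid_count((24, 24, 168), (12, 12, 24)))  # -> 28
# One hypothetical way to target 4 GPUs: keep x/y whole, split z in 4:
print(grid_count((24, 24, 168), (24, 24, 42)))  # -> 4
```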

MFIX-Exa can run with more than 1 MPI task per GPU, but rarely will you see a performance benefit. As already noted in this thread

running one MPI task per GPU [is] strongly recommended

It is apparent you are changing inputs (a.k.a., turning knobs) without understanding what they do or how they impact the setup. Spend some time reading MFIX-Exa’s and AMReX’s documentation, more so the latter than the former. Try to identify what it is you don’t understand, then ask a question. Otherwise, all this turns into is a very long, roundabout way of me, or someone else on the forum, setting up your case for you.


Actually, I was using it with 28 devices.

[screenshot]

We’ve identified a bug that is responsible for the issue you are seeing.

A few routines contain logic to use GPU shared memory --very fast, but limited, memory on the GPU-- when kernel storage requirements are small enough to fit. If the amount of memory needed exceeds what’s available in shared memory, the kernel falls back to the larger, slower global GPU memory.

This case runs without issue when you have only a few fluid species, because the limited shared memory is sufficient. Once you add enough fluid species, the fallback to global memory is tripped in the routine that computes fluid-particle transfer coefficients. Within that routine, a bug in the global-memory indexing causes the drag coefficient calculation to use junk values, which in turn causes particles to go bananas when the drag force is applied during the update.

The bug fix was merged into the develop branch this morning.

Great news and thanks for your help. I will test it out.