I saw a LinkedIn post from Professor Bruno Blais this weekend using his higher-order FEM CFD-DEM code on a simple benchmark problem from Pepiot & Desjardins (2012). I'd actually never set this case up in MFIX-Exa and figured, let's do it. Basics are:
- dp = 0.21 mm
- ρp = 3300 kg/m³
- ρg = 1.13 kg/m³
- μg = 1.77 × 10⁻⁵ Pa·s
- domain: 0.075 × 0.225 × 0.075 m
- mesh: 192 × 576 × 192 (level 0, static; ≈21M cells)
- doubly periodic in x and z
- mi/po (mass inflow / pressure outflow) in y; Uin = 0.555 m/s
- ic: φ = 0.4 for y ≤ 0.075 m (≈30M particles)
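The derived numbers in the setup above (cell count, grid spacing relative to particle size) can be sanity-checked with a few lines of Python. This is just my own arithmetic on the quoted values, not anything produced by the solver:

```python
dp = 0.21e-3                        # particle diameter [m]
Lx, Ly, Lz = 0.075, 0.225, 0.075    # domain extents [m]
nx, ny, nz = 192, 576, 192          # level-0 mesh

cells = nx * ny * nz
dx = Lx / nx                        # uniform spacing (Ly/ny and Lz/nz match)

print(f"cells = {cells/1e6:.1f}M")  # prints "cells = 21.2M"
print(f"dx/dp = {dx/dp:.2f}")       # prints "dx/dp = 1.86"
```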
A couple of notes:
- One of the biggest parameters affecting the time to solution will be the DEM dt. It seems like we won't be able to get a one-to-one comparison here due to differences in the numerical method. Here, I've set (DEM) dt = 2 × 10⁻⁶ s with 32 subcycling steps, which gives a spring constant of kn = 19.3 N/m.
- Originally the domain had an aspect ratio of 2 in the reference paper. I followed Bruno’s setup w/ aspect ratio of 3. Extending to 4 might be even better at this high superficial velocity.
- Drag law is BVK2 (Tang et al. 2015).
- I wasn't sure what the collision parameters were supposed to be, so I guessed 0.9 for restitution, 0.3 for kinetic friction, and ignored rolling friction. The first two ended up being correct, but he used rolling friction with a coefficient of 0.2 and the EPSD model (I'm not sure what that is…).
- It looks like he reflected particles at the outlet to maintain inventory, so I did that too.
- Particles were seeded in hcp (hex close packed), which produced about 36.4M.
- The grid decomposition is 32 × 16 × 32 (cells per grid), which gives 1296 grids, 1:1 mapped to 1296 CPUs. He mentioned running on approx. 1000 cores, but we need nice powers of 2.
- I'm not diffusing solids information, just using the "standard" (to MFIX-Exa) 8-cell, tri-linear deposition.
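Two of the numbers in the notes above can be reproduced with a short sketch. For the spring constant I'm assuming the common linear spring-dashpot sizing, where the undamped collision time t_c = π √(m_eff/kn) is resolved over the DEM subcycles; that assumption is mine (the inputs file is the ground truth), but it reproduces the quoted kn:

```python
import math

# Particle mass and effective mass for an equal-size pair.
rho_p, dp = 3300.0, 0.21e-3
m = rho_p * math.pi * dp**3 / 6.0
m_eff = m / 2.0

# Assumption: the collision is resolved over the 32 subcycles of dt = 2e-6 s,
# i.e. t_c = 32 * dt, and kn follows from t_c = pi * sqrt(m_eff / kn).
dt_dem, nsub = 2.0e-6, 32
t_c = nsub * dt_dem
kn = m_eff * (math.pi / t_c)**2
print(f"kn = {kn:.1f} N/m")   # prints "kn = 19.3 N/m", matching the note above

# Box decomposition: 32x16x32-cell grids tile the 192x576x192 mesh
# into 6 * 36 * 6 boxes.
grids = (192 // 32) * (576 // 16) * (192 // 32)
print(grids)                  # prints 1296
```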
inputs_pdfb.txt (8.4 KB)
Here's the inputs. The case was run to a simulation time of 1 s. Checkpointing every 4 hours; not saving plot files; not pulling any data through monitors; Ascent vis is the only output. Time to solution was 59580 s ≈ 16.5 hours. Here's the animation.

I also blindly decomposed this into grids of 64 cubed and ran on one GPU node (8 H100s managing about 10 grids each). The time to solution was 39454 s ≈ 11 hrs. Not bad.
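The "about 10 grids each" follows directly from the blind 64-cubed decomposition of the 192 × 576 × 192 mesh (again, just arithmetic on the numbers quoted above):

```python
# 64-cubed boxes tile the 192x576x192 mesh into 3 * 9 * 3 grids.
grids = (192 // 64) * (576 // 64) * (192 // 64)
print(grids, grids / 8)   # prints "81 10.125" -> ~10 grids per H100
```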
It's interesting that, for us, the particles get stuck at the outlet because they slam into it so forcefully and uniformly. I've seen this a number of times before; it often leads to a crash. I'm a little surprised the code chewed through this. Even beyond plugging up the pressure outflow (po), the dynamics of the remaining bed look quite a bit different qualitatively than predicted by Lethe. It would be nice to have some quantitative metrics to compare.
As a bit of a follow-up, I couldn't help but notice this is not how I would have set the problem up. Specifically, I think the grid resolution is finer than it needs to be for unresolved CFD-DEM, dx/dp ≈ 1.86. If we reduce the mesh to 128 × 384 × 128 we still have a reasonable resolution, dx/dp ≈ 2.8, with a nice base mesh, and the slightly larger dx should give a slightly larger fluid dt, which is CFL-limited to 0.9 with MFIX-Exa's default Godunov scheme. I did a quick and dirty strong scaling for just the initial 10 steps using only cubic grids from 128 to 8. The sweet spot for CPUs is usually about 16 cubed, but I just wanted to see how far up the Yupeng hill that put me for this case.
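The rank counts in that cubic-box sweep are worth writing down explicitly (my arithmetic, assuming one box per rank):

```python
import math

# Cubic-box strong-scaling sweep on the coarser 128x384x128 mesh.
nx, ny, nz = 128, 384, 128
for box in (128, 64, 32, 16, 8):
    grids = math.ceil(nx / box) * math.ceil(ny / box) * math.ceil(nz / box)
    print(f"box {box:3d}^3 -> {grids} ranks")
# 16^3 gives 1536 ranks; 8^3 balloons to 12288 (the "12k CPUs" below).
```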
As expected, 16 cubed is off ideal but still reasonable. You can still get almost a 2× speedup going down to 8 cubed, but now you're running on 12k CPUs, a considerable expense. So, I ran this case w/ 16 cubed grids 1:1 mapped to 1536 CPUs. At first glance this might seem like a lot more CPUs than before, but it perfectly fills 12 nodes (128 CPUs per node for us), whereas the previous distribution still needed 11 with some unused cores. When I first ran this case, it crashed when the slug slammed into the po. It's a bit unfortunate that this case is so dependent on the initial configuration. To perturb the initial bed, I set

`ic.init-bed.granular_temperature = 1.0e-2`

This worked. It ran. The time to solution is reduced to 45127 s ≈ 12.5 hrs. It's not substantially faster, but a bit of an improvement for only one extra node.
I also did some quick GPU timing for the 10 initial steps. (Because the original case above was already running, I only had one node / 8 H100s to use for this testing.) The results are summarized in the table below.
| gpus | grids | nx | ny | nz | time (s) |
|---|---|---|---|---|---|
| 1 | 1 | 128 | 384 | 128 | 72.2 |
| 2 | 2 | 128 | 384 | 64 | 46.8 |
| 3 | 3 | 128 | 128 | 128 | 71.1 |
| 6 | 6 | 128 | 64 | 128 | 44.8 |
| 4 | 4 | 64 | 384 | 64 | 33.0 |
| 8 | 24 | 64 | 64 | 64 | 56.0 |
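To put rough numbers on the table, here are speedup and parallel efficiency relative to the single-GPU row (just arithmetic on the quoted timings; note the decompositions differ, so this conflates mapping quality with scaling):

```python
# Strong-scaling speedup/efficiency vs. the 1-GPU baseline of 72.2 s.
t1 = 72.2
for gpus, t in [(2, 46.8), (3, 71.1), (6, 44.8), (4, 33.0), (8, 56.0)]:
    eff = t1 / (gpus * t)
    print(f"{gpus} GPUs: speedup {t1/t:.2f}x, efficiency {eff:.0%}")
```

The 4-GPU, 1:1-mapped 2 × 1 × 2 row stands out (about 2.19× speedup), consistent with the comment below about 1:1 mapping.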
Not surprisingly, a 1:1 grid-to-GPU mapping is better than making the GPUs manage multiple grids. I ran this case (including the initial perturbation) on 4 H100s with the 2 × 1 × 2 decomposition. The time to solution was 33205 s ≈ 9.2 hrs. Even if we just said the time to solution between CPU and GPU was approximately the same, I wish I could put a $ value on 1536 AMD CPUs vs 4 NVIDIA GPUs, both MSRP and cost of ownership, to know what's the better solution strategy. But that's beyond my purview. Oh well. Animation below. Notice no slug sticking to the po this time.

