Particles on GPUs
The particle kernels are compute-intensive and can in principle be processed asynchronously with parts of the fluid advance. The core components of the particle method in MFIX-Exa are:
Neighbor List Construction
Particle-Particle Collisions
Particle-Wall Collisions
Of these operations, the neighbor list construction requires the most care. A neighbor list is a pre-computed list of all the neighbors a given particle can interact with over the next n timesteps. Neighbor lists are usually constructed by binning the particles by an interaction distance, and then performing the O(N²) distance check only on the particles in neighboring bins. In detail, the CPU version of the neighbor list algorithm is as follows (a standalone sketch is given after the list):
For each tile on each level, loop over the particles, identifying the bin to which each particle belongs.
Add the particle to a linked-list for the cell that owns it.
For each cell, loop over all the particles, and then loop over all potential collision partners in the neighboring cells.
If a collision partner is close enough, add it to that particle’s neighbor list.
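As a rough illustration of these four steps, a minimal standalone C++ sketch is shown below. All names (Particle, head, next, cutoff) are illustrative rather than MFIX-Exa's actual identifiers; tiles, levels, and periodic boundaries are ignored, and every particle is assumed to lie inside an nx-by-ny-by-nz cell grid anchored at the origin.

#include <algorithm>
#include <vector>

struct Particle { double x, y, z; };

std::vector<std::vector<int>>
build_neighbor_list (const std::vector<Particle>& p, double cutoff,
                     int nx, int ny, int nz)
{
    const int np = static_cast<int>(p.size());
    std::vector<int> head(nx*ny*nz, -1);  // first particle in each cell (-1 = empty)
    std::vector<int> next(np, -1);        // next particle in the same cell

    // Steps 1-2: identify each particle's bin and prepend it to that cell's linked list.
    for (int n = 0; n < np; ++n) {
        int i = static_cast<int>(p[n].x / cutoff);
        int j = static_cast<int>(p[n].y / cutoff);
        int k = static_cast<int>(p[n].z / cutoff);
        int c = i + nx*(j + ny*k);
        next[n] = head[c];
        head[c] = n;
    }

    // Steps 3-4: for each particle, walk the linked lists of the 3x3x3 block of
    // neighboring cells and keep only the partners within the cutoff distance.
    std::vector<std::vector<int>> nbors(np);
    for (int n = 0; n < np; ++n) {
        int ci = static_cast<int>(p[n].x / cutoff);
        int cj = static_cast<int>(p[n].y / cutoff);
        int ck = static_cast<int>(p[n].z / cutoff);
        for (int k = std::max(ck-1,0); k <= std::min(ck+1,nz-1); ++k) {
        for (int j = std::max(cj-1,0); j <= std::min(cj+1,ny-1); ++j) {
        for (int i = std::max(ci-1,0); i <= std::min(ci+1,nx-1); ++i) {
            for (int m = head[i + nx*(j + ny*k)]; m >= 0; m = next[m]) {
                if (m == n) continue;
                double dx = p[n].x - p[m].x;
                double dy = p[n].y - p[m].y;
                double dz = p[n].z - p[m].z;
                if (dx*dx + dy*dy + dz*dz < cutoff*cutoff) nbors[n].push_back(m);
            }
        }}}
    }
    return nbors;
}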
To port this algorithm to the GPU, we use the parallel algorithms library Thrust, distributed as part of the CUDA Toolkit. Thrust provides parallel sorting, searching, and prefix summing algorithms that are particularly useful in porting particle algorithms. To construct the neighbor list on the GPU, we follow the basic approach used by Cabana, a product of the Particle Co-Design Center:
Sort the particles on each grid by bin, using a parallel counting sort. We use Thrust’s exclusive_scan function to implement the prefix-sum phase of the sort, and hand-coded kernels for the rest. This step does not actually involve rearranging the particle data; rather, we compute a permutation that would put the particles in order without actually reordering them. (A sketch of this counting sort is given after this list.)
Once the particles are sorted by bin, we can loop over the particles in neighboring bins. We make two passes over the particles. First, we launch a kernel to count the number of collision partners for each particle.
Then, we sum these numbers and allocate space for our neighbor list.
Finally, we make another pass over the particles, putting them into the list at the appropriate place. (A sketch of this two-pass build follows the note below.)
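As a rough sketch of the counting-sort step, the permutation could be built with two small hand-written CUDA kernels plus Thrust’s exclusive_scan, roughly as below. The bin index of each particle is assumed to have been computed already (one per particle), np is assumed to be positive, and none of the names are MFIX-Exa’s.

#include <thrust/device_vector.h>
#include <thrust/scan.h>

__global__ void count_per_bin (const int* bin, int np, int* counts)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < np) atomicAdd(&counts[bin[i]], 1);
}

__global__ void fill_permutation (const int* bin, int np,
                                  const int* offsets, int* cursor, int* perm)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < np) {
        int dst = offsets[bin[i]] + atomicAdd(&cursor[bin[i]], 1);
        perm[dst] = i;   // particle i would land at position dst if we reordered
    }
}

void sort_particles_by_bin (const thrust::device_vector<int>& bin, int nbins,
                            thrust::device_vector<int>& perm)
{
    int np = static_cast<int>(bin.size());
    thrust::device_vector<int> counts(nbins, 0), offsets(nbins), cursor(nbins, 0);

    int nthreads = 256, nblocks = (np + nthreads - 1)/nthreads;
    count_per_bin<<<nblocks, nthreads>>>(
        thrust::raw_pointer_cast(bin.data()), np,
        thrust::raw_pointer_cast(counts.data()));

    // prefix sum of the per-bin counts gives where each bin starts
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    perm.resize(np);
    fill_permutation<<<nblocks, nthreads>>>(
        thrust::raw_pointer_cast(bin.data()), np,
        thrust::raw_pointer_cast(offsets.data()),
        thrust::raw_pointer_cast(cursor.data()),
        thrust::raw_pointer_cast(perm.data()));
}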
Note that we build a full neighbor list, meaning that if particle i appears in particle j’s list, then particle j also appears in particle i’s list. This simplifies the force-computation step when using these lists, since the forces and torques for a given particle can be updated without atomics.
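A simplified sketch of the two-pass count / prefix-sum / fill construction follows. For brevity it checks every pair of particles instead of restricting the search to neighboring bins via the permutation above, all names are illustrative, and np is assumed to be positive; the point is the structure that produces the flat neighbor list and its offsets, and the fact that examining each pair from both sides yields a full, symmetric list.

#include <thrust/device_vector.h>
#include <thrust/scan.h>

struct Pos { double x, y, z; };

__device__ bool is_close (const Pos& a, const Pos& b, double cutoff)
{
    double dx = a.x-b.x, dy = a.y-b.y, dz = a.z-b.z;
    return dx*dx + dy*dy + dz*dz < cutoff*cutoff;
}

// Pass 1: count the collision partners of each particle.
__global__ void count_partners (const Pos* p, int np, double cutoff, int* counts)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= np) return;
    int c = 0;
    for (int j = 0; j < np; ++j)
        if (j != i && is_close(p[i], p[j], cutoff)) ++c;
    counts[i] = c;
}

// Pass 2: write each particle's partners at its offset in the flat list.
// Because every pair (i,j) is visited from both sides, the list is "full":
// j appears in i's list and i appears in j's.
__global__ void fill_partners (const Pos* p, int np, double cutoff,
                               const int* offsets, int* nbor_list)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= np) return;
    int k = offsets[i];
    for (int j = 0; j < np; ++j)
        if (j != i && is_close(p[i], p[j], cutoff)) nbor_list[k++] = j;
}

void build_list (const thrust::device_vector<Pos>& p, double cutoff,
                 thrust::device_vector<int>& offsets,
                 thrust::device_vector<int>& nbor_list)
{
    int np = static_cast<int>(p.size());
    thrust::device_vector<int> counts(np);
    int nthreads = 256, nblocks = (np + nthreads - 1)/nthreads;
    const Pos* praw = thrust::raw_pointer_cast(p.data());

    count_partners<<<nblocks, nthreads>>>(praw, np, cutoff,
        thrust::raw_pointer_cast(counts.data()));

    offsets.resize(np);
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    int total = offsets[np-1] + counts[np-1];   // total size of the flat list

    nbor_list.resize(total);
    fill_partners<<<nblocks, nthreads>>>(praw, np, cutoff,
        thrust::raw_pointer_cast(offsets.data()),
        thrust::raw_pointer_cast(nbor_list.data()));
}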
The final on-grid neighbor list data structure consists of two arrays. First, we have the neighbor list itself, stored as a big, 1D array of particle indices. Then, we have an offsets array that stores, for each particle, where in the neighbor list array to look. The details of this data structure have been hidden inside an iterator, so that user code can look like:
// now we loop over the neighbor list and compute the forces
AMREX_FOR_1D ( np, i,
{
    ParticleType& p1 = pstruct[i];
    p1.rdata(PIdx::ax) = 0.0;
    p1.rdata(PIdx::ay) = 0.0;
    p1.rdata(PIdx::az) = 0.0;

    for (const auto& p2 : nbor_data.getNeighbors(i))
    {
        Real dx = p1.pos(0) - p2.pos(0);
        Real dy = p1.pos(1) - p2.pos(1);
        Real dz = p1.pos(2) - p2.pos(2);

        ...
    }
});
Note that, because of our use of managed memory to store the particle data and the neighbor list, the above code will work when compiled for either CPU or GPU.
The above algorithm constructs a neighbor list for the particles on a single grid. When domain decomposition is used, one must also make copies of particles from adjacent grids, potentially performing the necessary MPI communication for grids associated with other processes. The routines fillNeighbors, which determines which particles need to be ghosted to which grids, and updateNeighbors, which copies up-to-date data for particles that have already been ghosted, have also been offloaded to the GPU, using techniques similar to those in AMReX’s Redistribute routine.
Note that when the particles are dense, only a small fraction of the ghost particles are actually neighbors of the particles inside the grid. AMReX provides a function, selectActualNeighbors, that filters out the ghost particles that will not be used for building the neighbor list, so that subsequent calls to updateNeighbors avoid transferring these unused ghost particles. This significantly reduces the communication cost in some tests. To enable this optimization, set particles.reduceGhostParticles to true in MFIX-Exa’s inputs.
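For orientation, a hedged sketch of how these routines might fit around a particle advance is shown below. It assumes an amrex::NeighborParticleContainer-style interface; buildNeighborList and clearNeighbors are AMReX routines not discussed above, the exact CheckPair functor signature varies between AMReX versions, and the subcycling loop structure here is purely illustrative.

#include <AMReX_NeighborParticles.H>

// Functor deciding whether two particles are close enough to be neighbors.
// (Assumed two-particle signature; newer AMReX versions may differ.)
struct CheckPair {
    amrex::Real cutoff;
    template <class P>
    AMREX_GPU_HOST_DEVICE
    bool operator() (const P& p1, const P& p2) const {
        amrex::Real dx = p1.pos(0) - p2.pos(0);
        amrex::Real dy = p1.pos(1) - p2.pos(1);
        amrex::Real dz = p1.pos(2) - p2.pos(2);
        return dx*dx + dy*dy + dz*dz <= cutoff*cutoff;
    }
};

template <class NeighborPC>
void particle_step (NeighborPC& pc, amrex::Real cutoff, int nsubsteps)
{
    // Copy the particles needed from adjacent grids / other MPI ranks.
    pc.fillNeighbors();

    // Optional: keep only ghosts that can actually appear in a neighbor list,
    // so later updateNeighbors() calls move less data (reduceGhostParticles).
    pc.selectActualNeighbors(CheckPair{cutoff});

    // Build the per-grid neighbor lists from the bin-sorted particles.
    pc.buildNeighborList(CheckPair{cutoff});

    for (int n = 0; n < nsubsteps; ++n) {
        // ... compute collision forces with the neighbor list, move particles ...

        // Refresh the data of already-ghosted particles without re-ghosting.
        pc.updateNeighbors();
    }

    pc.clearNeighbors();
}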
Once the neighbor list has been constructed, collisions with both particles and walls can easily be processed.
MFIX-Exa currently runs on both CPU-only and hybrid CPU/GPU architectures.