From e0bb1b6836bd1a3d4fe2cd1820343b6d5ac43bc2 Mon Sep 17 00:00:00 2001
From: Hengjie Wang
Date: Thu, 2 Dec 2021 17:34:32 -0600
Subject: [PATCH] Document options for "Greedy" load balancing and reducing
 ghost particles.

---
 docs/source/grid/DualGrid.rst              |  6 +--
 docs/source/grid/LoadBalancing.rst         |  6 ++-
 docs/source/inputs/InputsLoadBalancing.rst | 45 ++++++++++++----------
 docs/source/particles/ParticlesOnGpus.rst  | 26 ++++++++++---
 4 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/docs/source/grid/DualGrid.rst b/docs/source/grid/DualGrid.rst
index de11d73..d7ed796 100644
--- a/docs/source/grid/DualGrid.rst
+++ b/docs/source/grid/DualGrid.rst
@@ -4,7 +4,7 @@
 .. role:: fortran(code)
    :language: fortran
 
-.. _ss:dual_grid:
+.. _sec:dual_grid:
 
 Dual Grid Approach
 ------------------
@@ -14,12 +14,12 @@ In MFiX-Exa the mesh work and particle work have very different requirements fo
 
 Rather than using a combined work estimate to create the same grids for mesh and particle data,
 we have the option to pursue a "dual grid" approach.
 
-With this approach the mesh (:cpp:`MultiFab`) and particle (:cpp:`ParticleContainer`) data 
+With this approach the mesh (:cpp:`MultiFab`) and particle (:cpp:`ParticleContainer`) data
 are allocated on different :cpp:`BoxArrays` with different :cpp:`DistributionMappings`.
 This enables separate load balancing strategies to be used for the mesh and particle work.
 
-The cost of this strategy, of course, is the need to copy mesh data onto temporary 
+The cost of this strategy, of course, is the need to copy mesh data onto temporary
 :cpp:`MultiFabs` defined on the particle :cpp:`BoxArrays` when mesh-particle
 communication is required.
 
diff --git a/docs/source/grid/LoadBalancing.rst b/docs/source/grid/LoadBalancing.rst
index cae0845..a85b01a 100644
--- a/docs/source/grid/LoadBalancing.rst
+++ b/docs/source/grid/LoadBalancing.rst
@@ -11,7 +11,7 @@ Load Balancing
 
 The process of load balancing is typically independent of the process of grid creation;
 the inputs to load balancing are a given set of grids with a set of weights
-assigned to each grid. 
+assigned to each grid.
 
 Single-level load balancing algorithms are sequentially applied to each AMR level independently,
 and the resulting distributions are mapped onto the ranks taking into account the weights
@@ -28,3 +28,15 @@ Options supported by AMReX include:
 
 - Round-robin: sort grids and assign them to ranks in round-robin fashion --
   specifically FAB ``i`` is owned by CPU ``i % N`` where N is the total number of MPI ranks.
+
+These methods work for both fluid and particle grids when dual-grid is enabled. MFiX-Exa also supports
+a "Greedy" load balancing algorithm for particle grids. It balances the particle count per rank and
+aligns the particle grids with the fluid grids to minimize data transfer between the two sets of grids.
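+
+For example, to select the Greedy balancer for the particle grids in a dual-grid run, the
+inputs file might contain the following sketch (the values shown are illustrative):
+
+.. code-block:: none
+
+   mfix.dual_grid         = 1
+   mfix.load_balance_type = Greedy
\ No newline at end of file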
diff --git a/docs/source/inputs/InputsLoadBalancing.rst b/docs/source/inputs/InputsLoadBalancing.rst
index e3eba85..5bf2768 100644
--- a/docs/source/inputs/InputsLoadBalancing.rst
+++ b/docs/source/inputs/InputsLoadBalancing.rst
@@ -1,3 +1,6 @@
+.. role:: cpp(code)
+   :language: c++
+
 .. _Chap:InputsLoadBalancing:
 
 Gridding and Load Balancing
@@ -35,25 +38,38 @@ The following inputs must be preceded by "fabarray_mfiter" and determine how we
 
 The following inputs must be preceded by "particles"
 
-+-------------------+-----------------------------------------------------------------------+-------------+--------------+
-|                   | Description                                                           | Type        | Default      |
-+===================+=======================================================================+=============+==============+
-| max_grid_size_x   | Maximum number of cells at level 0 in each grid in x-direction        | Int         | 32           |
-|                   | for grids in the ParticleBoxArray if dual_grid is true                |             |              |
-+-------------------+-----------------------------------------------------------------------+-------------+--------------+
-| max_grid_size_y   | Maximum number of cells at level 0 in each grid in y-direction        | Int         | 32           |
-|                   | for grids in the ParticleBoxArray if dual_grid is true                |             |              |
-+-------------------+-----------------------------------------------------------------------+-------------+--------------+
-| max_grid_size_z   | Maximum number of cells at level 0 in each grid in z-direction        | Int         | 32           |
-|                   | for grids in the ParticleBoxArray if dual_grid is true.               |             |              |
-+-------------------+-----------------------------------------------------------------------+-------------+--------------+
-| tile_size         | Maximum number of cells in each direction for (logical) tiles         | IntVect     | 1024000,8,8  |
-|                   | in the ParticleBoxArray if dual_grid is true.                         |             |              |
-+-------------------+-----------------------------------------------------------------------+-------------+--------------+
++----------------------+-----------------------------------------------------------------------+-------------+--------------+
+|                      | Description                                                           | Type        | Default      |
++======================+=======================================================================+=============+==============+
+| max_grid_size_x      | Maximum number of cells at level 0 in each grid in x-direction        | Int         | 32           |
+|                      | for grids in the ParticleBoxArray if dual_grid is true                |             |              |
++----------------------+-----------------------------------------------------------------------+-------------+--------------+
+| max_grid_size_y      | Maximum number of cells at level 0 in each grid in y-direction        | Int         | 32           |
+|                      | for grids in the ParticleBoxArray if dual_grid is true                |             |              |
++----------------------+-----------------------------------------------------------------------+-------------+--------------+
+| max_grid_size_z      | Maximum number of cells at level 0 in each grid in z-direction        | Int         | 32           |
+|                      | for grids in the ParticleBoxArray if dual_grid is true.               |             |              |
++----------------------+-----------------------------------------------------------------------+-------------+--------------+
+| tile_size            | Maximum number of cells in each direction for (logical) tiles         | IntVect     | 1024000,8,8  |
+|                      | in the ParticleBoxArray if dual_grid is true.                         |             |              |
++----------------------+-----------------------------------------------------------------------+-------------+--------------+
+| reduceGhostParticles | Whether to remove unused ghost particles                              | Bool        | false        |
++----------------------+-----------------------------------------------------------------------+-------------+--------------+
 
-Note that when running a granular simulation, i.e., no fluid phase, :cpp:`mfix.dual_grid` must be 0. Hence, 
-the :cpp:`particles.max_grid_size` (in each direction) have no meaning. Therefore the fluid grid and tile 
-sizes should be set for particle load balancing. It may also be necessary to set the blocking factors to 1. 
+Note that when running a granular simulation, i.e., no fluid phase, :cpp:`mfix.dual_grid` must be 0. Hence,
+the :cpp:`particles.max_grid_size` inputs (in each direction) have no meaning, and the fluid grid and tile
+sizes should be set for particle load balancing. It may also be necessary to set the blocking factors to 1.
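+
+For a granular case, the load-balancing inputs might then look like the following sketch
+(the ``amr`` gridding parameters are the standard AMReX ones; all values are illustrative):
+
+.. code-block:: none
+
+   mfix.dual_grid            = 0
+   amr.max_grid_size         = 32
+   amr.blocking_factor       = 1
+   mfix.load_balance_type    = KnapSack
+   mfix.knapsack_weight_type = NumParticles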
 
 The following inputs must be preceded by "mfix" and determine how we load balance:
 
@@ -66,11 +82,11 @@ The following inputs must be preceded by "mfix" and determine how we load balanc
 
 | load_balance_fluid   | Only relevant if (dual_grid); if so do we also regrid mesh data       | Int         | 1            |
 +----------------------+-----------------------------------------------------------------------+-------------+--------------+
 | load_balance_type    | What strategy to use for load balancing                               | String      | KnapSack     |
-|                      | Options are "KnapSack"or "SFC"                                        |             |              |
+|                      | Options are "KnapSack", "SFC", or "Greedy"                            |             |              |
 +----------------------+-----------------------------------------------------------------------+-------------+--------------+
 | knapsack_weight_type | What weighting function to use if using Knapsack load balancing       | String      | RunTimeCosts |
-|                      | Options are "RunTimeCosts" or "NumParticles""                         |             |              |
+|                      | Options are "RunTimeCosts" or "NumParticles"                          |             |              |
 +----------------------+-----------------------------------------------------------------------+-------------+--------------+
-| knapsack_nmax        | Maximum number of grids per MPI process if using knapsack algorithm   | Int         | 128          | 
+| knapsack_nmax        | Maximum number of grids per MPI process if using knapsack algorithm   | Int         | 128          |
 +----------------------+-----------------------------------------------------------------------+-------------+--------------+
diff --git a/docs/source/particles/ParticlesOnGpus.rst b/docs/source/particles/ParticlesOnGpus.rst
index c62bbf0..af0bfe4 100644
--- a/docs/source/particles/ParticlesOnGpus.rst
+++ b/docs/source/particles/ParticlesOnGpus.rst
@@ -1,3 +1,6 @@
+.. role:: cpp(code)
+   :language: c++
+
 Particles on GPUs
 ==========================
 
@@ -8,9 +11,9 @@ The core components of the particle method in MFIX-Exa are:
 - Particle-Particle Collisions
 - Particle-Wall Collisions
 
-Of these operations, the neighbor list construction requires the most care. 
-A neighbor list is a pre-computed list of all the neighbors a given particle can interact with over the next *n* timesteps. 
-Neighbor lists are usually constructed by binning the particles by an interaction distance, 
+Of these operations, the neighbor list construction requires the most care.
+A neighbor list is a pre-computed list of all the neighbors a given particle can interact with over the next *n* timesteps.
+Neighbor lists are usually constructed by binning the particles by an interaction distance,
 and then performing the N\ :sup:`2` distance check only on the particles in neighboring bins.
 In detail, the CPU version of the neighbor list algorithm is as follows:
 - For each tile on each level, loop over the particles, identifying the bin it belongs to.
@@ -31,7 +34,7 @@ The final on-grid neighbor list data structure consists of two arrays. First, we
 
 .. code-block:: c
 
-   // now we loop over the neighbor list and compute the forces 
+   // now we loop over the neighbor list and compute the forces
    AMREX_FOR_1D ( np, i,
    {
        ParticleType& p1 = pstruct[i];
@@ -50,9 +53,37 @@ The final on-grid neighbor list data structure consists of two arrays. First, we
 
 Note that, because of our use of managed memory to store the particle data and the neighbor list,
 the above code will work when compiled for either CPU or GPU.
 
-The above algorithm deals with constructing a neighbor list for the particles on a single grid. When domain decomposition is used, one must also make copies of particles on adjacent grids, potentially performing the necessary MPI communication for grids associated with other processes. The routines `fillNeighbors`, which computes which particles needed to be ghosted to which grid, and `updateNeighbors`, which copies up-to-date data for particles that have already been ghosted, have also been offloaded to the GPU, using techniques similar to AMReX's `Redistribute` routine. The important thing for users is that calling these functions does not trigger copying data off the GPU.
+The above algorithm deals with constructing a neighbor list for the particles on a single grid.
+When domain decomposition is used, one must also make copies of particles on adjacent grids,
+potentially performing the necessary MPI communication for grids associated with other processes.
+The routines `fillNeighbors`, which computes which particles need to be ghosted to which grid, and `updateNeighbors`,
+which copies up-to-date data for particles that have already been ghosted, have also been offloaded to the GPU,
+using techniques similar to AMReX's `Redistribute` routine.
+
+Note that when the particles are dense, only a small fraction of the ghost particles are neighbors of the particles
+inside the grid. AMReX provides the function `selectActualNeighbors` to identify the ghost particles that will
+not be used in building the neighbor list. Subsequent calls to `updateNeighbors` can then avoid transferring
+these unused ghost particles, which significantly reduces the communication cost in some tests.
+To use this optimization, set `particles.reduceGhostParticles` to :cpp:`true` in MFiX-Exa's inputs.
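+
+The overall call sequence then looks like the sketch below. Here `pc` stands for the
+:cpp:`NeighborParticleContainer` that holds the particles and `CheckPair` for a user-supplied
+functor that tests whether two particles are close enough to interact; these names (and
+`nsubsteps`) are illustrative, and `selectActualNeighbors` is only needed when
+`particles.reduceGhostParticles` is enabled.
+
+.. code-block:: c
+
+   pc.fillNeighbors();                     // copy ghost particles from adjacent grids
+   pc.selectActualNeighbors(CheckPair());  // mark the ghosts that can actually interact
+   pc.buildNeighborList(CheckPair());      // bin the particles and build the list
+
+   for (int step = 0; step < nsubsteps; ++step) {
+       // ... move particles, handle collisions ...
+       pc.updateNeighbors();               // refresh only the selected ghosts
+   }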
 
-Once the neighbor list has been constructed, collisions with both particles and walls can easily be processed. 
+Once the neighbor list has been constructed, collisions with both particles and walls can easily be processed.
 MFiX-Exa currently runs on both CPU-only and hybrid CPU/GPU architectures.
 
-- 
GitLab