High particle count leading to long runtime

Hello everyone,
I’m a beginner. While following Section 3.10 (Procedural geometry) of the official tutorial, I changed the particle diameter from the tutorial’s 0.005 m to 0.001 m and the calculation time became very long. Switching between SMP and DMP mode didn’t improve the computation time much.

My CPU is an AMD Ryzen Threadripper PRO 5995WX (64 cores), and I have 256 GB of RAM. Since I don’t have another computer to verify this, I’m wondering whether the slow calculation speed is due to the smaller particle size, or whether there are some settings I haven’t configured correctly.
0921-3D-DEM.zip (95.7 MB)



This computer should be fairly capable. The strange thing is that after reducing the particle diameter to 0.001 m, the calculation time jumped from tens of minutes to several days, as shown in the figure. The only parameter I modified was the particle diameter.



see Eq. (32)

A few comments:

  1. Reducing the particle diameter by a factor of 5 increases the number of particles by a factor of 5³ = 125 (for the same fill volume). MFiX 25.3 (released today) shows the particle count in the dashboard view:

[Dashboard screenshots: particle count at diameter 0.005 m vs. 0.001 m]

The number of particle/particle collision pairs to check scales (roughly) with the square of the number of particles, so an increase from 4 minutes to 1 day is not unreasonable (a factor of approximately 325).
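As a quick back-of-the-envelope check of that scaling (illustrative only; the actual counts are what the dashboard reports):

    # Rough scaling estimate for the diameter change (illustrative only).
    d_old, d_new = 0.005, 0.001          # particle diameters [m]

    count_factor = (d_old / d_new) ** 3  # same fill volume -> N scales as 1/d^3
    pair_factor = count_factor ** 2      # all-pairs collision checks scale as N^2

    print(f"Particle count grows by ~{count_factor:.0f}x")        # ~125x
    print(f"Pairwise checks grow by up to ~{pair_factor:,.0f}x")  # ~15,625x
    # The observed ~325x slowdown falls between the linear and quadratic bounds,
    # consistent with a neighbor search (desgrid_neigh_build in the profiles
    # below) that avoids the full all-pairs check. Smaller particles also
    # typically force a smaller DEM time step, which adds to the cost.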

  2. Throwing more CPUs at the problem doesn’t always make it go faster.

I don’t have quite the system you do - I have a 32-core AMD Ryzen.
Running with the larger particle size, in serial mode I get:

Simulation start time = 0.0000s
Simulation time reached = 1.002s
Elapsed real time = 4.314m
Time spent in I/O = 0.4849s

and with 16-core DMP:

Simulation start time = 0.0000s
Simulation time reached = 1.000s
Elapsed real time = 2.754m
Time spent in I/O = 1.105s

So 16-core DMP only got me a speedup of 1.56x.
(For comparison, 16-core SMP resulted in a speedup of 1.92)
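You can get a rough sense of how much of the runtime is actually parallelizing by inverting Amdahl’s law. This is my own back-of-the-envelope model, not an MFiX diagnostic, and it lumps communication overhead into the “serial” part:

    # Invert Amdahl's law, S = 1 / ((1 - p) + p / n), to estimate the
    # effectively parallel fraction p implied by an observed speedup S.
    def parallel_fraction(speedup, n_cores):
        return (1 - 1 / speedup) / (1 - 1 / n_cores)

    for label, speedup in [("16-core DMP", 4.314 / 2.754), ("16-core SMP", 1.92)]:
        p = parallel_fraction(speedup, 16)
        print(f"{label}: speedup {speedup:.2f}x -> ~{p:.0%} of the runtime parallelizes")
    # Roughly 40-50% of the work scales with cores here, so piling on more
    # cores (or hyperthreads) cannot come close to a linear speedup.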

Looking at “perf top” during the run:

Serial:

     8.14%  mfixsolver.so  [.] __calc_force_dem_mod_MOD_calc_force_dem
     7.02%  mfixsolver.so  [.] __calc_collision_wall_MOD_calc_dem_force_with_wall_stl
     5.98%  mfixsolver.so  [.] __leqsol_MOD_leq_matvec
     4.79%  mfixsolver.so  [.] __des_drag_gp_mod_MOD_des_drag_gp
     4.76%  mfixsolver.so  [.] __leqsol_MOD_dot_product_par
     2.81%  mfixsolver.so  [.] __wrap_pow
     2.80%  mfixsolver.so  [.] __particles_in_cell_mod_MOD_particles_in_cell
     2.68%  mfixsolver.so  [.] __drag_gs_mod_MOD_drag_syam_obrien
     2.67%  mfixsolver.so  [.] remove_collision.0.constprop.0.isra.0
     2.56%  mfixsolver.so  [.] __drag_gs_des1_mod_MOD_drag_gs_des1
     2.47%  libc.so.6      [.] memset
     2.45%  libm.so.6      [.] pow
     2.01%  mfixsolver.so  [.] __cfnewvalues_mod_MOD_cfnewvalues
     1.96%  mfixsolver.so  [.] __leq_bicgs_mod_MOD_leq_bicgs0
     1.83%  mfixsolver.so  [.] __discretelement_MOD_cross
     1.58%  mfixsolver.so  [.] __get_stl_data_mod_MOD_move_is_stl
     1.45%  mfixsolver.so  [.] __desgrid_MOD_desgrid_neigh_build

16-core DMP:

    27.62%  libopen-pal.so.80.0.5  [.] mca_btl_sm_component_progress
     6.51%  mfixsolver_dmp.so      [.] __leqsol_MOD_leq_matvec
     3.66%  libopen-pal.so.80.0.5  [.] opal_progress
     3.57%  libmpi.so.40.40.7      [.] mca_part_persist_progress
     2.90%  mfixsolver_dmp.so      [.] __get_stl_data_mod_MOD_move_is_stl
     2.43%  libc.so.6              [.] memset
     2.29%  mfixsolver_dmp.so      [.] __particles_in_cell_mod_MOD_particles_in_cell
     1.96%  mfixsolver_dmp.so      [.] __calc_collision_wall_MOD_calc_dem_force_with_wall_stl
     1.94%  libopen-pal.so.80.0.5  [.] opal_timer_linux_get_cycles_sys_timer
     1.66%  mfixsolver_dmp.so      [.] __leq_bicgs_mod_MOD_leq_bicgs0
     1.52%  libc.so.6              [.] memmove
     1.45%  mfixsolver_dmp.so      [.] __desgrid_MOD_desgrid_pic
     1.34%  mfixsolver_dmp.so      [.] __calc_force_dem_mod_MOD_calc_force_dem
     1.25%  [kernel]               [k] _copy_to_iter
     1.16%  mfixsolver_dmp.so      [.] __leqsol_MOD_dot_product_par

All of the mca and opal functions are DMP overhead, which is taking over 30% of total CPU time.

Now with the smaller particle size, there are more particles crossing between DMP nodes and even more bookkeeping overhead:

    56.18%  libopen-pal.so.80.0.5  [.] mca_btl_sm_component_progress
     7.58%  libopen-pal.so.80.0.5  [.] opal_progress
     7.42%  libmpi.so.40.40.7      [.] mca_part_persist_progress
     3.92%  libopen-pal.so.80.0.5  [.] opal_timer_linux_get_cycles_sys_timer
     2.70%  mfixsolver_dmp.so      [.] __calc_force_dem_mod_MOD_calc_force_dem
     2.25%  libmpi.so.40.40.7      [.] ompi_request_default_wait
     1.82%  mfixsolver_dmp.so      [.] __desgrid_MOD_desgrid_neigh_build
     1.63%  mfixsolver_dmp.so      [.] __desgrid_MOD_desgrid_pic
     1.59%  mca_btl_smcuda.so      [.] mca_btl_smcuda_component_progress
     1.58%  mfixsolver_dmp.so      [.] __drag_gs_des1_mod_MOD_drag_gs_des1
     1.29%  libc.so.6              [.] memset
     1.02%  libm.so.6              [.] pow

As you can see, we’re spending more time in DMP overhead than anything else.
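If you want to put a number on that, here is a small Python sketch that totals the MPI/OPAL share from a saved copy of the profile. It assumes the “percent  shared-object  symbol” layout shown above; perf_dmp.txt is a hypothetical file name:

    # Sum the share of samples landing in MPI/communication libraries
    # ("DMP overhead") from a saved perf profile in the format shown above.
    import re

    mpi_libs = ("libopen-pal", "libmpi", "mca_btl")  # communication layers
    overhead = 0.0
    with open("perf_dmp.txt") as f:
        for line in f:
            m = re.match(r"\s*([\d.]+)%\s+(\S+)", line)
            if m and m.group(2).startswith(mpi_libs):
                overhead += float(m.group(1))

    print(f"Time spent in MPI/communication libraries: ~{overhead:.1f}%")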

  3. I notice that you are using 120 cores on a 64-physical-core system via hyperthreading. This is not always the most efficient, since hyperthreads share CPU cache, memory bandwidth, etc. You may want to compare results with and without hyperthreading.
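For a quick check of whether hyperthreading/SMT is active, you can compare logical and physical core counts, e.g. with this minimal Python sketch (assumes the third-party psutil package is installed):

    # Compare logical CPUs (includes hyperthreads) with physical cores.
    import os
    import psutil  # third-party: pip install psutil

    logical = os.cpu_count()
    physical = psutil.cpu_count(logical=False)

    print(f"logical CPUs:   {logical}")
    print(f"physical cores: {physical}")
    print("SMT/hyperthreading is", "enabled" if logical != physical else "disabled")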

If you are interested in performance analysis, I suggest you familiarize yourself with the perf tool.

– Charles

See also Simulation time for 1 mm diameter particles

Charles,

Thank you for the suggestions. I followed your advice and disabled hyper-threading, then ran the simulation with 16, 32, 48, and 64 cores in SMP mode. It didn’t lead to a significant improvement in run time, so it seems the particle count is indeed the primary bottleneck.

Just to provide some context, the reason I need smaller particles is that I’m trying to simulate a cold-flow model of a Circulating Fluidized Bed (CFB). Most literature in this field uses particles sized in the hundreds of microns.

On a separate note, I’d like to ask your advice on another matter. My computer is self-assembled and I installed the operating system myself. I don’t have access to a professional CFD expert for guidance, but I’m aware that there may be useful optimizations at the BIOS and OS level, like disabling hyper-threading, which you mentioned. Could you share any other optimization recommendations you think might be beneficial?

Thanks again for your help.

– Apex

Congratulations on setting up your own machine. Other than disabling hyperthreads I really cannot offer any other tuning suggestions because (A) I don’t have any and (B) it’s a bit off-topic for this forum.

You have a few options:

  1. Reduce particle count
  2. Try the CGP (Coarse-Grained Particle) model (see the sketch after this list)
  3. Just wait…
  4. If you need to run simulations with a high particle count, and especially if you have a good GPU, you may want to look into MFiX-Exa, which is designed for large-scale simulations. MFiX-Exa does not have a graphical front-end; you set up your simulation via input files, and these are not compatible with “classic” MFiX. But you may be able to get better results.
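On option 2: the idea behind CGP is that one parcel of diameter w times the real particle diameter stands in for roughly w³ real particles, so the tracked count drops sharply. A rough illustration of the idea (my own sketch with a hypothetical particle count, not MFiX’s actual implementation; see the CGP documentation for the real setup):

    # Rough illustration of coarse-graining: a parcel of diameter w*d
    # represents about w**3 real particles of diameter d.
    n_real = 5_000_000            # hypothetical real-particle count at d = 0.001 m
    for w in (2, 3, 5):           # candidate coarse-graining ratios
        n_parcels = n_real / w**3
        print(f"w = {w}: ~{n_parcels:,.0f} parcels instead of {n_real:,} real particles")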

Good luck!
