How to configure parallel computing in Windows?

Hello developers, what else needs to be configured on a Windows system for parallel computing? Why is parallel computing so inefficient after I build the solver?

Windows supports SMP but not DMP parallelism. If you can get OpenMPI running on Windows it might be possible to run in DMP mode but we do not support this at all.

In general we notice better performance on Linux than Windows (about 10-20%) but SMP on Windows should work. We cannot say anything more about the performance of your job without seeing some details. If you want us to take a look, please upload your input files.

I’m sorry for replying so late. Our IT department checked the Windows system and found no problem with CPU performance, so I now suspect that the problem is in how parallel computing is set up in MFiX. Below is my case, which I run with 48 cores. Please help me find the reason for the low parallel efficiency. Thank you very much!
fluid_bed_tfm_2d_2024-05-23T221018.950552.zip (30.7 MB)

Hello, could you please answer my question? Thank you very much.

Please be patient. I was out sick all last week and am still getting caught up. There are more of you than us!

Hi @Liuyu

I had a chance to look at your case.

I see that you are running in SMP mode with 48 threads on Windows 10. I do not have a 48 core Windows machine available for testing, but I ran your job on 2 identical 8 core systems, one running Windows 10 and one running Linux. I set the number of SMP threads to 8.

Using top on Linux showed a CPU utilization of about 140%. This means that despite making 8 cores available, only about 1.4 were used (100% utilization means that one core is completely busy, so on an 8-core system the utilization could go as high as 800% if all cores were actively computing and not just waiting).
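The arithmetic behind that reading can be sketched in a few lines of Python. The function name and arguments here are illustrative only and are not part of MFiX or `top`; the sketch assumes the `top` convention described above, where 100% means one fully busy core.

```python
def busy_cores(top_percent: float, total_cores: int) -> tuple[float, float]:
    """Convert a top-style CPU% (100% == one fully busy core) into
    (equivalent busy cores, fraction of the machine's total capacity)."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    cores = top_percent / 100.0
    return cores, cores / total_cores


# Values from the post: about 140% reported on an 8-core machine.
cores, fraction = busy_cores(140.0, 8)
print(f"{cores:.1f} cores busy, {fraction:.1%} of the machine")
```

Read this way, the 8-thread SMP run on Linux keeps well under a quarter of the machine busy.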

Using Task Manager on Windows showed only about 40% utilization, which is appalling - not even one core was fully used, so performance is worse than serial. I suspect this has to do with the overhead of trying to schedule 48 threads - it is well-known that multithread scheduling is much more efficient on Linux than Windows, which is why we suggest Linux for serious work (also, of course, Linux supports DMP, which is not supported on Windows).

Here are plots of simulation time vs. real time for both … you can see how much slower progress is on Windows.

Linux, SMP, 8 threads

Windows, SMP, 8 threads

After about 2500 seconds, the Linux simulation is at about t=0.10 while Windows is just at t=0.06. (But neither of these is very impressive compared to the serial solver.)
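As a rough sketch, the two data points above can be turned into progress rates (simulated seconds per wall-clock second) and compared. The numbers are the approximate values quoted in this post, and the helper name is hypothetical.

```python
def progress_rate(sim_seconds: float, wall_seconds: float) -> float:
    """Simulated time advanced per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sim_seconds / wall_seconds


# Approximate values quoted above, both after about 2500 s of wall-clock time.
linux_rate = progress_rate(0.10, 2500.0)
windows_rate = progress_rate(0.06, 2500.0)

print(f"Linux:   {linux_rate:.2e} simulated s per wall s")
print(f"Windows: {windows_rate:.2e} simulated s per wall s")
print(f"Linux/Windows ratio: {linux_rate / windows_rate:.2f}")
```

The same calculation applied to a serial run of the same case would show whether SMP is helping at all.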

Parallel computing is really as much of an art as a science and you can’t always get performance gains by simply throwing more threads at a problem. Some algorithms are more amenable to an SMP approach than others. In general we find better returns from using a spatial domain decomposition and using DMP. SMP does not subdivide the domain, and only parts of the solver code are parallelizable - look for comments like !$omp parallel in the MFiX source for more.
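One way to follow the suggestion about `!$omp parallel` comments is to scan a source tree and count them per file. The sketch below makes some assumptions: the list of file suffixes is a guess, the source directory is whatever path you pass in, and only free-form `!$omp` sentinels are matched (fixed-form `c$omp` or `*$omp` are not).

```python
import re
from pathlib import Path
from typing import Iterable

# Matches the start of an OpenMP parallel region, e.g. "!$omp parallel do".
# "!$omp end parallel" is deliberately not matched.
OMP_PARALLEL = re.compile(r"^\s*!\$omp\s+parallel\b", re.IGNORECASE)

# Assumed set of Fortran source suffixes; adjust to the actual source tree.
FORTRAN_SUFFIXES = {".f", ".f90", ".f95", ".inc"}


def count_parallel_directives(lines: Iterable[str]) -> int:
    """Count lines that open an OpenMP parallel region."""
    return sum(1 for line in lines if OMP_PARALLEL.match(line))


def scan_tree(source_root: str) -> dict[str, int]:
    """Map each Fortran file under source_root to its directive count,
    skipping files that contain none."""
    counts: dict[str, int] = {}
    for path in sorted(Path(source_root).rglob("*")):
        if path.is_file() and path.suffix.lower() in FORTRAN_SUFFIXES:
            text = path.read_text(errors="replace")
            hits = count_parallel_directives(text.splitlines())
            if hits:
                counts[str(path)] = hits
    return counts
```

Calling `scan_tree` on the directory holding the solver source (a path you supply) shows which files contain parallel regions. Files absent from the result run serially under SMP.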

perf top (a great performance monitoring tool, only available on Linux) doesn’t show any obvious “hotspots” for optimization … I see the code is spending about 6% of its time computing dot_product, that is the most-called function, but since it’s only 6% of the total, there’s little to be gained by optimizing it (or any other single function). Instead, you may want to change some of the parameters in Numerics to loosen convergence criteria, try a preconditioner, different linear solver, etc.
Look at the residuals and see which is the maximum; this will help you understand why it’s taking so long to converge on each timestep. But do this experimenting with a serial (non-SMP) solver, not the 48-core SMP, which is just going to get bogged down in thread scheduling overhead as discussed above.
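The point about the 6% spent in dot_product can be checked with Amdahl's law: if a fraction p of the runtime is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch follows; the function name is illustrative, and 0.06 is the approximate figure quoted above.

```python
def amdahl_speedup(fraction: float, factor: float = float("inf")) -> float:
    """Overall speedup when `fraction` of the runtime is made `factor` times
    faster (Amdahl's law). The default factor of infinity gives the upper
    bound, i.e. that part of the code becoming free."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# About 6% of the runtime is spent in dot_product (figure quoted above).
print(f"dot_product made 2x faster: {amdahl_speedup(0.06, 2.0):.3f}x overall")
print(f"dot_product made free:      {amdahl_speedup(0.06):.3f}x overall")
```

Even in the best case the gain stays in the single-digit percent range, which is why changing the numerics is the more promising route.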

If you move to Linux, DMP will probably get you better results (again, it depends a lot on the individual case) and will let you scale to a higher number of cores than are available on a single machine (if you have a compute cluster at your disposal).

Please let us know what you find, and don’t wait 4 months to reply this time! :slight_smile:

Hope this helps,
– Charles

Hi Charles,

Thank you so much for your testing!

I am working with @Liuyu on these cases.

I have tested another, but very similar, case on different devices, including a desktop with Windows 10, the same desktop with WSL on Windows 10, and also on Compute Canada (HPC).

Indeed, on the Linux system, the computational efficiency is the highest, and it can basically utilize all cores. In Windows 10, the computational efficiency is lower than on Linux, but at that time, I could utilize about 50-70% of the computational resources (with 100% meaning all cores are being used).

I have asked @Liuyu for the same test case as yours and will try to retest it on my devices as soon as possible. I will report the results here. Of course, I will also suggest to @Liuyu that the server system be switched to Linux, which is the best solution.

Regards,
Zach

In the interest of correctness/completeness, I think I was misreading the results from Task Manager on Windows. The 40% seems to be scaled to the total number of physical CPUs (not hyperthreads). My system has 4 cores and 8 threads, and 100% utilization of one thread reads as 25% in Task Manager and 100% in Linux. In fact, I haven’t been able to find authoritative documentation about what the numbers in Task Manager actually mean and how they are scaled, but empirically it seems that a single-threaded program never goes above 25%. This does not affect the conclusion that SMP works much better on Linux than Windows.
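Under that empirical scaling (an observation from one machine, not documented Task Manager behavior), a Task Manager reading can be converted to the `top` convention by multiplying by the number of physical cores. The helper below is a hypothetical sketch of that conversion.

```python
def task_manager_to_top_percent(tm_percent: float, physical_cores: int) -> float:
    """Convert a Task Manager CPU% to a top-style CPU% (100% == one busy
    thread), ASSUMING Task Manager scales to the number of physical cores
    as observed empirically above. This scaling is not documented."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return tm_percent * physical_cores


# One fully busy thread reads as 25% on a 4-core machine.
print(task_manager_to_top_percent(25.0, 4))
# The 40% reading from the earlier test, on the same assumption.
print(task_manager_to_top_percent(40.0, 4))
```

On this assumption the earlier 40% reading corresponds to roughly 1.6 busy threads rather than less than one.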

@cgw Hello Charles,

Sorry for the delay in testing as I have been busy with moving recently. Today, I took some time to run a test using the case above.

I ran the test on my ThinkPad, which has 12 logical threads. I have installed MFiX 24.1.1 on both Windows 11 and WSL2 Ubuntu 24.

On Windows 11, there were issues during the build process that have not yet been resolved.

In Ubuntu under WSL2 on Windows 11, the compilation and execution were successful. The computational domain was divided as x=1, y=12, z=1.

The CPU utilization was high and stable: overall speed / base speed = 3.12 GHz / 2.21 GHz = 141%.

Each logical thread utilized more than 100%. The specific results are shown in the figure below:


Best,
Zach

Thanks for sharing your results. We typically expect better performance using “native” Linux than WSL, but if you get good results with WSL then that’s an interesting data point.