Different output on cluster and personal computer

Dear developers,
I debugged my case on my own computer with 4 cores in DMP. Different results occur with different core assignments. When nodesi = 2, nodesj = 1, nodesk = 2, the output is as follows:

When nodesi = 1, nodesj = 4, nodesk = 1, the output is as follows:

des_usr_var(2) is set in des_thermo_newvalues.f, and I monitor the sum of des_usr_var(2). I have uploaded my output for the different core settings.
4_pulse_nodesi=2nodesj=1nodesk=2.zip (54.9 MB)
4_pulse_nodesi=1nodesj=4nodesk=1.zip (55.5 MB)

How can I eliminate the effect of the core settings to ensure the accuracy of my output?

Thank you very much!

I see that you are running MFiX 21.4. The current release (22.2.1) has some improvements you may be interested in, for example the monitor output files now include information about the monitor type:

# 
# Run type: NEW
# Monitor type: Sum over region, Volume =   0.77861163E-02
# "Time","des_usr_var(2)" 

I am able to reproduce the result you report, where the value of des_usr_var(2) is dependent on the domain decomposition, using the current release.

When in doubt, it’s best to start with a serial (single CPU) run. You can rebuild the solver for serial, or just set nodesi = nodesj = nodesk = 1.

Doing this I obtain the value 1509.79 at t=1ms which matches the nodesi = nodesk = 1, nodesj = 4 value you reported.

I don’t completely understand what you are doing in des_thermo_newvalues.f but I suspect that your algorithm is dependent on the domain decomposition.

	      !#############
	       do k = 1,MAX_PIP
		 yposition(k) = des_pos_NEW(k,2)
	       end do
	      !############
	       call QsortC(yposition)

        y_position_max = yposition(MAX_PIP-int(0.1*MAX_PIP))

Note that if you just want to find the max position, Qsort is overkill, but I don’t understand the MAX_PIP-int(0.1*MAX_PIP). It’s likely that there is a more efficient way to compute this, without the qsort.
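For what it’s worth, selecting an element that sits near the top of a sorted array doesn’t require a full sort. Here is a rough sketch in Python (not MFiX code; `y_percentile_90` is a hypothetical helper) that picks the same element as `yposition(MAX_PIP - int(0.1*MAX_PIP))` would after an ascending sort, using a partial top-k selection instead:

```python
import heapq

def y_percentile_90(ypos):
    # Pick the same element that yposition(MAX_PIP - int(0.1*MAX_PIP))
    # selects after an ascending sort, but without sorting everything:
    # keep only the top ~10% in a heap and return its smallest member.
    n = len(ypos)
    k = int(0.1 * n) + 1  # how many elements sit at or above the target
    return min(heapq.nlargest(k, ypos))
```

Since `heapq.nlargest` keeps only the top k elements, this is O(n log k) rather than the O(n log n) of a full sort.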

However, the issue is that this y_position_max value is going to depend on the decomposition - when running on more than one node, each node can only ‘see’ a subset of the particles, so y_position_max will be dependent on the partition. I believe this is the source of the disparities you are seeing.

To check this hypothesis, add a debug statement:

        y_position_max = yposition(MAX_PIP-int(0.1*MAX_PIP))
        write(*,*) "DEBUG: MyPE=", myPE, "MAX_PIP=", MAX_PIP, "y_position_max=", y_position_max

(I also added use compar, only: myPE near the top of the file, to give access to the myPE variable. PE = ‘processing element’, myPE = index of current processor.)

Running this with nodes=1,1,1:

DEM NITs: 3   Total PIP: 38558
 DEBUG: MyPE= 0  MAX_PIP= 39264  y_position_max= 6.2155804157893971E-002
 DEBUG: MyPE= 0  MAX_PIP= 39264  y_position_max= 6.2155803851199780E-002
 DEBUG: MyPE= 0  MAX_PIP= 39264  y_position_max= 6.2155803646343259E-002

All of the particles are being handled by PE 0.

Running with 1,4,1 we see this:

DEM NITs: 3   Total PIP: 38558
 DEBUG: MyPE= 2  MAX_PIP=     4   y_position_max= 0.0000000000000000     
 DEBUG: MyPE= 1  MAX_PIP=     4   y_position_max= 0.0000000000000000     
 DEBUG: MyPE= 3  MAX_PIP=     4   y_position_max= 0.0000000000000000     
 DEBUG: MyPE= 0  MAX_PIP= 39264   y_position_max= 6.2155803105279163E-002

Note that PE 0 has most of the particles, which makes sense: they are all clustered at the bottom of the container and we are partitioning vertically (nodesj = y axis).

Now with 2,1,2:

DEM NITs: 3   Total PIP: 38558
 DEBUG: MyPE= 0  MAX_PIP= 12240  y_position_max= 6.2155804157893978E-002
 DEBUG: MyPE= 1  MAX_PIP= 12187  y_position_max= 6.2155803851199787E-002
 DEBUG: MyPE= 3  MAX_PIP= 12126  y_position_max= 6.2155803851199780E-002
 DEBUG: MyPE= 2  MAX_PIP= 12229  y_position_max= 6.2155803851199787E-002

Now, since we are slicing along the X and Z axes, the particles are more evenly distributed among the 4 PEs. The values of y_position_max computed on the 4 PEs are close but not identical. If this is really supposed to be the maximum y position of ALL the particles in the system, you’re going to have to do some bcast or send_recv between the PEs to compute the global maximum.
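To make the decomposition effect concrete, here is a toy illustration in plain Python (not MFiX code; the particle values and the two partitions are made up):

```python
# Each "PE" only sees its own slice of the particles, so a per-PE
# maximum depends on how the particles were partitioned.
particles_y = [0.010, 0.050, 0.030, 0.062, 0.020, 0.061, 0.040, 0.058]

# Two hypothetical decompositions of the same particle set:
partition_a = [particles_y[0:4], particles_y[4:8]]
partition_b = [particles_y[0:2], particles_y[2:4],
               particles_y[4:6], particles_y[6:8]]

local_max_a = [max(seg) for seg in partition_a]  # [0.062, 0.061]
local_max_b = [max(seg) for seg in partition_b]  # differs per "PE"

# The reduction every PE needs (what global_all_max provides in MFiX):
assert max(local_max_a) == max(local_max_b) == max(particles_y)
```

The local maxima differ between the two partitions, but the reduction over all PEs recovers the same global value either way.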

Hope this helps,

– Charles

Thank you very much, Charles! I read your reply. Using bcast or send_recv is too difficult for me right now, as I have not been exposed to these topics.

I tried using SMP parallelism to complete my project, and I obtain the value 1509.79 at t=1ms. The problem you mentioned above does not occur. Can you tell me why it doesn’t happen?

When in doubt, use debug print statements, as in my example above.

Running with SMP (on 8 cores) I see:

 DEBUG: MyPE= 0  MAX_PIP= 39264 y_position_max= 6.2155456290930451E-002

In SMP parallelism, the particles are not partitioned, so the qsort sees them all. Also note that you’re not going to get as much speedup with SMP as with DMP, because not all parts of the code are parallelizable. Running top, I saw that the solver was using about 350% CPU, even with 8 threads. That means it is really only about 3.5X faster than serial.

To speed things up, try doing a ‘Custom’ solver build with compiler flags -march=native -O3. We will offer this as a “Turbo” type build in a future MFiX release. This generates machine code that is faster but less portable because it is tuned to the CPU type it is running on. We typically see significant (~20%) speed increases with these flags.

Thank you very much, Charles!

Hi Charles!
Are there any examples of using bcast or send_recv between the PEs to compute the global maximum?

Fortunately, you do not have to use “raw” MPI send_recv commands, because MFiX has a small library of MPI utility functions (in mpi_utility), including global_all_max and global_all_min.

Attached is a modified version of your des_thermo_newvalues.f which uses this.

Here’s the relevant section:

      IF(ENERGY_EQ) THEN
          y_position_max = maxval(des_pos_new(:,2))
          print*, "DEBUG: MyPE=", myPE, "MAX_PIP=", MAX_PIP, "y_position_max=", y_position_max
          call global_all_max(y_position_max)
          print*, "DEBUG: MyPE=", myPE, "global max=", y_position_max

A few notes:

I removed the qsort completely and am using the Fortran builtin maxval to find the maximum value of each segment, then global_all_max to find the maximum among the segments.

You can remove the “DEBUG” lines after verifying this works.

However your original code was doing this (after sorting)

    y_position_max = yposition(MAX_PIP-int(0.1*MAX_PIP))

which is not the maximum but the 90th percentile value. I don’t see an easy way to do this in a parallel fashion. If you really need this, then I think that all the y position data will have to be sent to one node (myPE=0) and the sorting done there, then the result broadcast back out to all nodes. This is a bit more complex (and will be slower) but is doable if needed.
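For illustration, the gather, sort, and broadcast idea can be sketched in plain Python (not MFiX code; `percentile_90_gathered` is a hypothetical stand-in, with Python lists playing the role of the per-PE y-position arrays):

```python
def percentile_90_gathered(segments):
    # 'segments' stands in for the per-PE y-position arrays.
    # The root concatenates them (the MPI gather step), sorts once,
    # and picks the 90th-percentile element using the 0-based
    # equivalent of the Fortran index MAX_PIP - int(0.1*MAX_PIP).
    gathered = [y for seg in segments for y in seg]  # gather on root
    gathered.sort()                                  # sort on root only
    n = len(gathered)
    return gathered[n - int(0.1 * n) - 1]            # value to broadcast
```

In real MFiX DMP code, the concatenation would be an MPI gather onto myPE=0, and the resulting value would then be sent back to all PEs with bcast.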

– Charles
des_thermo_newvalues.f (5.3 KB)

Thank you very much, Charles! I need to find the value which is not the maximum but the 90th percentile.

That will require more work. Would it not do to find the maximum height and then multiply it by some empirical constant?

I don’t think that would work. For now I will try to use SMP to complete my project. Thank you very much, Charles.

@zxc - you are using the variable MAX_PIP where I believe you should be using PIP - note the file model/des/discretelement_mod.f where it is defined:

! PARALLEL PROCESSING: explanation of variables in parallel architecture
! pip - particles in each processor (includes the ghost particles)
! max_pip - maximum allocated particles in processor

! Number of particles in the system (current)
      INTEGER PIP
! Global sum of particles (excluding ghost) in the system
      INTEGER TOT_PAR
! Maximum particles permitted in the system at once
      INTEGER MAX_PIP

MAX_PIP is an array size but not all elements of the array are valid. I added a DEBUG print* to des_thermo_newvalues.f and this is what it tells me running with nodes=1,4,1:

 DEBUG MYPE=  0  PIP= 38558   MAX_PIP= 39264
 DEBUG MYPE=  2  PIP=     0   MAX_PIP=     4
 DEBUG MYPE=  3  PIP=     0   MAX_PIP=     4
 DEBUG MYPE=  1  PIP=     0   MAX_PIP=     4

The number 38558 matches the actual number of particles in the system.

and 0 makes more sense than 4 for the upper partitions (4 is a minimum buffer size).

If you loop over all data from 1 to MAX_PIP you will be including some invalid data.

– Charles

I don’t think you can loop from 1 to PIP either, because the valid entries are not guaranteed to be at the beginning of the array. So you will need to do something like this example from des_granular_temperature.f

      DO LL = 1, MAX_PIP

! skipping ghost particles and particles that don't exist
         IF(IS_NONEXISTENT(LL)) CYCLE
         PC = PC + 1
         IF(IS_GHOST(LL) .OR. IS_ENTERING_GHOST(LL) .OR. IS_EXITING_GHOST(LL)) CYCLE

         GLOBAL_GRAN_ENERGY(:) = GLOBAL_GRAN_ENERGY(:) + &
            0.5d0*PMASS(LL)*(DES_VEL_NEW(LL,:)-DES_VEL_AVG(:))**2
         GLOBAL_GRAN_TEMP(:) = GLOBAL_GRAN_TEMP(:) + &
            (DES_VEL_NEW(LL,:)-DES_VEL_AVG(:))**2

         IF(PC .EQ. PIP) EXIT
      ENDDO
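The same filter-and-accumulate pattern can be sketched in Python for clarity (not MFiX code; the boolean flag lists stand in for IS_NONEXISTENT and the ghost checks):

```python
def sum_valid(values, nonexistent, ghost, pip):
    # Mirror the Fortran loop above: skip slots flagged nonexistent,
    # count each existing slot toward pip, skip ghosts when
    # accumulating, and stop early once all pip existing particles
    # have been visited.
    total, pc = 0.0, 0
    for v, ne, gh in zip(values, nonexistent, ghost):
        if ne:
            continue
        pc += 1
        if not gh:
            total += v
        if pc == pip:
            break
    return total
```

The early exit matters for performance: the array is MAX_PIP long, but only pip entries are real, so the loop can stop as soon as they have all been seen.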

Thank you, Charles. It has helped me so much; I never thought of that.


@zxc - here’s a little DMP implementation of the sort and 90th percentile code.

In this implementation, each PE sends the y-positions of its data to the root node, which does the sorting and then broadcasts the value (here called y_90) out to all the workers. There’s no simple distributed way to compute percentiles.

It uses the function global_all_sum to get the total number of particles. This is a convenience function defined in mpi_utility. The functions in this module are just wrappers to do common operations with MPI_ calls. It then uses MPI_gather and MPI_gatherv to collect the data segments.

You can read the documentation on MPI_gather and MPI_gatherv if you want to understand more about how this works. MPI_gather retrieves a segment of data from each node and concatenates them in an array. In this case, the segment is just one number: the number of particles. The MPI_gather call has to execute on all nodes, not just the root; it will do different things on the different nodes.

After collecting the number of particles in each PE, we use MPI_gatherv to gather all the y-position data. MPI_gatherv is like MPI_gather but allows the length of the data segment to differ across nodes. The receive counts and offsets are passed as array arguments.
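The counts-and-displacements bookkeeping can be sketched in plain Python (the numbers here are made up; in the real code the counts come from the MPI_gather of per-PE particle counts):

```python
# The recvcounts/displs arrays that MPI_gatherv needs. Once the
# per-PE particle counts are on the root, the displacements are a
# running sum, so each PE's segment lands at the right offset in
# the receive buffer.
counts = [3, 2, 4, 1]              # hypothetical per-PE particle counts
displs = [0] * len(counts)
for i in range(1, len(counts)):
    displs[i] = displs[i - 1] + counts[i - 1]
total = sum(counts)                # what global_all_sum of PIP returns
```

With these arrays, PE i’s data occupies slots displs[i] through displs[i] + counts[i] - 1 of the gathered buffer on the root.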

The qsort is done only on the root node, and then bcast is used to broadcast the values. Again it’s important that the bcast at L147 is outside the if (mype == root) because the bcast must happen on all PEs for them to stay in sync.

I’ve verified this works with various domain decompositions and produces consistent results. After you’ve tested and understood it you can remove all the DEBUG lines.

Small note, there are gather and gatherv functions in mpi_utility but I used the “raw” MPI_gather and MPI_gatherv instead, partly because they are better-documented and partly because I couldn’t get the other versions to work. So this is a combination of MFiX-flavored MPI (the stuff coming from mpi_utility) and “raw” MPI (the calls starting with MPI_). Since the former is built on top of the latter, this is not a problem. But this could probably be cleaned up a bit.

Also note that I didn’t change anything below L152, you may still need to add some nonexistent/ghost checks there.

des_thermo_newvalues.f (7.9 KB)

Thank you very much, Charles. I have learned a lot from your code.

Hello, Charles. I used your code, but my output is different.
Running with 2,1,2, I see this:
[screenshot]
Running with 1,4,1, I see this:
[screenshot]
The total of sum_solar_q is different, and the 90th percentile is different. I tried to modify your code before L152, but I think your code is correct. Why does this happen?

Thank you very much, Charles!

I am not able to reproduce this. I get the same results for 2,1,2 and 1,4,1.

2,1,2:
212.txt (45.8 KB)

1,4,1
141.txt (53.1 KB)

1_pulse.zip (46.3 MB)

Charles, I modified the geometric features of the model, and the results now appear different.
Run with 1,4,1:


Run with 2,1,2:

There is a slight difference in the output data.

Also, during the run I encountered the following error:
[screenshot of error message]

Hi. I ran this case with nodes=1,4,1 and nodes=2,2,2. In both cases the computation of the 90th percentile resulted in the same value, 1.2155E-002.

The screen grabs you posted look like screenshots of monitor CSV files.
This seems to be a monitor of des_usr_var(2). It’s always better to attach files vs screen-grabs.

I can reproduce this result - with the 2,1,2 decomposition I get

# Run type: NEW
# Monitor type: Sum over region, Volume =   0.33080211E-01
# "Time","des_usr_var(2)" 
    0.0000000    ,    0.0000000    
   0.10444444E-02,    17416.242  

while with 1,4,1 I get

# Run type: NEW
# Monitor type: Sum over region, Volume =   0.33080211E-01
# "Time","des_usr_var(2)" 
    0.0000000    ,    0.0000000    
   0.10444444E-02,    17381.607    

It looks like somehow your computation of des_usr_var is dependent on the domain decomposition. This is not due to the 90th percentile calculation, however: if you look at y_90, it is the same regardless of the nodes* setting.
You have to review the computation of des_usr_var

I suppose it’s also possible that there’s a problem in the monitor code for “Sum over region” in the case of DMP. To test that we can try setting des_usr_var(2,LL) to a known quantity (e.g. 1) and see if the sum depends on the nodes* setting.

If I set des_usr_var(2,LL) = 1 in des_thermo_newvalues.f then I get 38558 for the sum regardless of domain decomposition. So I think it has to be in your computation of des_usr_var.