Some bug about dmp parallelism

zxc · November 14, 2022, 12:14pm

Hello developers, I tried to modify the code in des_thermo_newvalues.f to implement the granular source term input I wanted. I divided the particles into 10 regions based on their distance from the center of the circle. The regions are shown in Figure 1.

Figure 1
I counted the total number of particles in each region at the initial time, but the results of the dmp run differ from those of the single-core run.
The results are:

run	k1	k2	k3	k4	k5	k6	k7	k8	k9	k10	total
dmp (per core)	93	265	490	697	967	1144	1289	1528	1725	1240
dmp (per core)	128	368	544	688	992	1200	1448	1584	1695	1194
dmp (per core)	112	320	560	752	976	1168	1341	1617	1749	1205
dmp (per core)	93	294	504	728	888	1160	1336	1563	1741	1172
dmp (sum)	426	1247	2098	2865	3823	4672	5414	6292	6910	4811	38558
single-run	432	1264	2112	2870	3811	4643	5382	6265	6895	4884	38558

In the dmp run, my cores are set to 2,1,2. k1 through k10 are the numbers of particles in the different regions on the current core. I counted them at line 141 of des_thermo_newvalues.f. As you can see, the results are different. Is this a bug in the dmp run?
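For readers following along, the kind of ring counting described here can be sketched in a few lines. This is only an illustration, not the code from the attached file: the equal-width rings, the circle center (xc, zc), the outer radius r_max and the particle positions are all assumptions made up for the example.

```fortran
program ring_count_sketch
   implicit none
   integer, parameter :: n_rings = 10        ! number of concentric regions
   integer, parameter :: n_part  = 6         ! hypothetical particle count
   real, parameter    :: r_max   = 1.0       ! hypothetical outer radius
   real, parameter    :: xc = 0.0, zc = 0.0  ! hypothetical circle center
   real    :: x(n_part), z(n_part), r
   integer :: counts(n_rings), i, k

   ! Made-up positions, for illustration only
   x = [0.05, 0.15, 0.35, 0.55, 0.75, 0.95]
   z = 0.0

   counts = 0
   do i = 1, n_part
      r = sqrt((x(i) - xc)**2 + (z(i) - zc)**2)
      ! Ring index in 1..n_rings (equal-width rings assumed); the MIN clamp
      ! keeps a particle with r >= r_max from indexing past the array end
      k = min(int(r / r_max * n_rings) + 1, n_rings)
      counts(k) = counts(k) + 1
   end do

   print '(10i4)', counts
   print *, 'total =', sum(counts)
end program ring_count_sketch
```

In a dmp run each core only counts the particles it holds, so the per-core counts have to be added up before they can be compared with a single-core run, as was done by hand in the results above.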
2_umf_2022-11-14T041121.280165.zip (14.5 MB)

Also, after running to 0.73 s, my calculation gave the following error.
(screenshot of the error message)
Thank you very much if someone can answer my questions.

cgw · November 14, 2022, 5:24pm

As explained in Different cores setting have different results - #2 by jeff.dietiker

“It is possible you will get different particle position over time due to the difference in order of operation.”

Since the totals match, and the other numbers are fairly close, I think that this does not indicate a problem, just normal variation.

The core dump, on the other hand, indicates that something has gone wrong. Can you copy/paste the last part of the output from the console, including the whole stack trace (not just a screenshot)? Thanks.

cgw · November 15, 2022, 10:40pm

@zxc I am able to reproduce the failure here, although it takes several hours of running. The error I’m seeing is a bit unusual:

Error: Solver crash!
munmap_chunk(): invalid pointer
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 des_thermo_newvalues_mod_MOD_des_thermo_newvalues
        at 2_umf_2022-11-14T041121.280165/des_thermo_newvalues.f:654
#1 des_time_march_MOD_des_time_step
        at des/des_time_march.f:216
#2 run_dem
        at mfix.f:211
#3 run_mfix
        at mfix.f:146
#4 main_MOD_run_mfix0
        at main.f:79

Usually, when mfix crashes, it is with SIGFPE (floating point error, i.e. zero division or math overflow) or SIGSEGV (invalid pointer access, typically due to particles leaving the domain). In contrast, SIGABRT is relatively rare.

The error is reported on line 654 of des_thermo_newvalues.f in the project directory, that is, in your code, which is your job to debug. But this is somewhat unusual:

652      RETURN
653
654   END SUBROUTINE DES_THERMO_NEWVALUES
655
656 END MODULE DES_THERMO_NEWVALUES_MOD

note that nothing is really happening on line 654 … hmm …

Going back to the original error message, recall that it said munmap_chunk(): invalid pointer and a little bit of searching for that term reveals that this is typically due to an error in freeing allocated memory (“munmap” is a clue, it means “unmap memory”) - it seems some allocated memory is getting deallocated twice, or deallocate is getting called with a bogus pointer value. This may be happening due to some automatic cleanup which occurs on exiting the subroutine (?) - it may be a bug in openMPI itself (??) - I’ll let you know if I get any further debugging this! And please let me know if you figure it out.
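To make the "automatic cleanup on exiting the subroutine" idea concrete: in Fortran 95 and later, a local allocatable array that is not SAVEd is deallocated automatically when the procedure returns. The sketch below is a hypothetical stand-alone example (the module, subroutine and array names are invented, not taken from the attached file) showing that behaviour. If the heap has been damaged earlier, for instance by a write past the end of an array, the failure can surface during this implicit free, which would be consistent with a backtrace that points at END SUBROUTINE. That is a possible explanation, not a confirmed diagnosis.

```fortran
module auto_dealloc_sketch
   implicit none
contains
   subroutine work(n)
      integer, intent(in) :: n
      real, allocatable :: tmp(:)   ! local, unsaved allocatable array

      ! False on every entry: the array from the previous call was
      ! released automatically when that call returned
      print *, 'allocated on entry? ', allocated(tmp)

      allocate(tmp(n))
      tmp = 0.0
      ! No explicit DEALLOCATE here: tmp is freed automatically at
      ! END SUBROUTINE, so a heap error detected during that free
      ! is attributed to the END SUBROUTINE line.
   end subroutine work
end module auto_dealloc_sketch

program demo
   use auto_dealloc_sketch
   implicit none
   call work(5)
   call work(8)
end program demo
```

One consequence is that a missing DEALLOCATE for such a local array is not by itself an error, so a crash at this point more likely means the memory was corrupted somewhere before the subroutine returned.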

– Charles

zxc · November 16, 2022, 1:19am

Hello, Charles.
My code has some errors: when I allocated these arrays I forgot to deallocate them. I have now added the deallocations.
(screenshot of the added deallocate statements)
I’m re-running the program and will let you know afterwards if there are any errors. Thanks a lot, Charles.

zxc · November 16, 2022, 12:09pm

Hello, Charles.
I reran the code, but it produced a new error.

corrupted size vs. prev_size

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7fa258eea08f in ???
	at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#1  0x7fa258eea00b in __GI_raise
	at ../sysdeps/unix/sysv/linux/raise.c:51
#2  0x7fa258ec9858 in __GI_abort
	at /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79
#3  0x7fa258f3426d in __libc_message
	at ../sysdeps/posix/libc_fatal.c:155
#4  0x7fa258f3c2fb in malloc_printerr
	at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:5347
#5  0x7fa258f3c96a in unlink_chunk
	at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:1454
#6  0x7fa258f3de8a in _int_free
	at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:4342
#7  0x7fa20c1ebce3 in __des_thermo_newvalues_mod_MOD_des_thermo_newvalues 
    at /home/u/1_5_umf/des_thermo_newvalues.f:359
#8  0x7fa20c7b2d42 in __des_time_march_MOD_des_time_step
	at /home/u/anaconda3/envs/mfix-21.4/share/mfix/src/model/des/des_time_march.f:201
#9  0x7fa20c4e9bc0 in run_dem
	at /home/u/anaconda3/envs/mfix-21.4/share/mfix/src/model/mfix.f:211
#10  0x7fa20c4e9aae in run_mfix_
	at /home/u/anaconda3/envs/mfix-21.4/share/mfix/src/model/mfix.f:146

It seems to be an error that occurs when the program deallocates the array y_pos1. I have uploaded the new des_thermo_newvalues.f. My next step will be to change the y_pos dynamic array to a static array and rerun the calculation, but the results seem to be different.
des_thermo_newvalues.f (38.5 KB)

jeff.dietiker · November 16, 2022, 11:26pm

My suggestions when debugging:

Simplify the problem as much as possible: decrease the number of particles and the number of rings.
Visualize the data. Look at where the particles are located with different partitions and check whether the number of particles reported by the code matches the visual inspection. This is doable if you can get down to a small number of particles (see above).
Sometimes it is better to start from scratch and add new pieces of code one at a time, rather than trying to debug overly complex code. Add one or two new arrays (allocate/deallocate) at a time until it crashes. That way it will be easy to find out if you forgot to deallocate one.
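As a rough sketch of the "one array at a time" approach, each new array can be given its own checked ALLOCATE and DEALLOCATE, so that a failure is reported next to the array that caused it. The array name y_pos1 is borrowed from the thread; the size and everything else are hypothetical.

```fortran
program alloc_check_sketch
   implicit none
   real, allocatable :: y_pos1(:)   ! name borrowed from the thread
   integer :: n, ierr
   character(len=128) :: msg

   n = 100                          ! hypothetical size
   allocate(y_pos1(n), stat=ierr, errmsg=msg)
   if (ierr /= 0) then
      print *, 'allocate y_pos1 failed: ', trim(msg)
      stop 1
   end if

   y_pos1 = 0.0                     ! whole-array assignment stays in bounds

   ! Deallocate only what is actually allocated, and report any failure
   if (allocated(y_pos1)) then
      deallocate(y_pos1, stat=ierr, errmsg=msg)
      if (ierr /= 0) print *, 'deallocate y_pos1 failed: ', trim(msg)
   end if
end program alloc_check_sketch
```

Assuming the solver is built with gfortran (the format of the backtraces suggests so), compiling a small test like this with -g -fcheck=bounds makes an out-of-range array index stop the program at the offending line instead of silently corrupting memory. How to pass those flags to the MFiX build is not covered here.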

zxc · November 17, 2022, 9:36am

Thank you very much, Jeff! Your suggestions are very helpful to me.