Possible bug in DMP parallelism

Hello developers, I tried to modify the code in des_thermo_newvalues.f to implement the granular source term input I wanted. I divided the particles into 10 regions based on their distance from the center of the circle. The regions are shown in Figure 1.


Figure 1
I counted the total number of particles in each region at the initial moment, but the results under the DMP run are different from the results under a single-core run.
The results are:

Region        k1    k2    k3    k4    k5    k6    k7    k8    k9   k10   total

DMP core 1    93   265   490   697   967  1144  1289  1528  1725  1240
DMP core 2   128   368   544   688   992  1200  1448  1584  1695  1194
DMP core 3   112   320   560   752   976  1168  1341  1617  1749  1205
DMP core 4    93   294   504   728   888  1160  1336  1563  1741  1172
DMP total    426  1247  2098  2865  3823  4672  5414  6292  6910  4811   38558

single-run   432  1264  2112  2870  3811  4643  5382  6265  6895  4884   38558

Under the DMP run, my core decomposition is set to 2, 1, 2 (four processes). k1 through k10 are the numbers of particles in the different regions counted by the current process; I count them at line 141 of des_thermo_newvalues.f. As you can see, the results are different. Is this a bug in the DMP run?
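To make the counting concrete, here is a minimal standalone sketch of the kind of radial binning I describe above. Every name in it (np, xpos, ypos, xc, yc, rmax, kcount) is made up for illustration and is not an actual MFiX variable used in des_thermo_newvalues.f:

! Standalone sketch of binning particles into 10 radial regions.
! All names are illustrative; this is not the MFiX code itself.
program radial_bins
   implicit none
   integer, parameter :: np = 1000, nbins = 10
   double precision :: xpos(np), ypos(np)
   double precision :: xc, yc, rmax, r
   integer :: kcount(nbins), l, k

   xc = 0.5d0; yc = 0.5d0; rmax = 0.5d0
   call random_number(xpos)
   call random_number(ypos)

   kcount = 0
   do l = 1, np
      r = sqrt((xpos(l) - xc)**2 + (ypos(l) - yc)**2)
      ! map the distance from the center to a bin index 1..nbins
      k = min(nbins, int(r/rmax*dble(nbins)) + 1)
      kcount(k) = kcount(k) + 1
   end do

   ! Under DMP each process only loops over its own particles, so a count
   ! like this is a per-process count; the per-process values have to be
   ! summed (as in the table above) before comparing with a serial run.
   print '(A,10I8)', 'k1..k10: ', kcount
   print '(A,I8)',   'total:   ', sum(kcount)
end program radial_bins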
2_umf_2022-11-14T041121.280165.zip (14.5 MB)

Also, when running to 0.73 s, my calculation gave the following error.
(screenshot of the console error message)
Thank you very much if someone can answer my questions.

As explained in Different cores setting have different results - #2 by jeff.dietiker

“It is possible you will get different particle position over time due to the difference in order of operation.”

Since the totals match, and the other numbers are fairly close, I think that this does not indicate a problem, just normal variation.
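As a generic illustration of the order-of-operations point (plain Fortran, nothing MFiX-specific): floating-point addition is not associative, so evaluating the same sum in a different order, as happens when work is split across processes, can change the last digits, and over many time steps those tiny differences can shift particle positions and move a few particles between neighbouring regions.

! Plain-Fortran illustration: the same sum evaluated in two different
! orders generally differs in the last bits.
program order_of_ops
   implicit none
   integer, parameter :: n = 1000000
   double precision :: a(n), s_forward, s_backward
   integer :: i

   do i = 1, n
      a(i) = 1.0d0 / dble(i)**2
   end do

   s_forward = 0.0d0
   do i = 1, n                       ! one order (e.g. a serial run)
      s_forward = s_forward + a(i)
   end do

   s_backward = 0.0d0
   do i = n, 1, -1                   ! another order (e.g. combined partial sums)
      s_backward = s_backward + a(i)
   end do

   print *, 'forward  sum:', s_forward
   print *, 'backward sum:', s_backward
   print *, 'difference:  ', s_forward - s_backward
end program order_of_ops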

The core dump, on the other hand, indicates that something has gone wrong. Can you copy/paste the last part of the output from the console, including the whole stack trace (not just a screenshot)? Thanks.

@zxc I am able to reproduce the failure here, although it takes several hours of running. The error I’m seeing is a bit unusual:

Error: Solver crash!
munmap_chunk(): invalid pointer
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 des_thermo_newvalues_mod_MOD_des_thermo_newvalues
        at 2_umf_2022-11-14T041121.280165/des_thermo_newvalues.f:654
#1 des_time_march_MOD_des_time_step
        at des/des_time_march.f:216
#2 run_dem
        at mfix.f:211
#3 run_mfix
        at mfix.f:146
#4 main_MOD_run_mfix0
        at main.f:79

Usually, when MFiX crashes, it is with SIGFPE (floating point error, i.e. zero division or math overflow) or SIGSEGV (invalid pointer access, typically due to particles leaving the domain). In contrast, SIGABRT is relatively rare.

The error is reported on line 654 of des_thermo_newvalues.f in the project directory, that is, your code, which is your job to debug :slight_smile: But this is somewhat unusual:

652      RETURN
653
654   END SUBROUTINE DES_THERMO_NEWVALUES
655
656 END MODULE DES_THERMO_NEWVALUES_MOD

note that nothing is really happening on line 654 … hmm …

Going back to the original error message, recall that it said munmap_chunk(): invalid pointer. A little bit of searching for that term reveals that it is typically caused by an error in freeing allocated memory ("munmap" is a clue: it means "unmap memory"). It seems some allocated memory is getting deallocated twice, or deallocate is getting called with a bogus pointer value. This may be happening in the automatic cleanup that occurs on exiting the subroutine (?), or it may even be a bug in OpenMPI itself (??). I'll let you know if I get any further debugging this, and please let me know if you figure it out.
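For what it is worth, here is a minimal standalone sketch (not MFiX code, all names made up) of the second scenario, DEALLOCATE being handed a bogus pointer, i.e. an address that was never obtained from ALLOCATE. Depending on what glibc finds next to that address, it aborts with a message such as free(): invalid pointer or munmap_chunk(): invalid pointer:

! Standalone sketch: DEALLOCATE called on a pointer that does not refer
! to heap memory obtained from ALLOCATE.  This program crashes on purpose;
! glibc detects the bad free and aborts.
program bogus_free_demo
   implicit none
   double precision, target  :: stack_buf(100)   ! lives on the stack, not the heap
   double precision, pointer :: p(:)

   p => stack_buf     ! p is associated, but was never ALLOCATEd
   deallocate(p)      ! invalid: this address was never handed out by malloc
end program bogus_free_demo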

– Charles

Hello, Charles.
My code had some errors: when I allocated these arrays I forgot to deallocate them, so I have now added the deallocations.
(screenshot of the code changes)
I am re-running the program, and I will let you know afterwards if there are any errors. Thanks a lot, Charles.

Hello, Charles.
I reran the code, but it produced a new error.

corrupted size vs. prev_size

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7fa258eea08f in ???
	at /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#1  0x7fa258eea00b in __GI_raise
	at ../sysdeps/unix/sysv/linux/raise.c:51
#2  0x7fa258ec9858 in __GI_abort
	at /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79
#3  0x7fa258f3426d in __libc_message
	at ../sysdeps/posix/libc_fatal.c:155
#4  0x7fa258f3c2fb in malloc_printerr
	at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:5347
#5  0x7fa258f3c96a in unlink_chunk
	at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:1454
#6  0x7fa258f3de8a in _int_free
	at /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:4342
#7  0x7fa20c1ebce3 in __des_thermo_newvalues_mod_MOD_des_thermo_newvalues 
    at /home/u/1_5_umf/des_thermo_newvalues.f:359
#8  0x7fa20c7b2d42 in __des_time_march_MOD_des_time_step
	at /home/u/anaconda3/envs/mfix-21.4/share/mfix/src/model/des/des_time_march.f:201
#9  0x7fa20c4e9bc0 in run_dem
	at /home/u/anaconda3/envs/mfix-21.4/share/mfix/src/model/mfix.f:211
#10  0x7fa20c4e9aae in run_mfix_
	at /home/u/anaconda3/envs/mfix-21.4/share/mfix/src/model/mfix.f:146

It seems to be an error when the program deallocates the array y_pos1. I have uploaded the new des_thermo_newvalues.f. My next step will be to change the y_pos dynamic array to a static array and perform the calculation, but the result seems to be different.
des_thermo_newvalues.f (38.5 KB)
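For reference, and not necessarily what is happening in this particular file: one common way glibc ends up reporting corrupted size vs. prev_size (or a similar heap-corruption message) at a DEALLOCATE is an earlier out-of-bounds write into an allocated array. The bad write silently corrupts the heap bookkeeping, and glibc only notices when the memory is later freed, so the backtrace points at the deallocate rather than at the write. A minimal standalone sketch with made-up names:

! Standalone sketch: an out-of-bounds write corrupts the heap metadata;
! the crash only shows up later, at the DEALLOCATE.
program heap_corruption_demo
   implicit none
   double precision, allocatable :: y_demo(:)   ! illustrative name only
   integer :: i, n

   n = 100
   allocate(y_demo(n))

   ! Bug: the loop runs 8 elements past the end of the array and
   ! overwrites glibc's bookkeeping for the neighbouring heap chunk.
   do i = 1, n + 8
      y_demo(i) = dble(i)
   end do

   ! glibc detects the corruption here, not where the bad write happened.
   deallocate(y_demo)
end program heap_corruption_demo

Compiling with gfortran's -fcheck=bounds makes the run stop at the out-of-bounds write itself, which is usually much easier to track down.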

My suggestions when debugging:

  1. Simplify the problem as much as possible: decrease the number of particles and decrease the number of rings.
  2. Visualize the data. Look at where the particles are located with the different partitions, and check whether the number of particles reported by the code matches the visual inspection. This is doable if you can get down to a small number of particles (see above).
  3. Sometimes it is better to start from scratch and add new pieces of code one at a time, rather than trying to debug too complex a code. Add one or two new arrays (allocate/deallocate) at a time until it crashes. That way it will be easy to find out if you forgot to deallocate one (see the sketch after this list).
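A minimal sketch of the disciplined allocate/deallocate pattern suggested in point 3, assuming a module-level work array; the names (user_arrays_demo, y_demo) are made up:

! Sketch of a guarded allocate/deallocate pattern for user-added arrays.
module user_arrays_demo
   implicit none
   double precision, allocatable :: y_demo(:)
contains
   subroutine user_arrays_alloc(n)
      integer, intent(in) :: n
      ! allocate only once; allocating an already-allocated array is an error
      if (.not. allocated(y_demo)) allocate(y_demo(n))
      y_demo = 0.0d0
   end subroutine user_arrays_alloc

   subroutine user_arrays_free()
      ! deallocate only if allocated, so a repeated call is harmless
      if (allocated(y_demo)) deallocate(y_demo)
   end subroutine user_arrays_free
end module user_arrays_demo

program demo
   use user_arrays_demo
   implicit none
   call user_arrays_alloc(1000)
   call user_arrays_free()
   call user_arrays_free()   ! safe: the guard prevents a double deallocate
end program demo

With guards like these, a forgotten deallocation shows up as a memory leak rather than a crash, and adding one new array at a time makes it obvious which one is the culprit.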

Thank you very much, Jeff! Your suggestions are very helpful to me.