Why do Elapsed time and Time remaining become very large after a period of simulation?

cgw · October 19, 2023, 3:43pm

1. I do not believe that the incorrect walltime and remaining-time estimates will affect your simulation results in any way. These numbers are only reported so that you can monitor the progress of the simulation. We will get this fixed, but for now it’s safe to ignore this.
2. The Unphysical field variables message is a real problem. MFiX has terminated because some quantity has gone out of the legal range, e.g. the temperature going outside the range tmin:tmax (50:4000K by default), or some pressure or density going negative. You attached the mfx file but not the job logs. Use Submit bug report from the main menu to create a full bug report including job output. Better yet, examine the .LOG files yourself and see if you can figure out which variable it is. Also look at the “10 messages” reported in the popup; one of them might specify the field variable in question.
3. The Solver crash! message is inaccurate: the solver aborted deliberately when the check_data function found unphysical values. But it looks like the process does not exit cleanly when this happens, triggering a spate of MPI_FINALIZE warnings. These are ugly but can also be safely ignored. (I will try to clean up this exit path if I can.) The only real issue here is #2 above, the Unphysical field variables… message; this is what is causing your simulation to terminate.

cgw · October 19, 2023, 4:22pm

I started reviewing the time-keeping code, and I think it’s fair to say that some of it is starting to show its age:

   double precision recursive function wall_time()

      implicit none

      INTEGER(KIND=8), SAVE :: COUNT_OLD=0, WRAP=0, COUNT_START=0
      LOGICAL, SAVE :: FIRST_CALL=.true.

      INTEGER(KIND=8)   CLOCK_CYCLE_COUNT, CLOCK_CYCLES_PER_SECOND

!                       max number of cycles; after which count is reset to 0
      INTEGER(KIND=8)   CLOCK_CYCLE_COUNT_MAX

      CALL SYSTEM_CLOCK(CLOCK_CYCLE_COUNT, CLOCK_CYCLES_PER_SECOND, CLOCK_CYCLE_COUNT_MAX)

      IF (FIRST_CALL) THEN
         FIRST_CALL = .false.
         COUNT_START = CLOCK_CYCLE_COUNT
      END IF

      IF(COUNT_OLD .GT. CLOCK_CYCLE_COUNT) THEN
!     This is unlikely. 64-bit INTEGER and 100 MHz CLOCK_CYCLES_PER_SECOND would mean 300 years until WRAP is incremented.
         WRAP = WRAP + 1
      ENDIF
      COUNT_OLD = CLOCK_CYCLE_COUNT

      WALL_TIME = DBLE(CLOCK_CYCLE_COUNT - COUNT_START)/DBLE(CLOCK_CYCLES_PER_SECOND) &
                + DBLE(WRAP) * DBLE(CLOCK_CYCLE_COUNT_MAX)/DBLE(CLOCK_CYCLES_PER_SECOND)
   end function wall_time

SYSTEM_CLOCK is a slightly outmoded way of getting the time. Here is a reference: SYSTEM_CLOCK (The GNU Fortran Compiler)

Determines the COUNT of a processor clock since an unspecified time in the past modulo COUNT_MAX, COUNT_RATE determines the number of clock ticks per second. If the platform supports a monotonic clock, that clock is used and can, depending on the platform clock implementation, provide up to nanosecond resolution. If a monotonic clock is not available, the implementation falls back to a realtime clock.

    CALL SYSTEM_CLOCK([COUNT, COUNT_RATE, COUNT_MAX])
Arguments:
    COUNT	(Optional) shall be a scalar of type INTEGER with INTENT(OUT).
    COUNT_RATE	(Optional) shall be a scalar of type INTEGER or REAL, with INTENT(OUT).
    COUNT_MAX	(Optional) shall be a scalar of type INTEGER with INTENT(OUT).

A small test program (using 64-bit INTEGERs) reports:

Count        16555615857496
Max     9223372036854775807
Hz               1000000000
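
The test program itself was not posted. A minimal sketch that would print these three values might look like the following; the program name, variable names and output format are assumptions, and only the SYSTEM_CLOCK call itself comes from the code above.

```fortran
program clock_info
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none

   ! 64-bit integers, matching the INTEGER(KIND=8) declarations in wall_time
   integer(int64) :: ticks, ticks_per_second, ticks_max

   ! Query the current count, the tick rate and the wraparound limit
   call system_clock(ticks, ticks_per_second, ticks_max)

   print '(a, i20)', 'Count ', ticks
   print '(a, i20)', 'Max   ', ticks_max
   print '(a, i20)', 'Hz    ', ticks_per_second
end program clock_info
```

The values of Count and Hz depend on the platform and compiler, so the numbers on another machine may differ from those shown above.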

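Now that Max number looks suspiciously close to the walltime number in your screenshot once we divide by Hz: Max/Hz is about 9223372037 seconds, and your reported walltime is 9223373246. The difference is only about 1209 seconds, which would be the real elapsed time.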

Using the units program for convenience:

bash$ units

You have: (9223372036854775807/1000000000) sec
You want: year
	* 292.27727

The wraparound time is on the order of 300 years, as the comment in the code indicates. So, the overflow check at line 73 is purely theoretical - it should really never trigger at all, and one could argue that it shouldn’t even be there, since there’s no way to test it (and now it’s misbehaving!)

    73	      IF(COUNT_OLD .GT. CLOCK_CYCLE_COUNT) THEN
    74	!     This is unlikely. 64-bit INTEGER and 100 MHz CLOCK_CYCLES_PER_SECOND would mean 300 years until WRAP is incremented.
    75	         WRAP = WRAP + 1
    76	      ENDIF
    77	      COUNT_OLD = CLOCK_CYCLE_COUNT

Note that if there is any non-monotonicity to the system clock, even by the tiniest amount, and the CLOCK_CYCLE_COUNT goes down, then WRAP will increment, and the reported time will leap forward by 292 years, exactly as we are seeing in your case! It’s also possible that there’s some sort of race condition with the COUNT_OLD code. But it’s clear that this flawed wraparound check is at the root of the problem.
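
This failure mode can be reproduced in isolation. The following is a minimal sketch, not MFiX code: it copies the wraparound logic into a hypothetical function fake_wall_time that takes the tick count as an argument instead of calling SYSTEM_CLOCK, so that a backward step can be injected by hand. The 1 GHz rate and the Max value are taken from the test output above; the tick values fed in are made up.

```fortran
program wrap_demo
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none

   ! Assumed clock parameters, taken from the test output (1 GHz, 64-bit max)
   integer(int64), parameter :: rate = 1000000000_int64
   integer(int64), parameter :: ticks_max = huge(0_int64)

   ! State that wall_time keeps in SAVEd variables
   integer(int64) :: count_old = 0, wrap = 0, count_start = 0
   logical :: first_call = .true.

   real(real64) :: t1, t2, t3

   t1 = fake_wall_time(5000000000_int64)   ! first call defines time zero
   t2 = fake_wall_time(6000000000_int64)   ! one second later
   t3 = fake_wall_time(5999999999_int64)   ! clock steps back by one tick

   print '(a, f20.3)', 'first call        (s): ', t1
   print '(a, f20.3)', 'one second later  (s): ', t2
   print '(a, f20.3)', 'after 1 ns glitch (s): ', t3
   print '(a, f20.3)', 'jump in years        : ', (t3 - t2) / (365.25_real64 * 86400.0_real64)

contains

   ! Same logic as wall_time, but the tick count is passed in by the caller
   function fake_wall_time(ticks) result(t)
      integer(int64), intent(in) :: ticks
      real(real64) :: t

      if (first_call) then
         first_call = .false.
         count_start = ticks
      end if

      ! The flawed check: any backward step is treated as a full wraparound
      if (count_old > ticks) wrap = wrap + 1
      count_old = ticks

      t = real(ticks - count_start, real64) / real(rate, real64) &
        + real(wrap, real64) * real(ticks_max, real64) / real(rate, real64)
   end function fake_wall_time

end program wrap_demo
```

Under these assumptions, the third call should report a time of roughly 9.22e9 seconds, i.e. a jump of about 292 years caused by a one-nanosecond backward step.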

Since the SYSTEM_CLOCK subroutine has various issues, I plan on replacing this code with something a little more modern, where you don’t have to worry about the system clock rate, overflows, unspecified starting time, etc.

In the meantime, if you like, open the file time_cpu_mod.f, comment out line 75 (WRAP = WRAP + 1) with a ! and rebuild the solver. This should fix your jumpy-clock problem.
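
For illustration, with the WRAP increment disabled the function effectively reduces to something like the sketch below. This is not the actual MFiX change or the planned replacement; the module name simple_timer and the function name wall_time_nowrap are hypothetical, and the small driver program is only there to make the example self-contained.

```fortran
module simple_timer
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
contains

   ! Seconds elapsed since the first call, with no wraparound bookkeeping
   function wall_time_nowrap() result(t)
      real(real64) :: t
      integer(int64), save :: count_start = 0
      logical, save :: first_call = .true.
      integer(int64) :: ticks, ticks_per_second

      call system_clock(ticks, ticks_per_second)

      if (first_call) then
         first_call = .false.
         count_start = ticks
      end if

      ! WRAP is never incremented, so its term drops out entirely
      t = real(ticks - count_start, real64) / real(ticks_per_second, real64)
   end function wall_time_nowrap

end module simple_timer

program timer_demo
   use, intrinsic :: iso_fortran_env, only: real64
   use simple_timer, only: wall_time_nowrap
   implicit none
   real(real64) :: t0, t1

   t0 = wall_time_nowrap()   ! first call defines time zero
   t1 = wall_time_nowrap()   ! elapsed seconds since the first call

   print '(a, f12.6)', 'first call  (s): ', t0
   print '(a, f12.6)', 'second call (s): ', t1
end program timer_demo
```

With this form, a small backward step of the clock would at worst make the reported time dip by the same small amount, rather than leap forward by Max/Hz seconds.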

Tongkun_Dai · October 20, 2023, 3:05am

OK. I will try as you say. Thank you very much for your answer!