[Openmcl-devel] add-gc-hook / drain-termination-queue

Gary Byers gb at clozure.com
Sat Sep 7 03:04:21 PDT 2013



On Sat, 7 Sep 2013, Carlos Ungil wrote:

> Hi Gary,
>
> On 06/09/2013, at 19:06, Gary Byers <gb at clozure.com> wrote:
>> CCL::ADD-GC-HOOK isn't documented for a number of reasons; as you're
>> calling it, it doesn't do anything useful. [....]
>> I don't know what problem you're having, but it's not clear that causing that
>> termination function to run on a different thread fixes anything.
>
>
> I'll try to explain better why add-gc-hook is doing something useful for me (even if I'm not completely sure it is the right solution for my problem).

Start by trying to explain how it's doing anything at all.  Start that
by looking at the source code for CCL::ADD-GC-HOOK.  Note that it takes
an optional argument which determines whether the provided function is
called via the "pre-gc-hook" or the "post-gc-hook", and that this defaults
to affecting the pre-gc-hook.

Which isn't implemented, and as far as I know never has been.

If you don't believe me:

? (ccl::add-gc-hook (lambda () (format t "~&Yow!  Are we GCing yet?") (force-output t)))
#<Anonymous Function #x30200073609F>
? (gc)
NIL
? (gc)
NIL
? (gc)
NIL
? (gc)
>

You're disabled the automatic termination that CCL does ordinarily, and
replaced it with something that's never called.  I think that you should
be very suspicious of any conclusions that you might have drawn at this
point: what's been happening and what you think has been happening are
likely very different things.

The failure that you're seeing is apparently caused by two threads trying
to modify some global data structure(s) at essentially the same time, each
thread acting as if it has exclusive access to that data structure(s).  You're
trying to eliminate that problem by ensuring that all such modifications happen
on the same thread.  That could be made to work, but there are some subtle (and
some not-so subtle) problems with this approach.

A more general (and usually more robust) way to serialize access to some
shared data structure(s) is to use a lock to ensure that at most one thread
is modifying that data structure at any point in time.  If RF-PROTECT and
RF-UNPROTECT-PTR modify some (foreign) data structure and if they happen
to be the only lisp-callable functions that do so, then something like:

(defvar *r-protect-lock* (make-lock))

(defun rf-protect (ptr)
   (with-lock-grabbed (*r-protect-ptr*)
    ;; do the foreign function call here
))

(defun rf-unprotect-ptr (ptr)
   (with-lock-grabbed (*r-protect-ptr*)
    ;; do another foreign function call here
))

would (assuming that I spelled/parenthesized things correctly and that the missing
code is addded) avoid the resource contention probems that you're experiencing.


> Remember that when r-pointer objects are created the rf-protect foreign function is called. And the finalizer will take care of calling the rf-unprotect-ptr foreign function later (by default from a different thread).
>
> BASE CASE
>
>  (ql:quickload :rcl)
>  (rcl:r-init)
>  (reduce #'+ (loop repeat 10000 append (rcl:r "runif" 1)))
>
>    Warning: stack imbalance in 'c', 1522 then 1521
>    Error: unprotect_ptr: pointer not found
>    Execution halted
>
> The outcome is non-deterministic: there might be several warnings before the error, or even no error printed, but in any case the lisp process doesn't respond.
>
> My interpretation: the garbage collector is triggered and objects finalized in a background thread, resulting in concurrent calls to rf-protect and rf-unprotect-ptr.
> The R process can't handle that, and eventually it crashes.
>
> SIMPLE SOLUTION: disable automatic termination before running the test, and rely on the user calling (ccl:drain-termination-queue) at appropriate times.
> Kind of works, but there is a problem if too many objects are protected before the user manually unprotects those that have been gc'ed.
>
>  (setq ccl:*enable-automatic-termination* nil)
>  (ql:quickload :rcl)
>  (rcl:r-init)
>  (reduce #'+ (loop repeat 10000 append (rcl:r "runif" 1)))
>
>    Error: protect(): protection stack overflow
>    > Error: error calling textConnection
>    > While executing: RCL::R-FUNCALL, in process listener(1).
>    > Type :POP to abort, :R for a list of available restarts.
>    > Type :? for other options.
>
> In this case the lisp process can recover, trying to run R functions again the same error is raised until I call (ccl:drain-termination-queue)
>
> My interpretation: I'm hitting the limit in the protection stack in the R process but when I run drain-termination-queue the finalizers run correctly cleaning the stack.
>
> IMPROVED SOLUTION: run drain-termination-queue each time the garbage is collected, but do so from the main thread to be sure (?) that there are no concurrents calls to rf-protect.
>
>  (setq ccl:*enable-automatic-termination* nil)
>  (ccl::add-gc-hook
>   (let ((process ccl:*current-process*))
>     (lambda ()
>       (ccl:process-interrupt process #'ccl:drain-termination-queue)))
>   :post-gc)
>  (ql:quickload :rcl)
>  (rcl:r-init)
>  (reduce #'+ (loop repeat 10000 append (rcl:r "runif" 1)))
>
> My interpretation: Reading the documentation of process-interrupt I understand that drain-termination-queue is guaranteed to run at a time when no call to rf-protect is running (so the finalizers can run safely):
>  "It is still difficult to reliably interrupt arbitrary foreign code (that may be stateful or otherwise non-reentrant); the interrupt request is handled when such foreign code returns to or enters lisp."
> I'd say it solves my problem (using only one thread to interface with R-land seems an easy way avoid multi-thread issues) but I'm not sure whether disabling automatic-termination could also have undesirable side-effects.
>
>> You could force the termination function for R-POINTER objects to run
>> on some arbirtary thread P by making that function use PROCESS-INTERRUPT
>> (e.g.,
>> (trivial-garbage:finalize r-pointer?(lambda () (process-interrupt p (lambda () (rf-unprotect-ptr ptr)))))
>> )
>
> PROPOSED SOLUTION:  As I mentioned at the end of my previous message I also tried this solution but it doesn't work properly (I think it should: at worst I would have expected some impact on performance, given that there is an interruption for each object instead of one for each run of the garbage collector).
>
>  (ql:quickload :rcl)
>  (defun rcl::make-r-pointer (ptr)
>    (let ((r-pointer (make-instance 'rcl::r-pointer :pointer ptr)))
>      (rcl::rf-protect ptr)
>       (let ((process ccl:*current-process*))
>         (trivial-garbage:finalize r-pointer
>           (lambda () ;;; (princ ".")
>             (ccl:process-interrupt process #'rcl::rf-unprotect-ptr ptr))))
>      r-pointer))
>  (rcl:r-init)
>  (reduce #'+ (loop repeat 10000 append (rcl:r "runif" 1)))
>
> It "hangs" (*), unless the number of iterations is low enough (v.g. 1000, the exact threshold seems to depend on the system load).
> It works when the (princ ".") statement above is uncommented, and then I can run many more iterations (v.g. 100000) without problems.
>
> My interpretation: CCL has issues if too many process-interruptions are scheduled at the same time. Slowing down the program (by printing) keeps the number of pending interruptions manageable.
>
> Regards,
>
> Carlos
>
> (*) I'm running MacOSX, in Activity Monitor I see that the number of Mach Messages In/Out, System Calls and Context Switches keeps growing. The output of taking a "sample" follows:
>
> Sampling process 69796 for 3 seconds with 1 millisecond of run time between samples
> Sampling completed, processing symbols...
> Analysis of sampling dx86cl (pid 69796) every 1 millisecond
> Process:         dx86cl [69796]
> Path:            /Users/ungil/lisp/ccl/dx86cl
> Load Address:    0x11000
> Identifier:      dx86cl
> Version:         ??? (???)
> Code Type:       X86 (Native)
> Parent Process:  zsh [69790]
>
> Date/Time:       2013-09-07 04:42:02.105 +0200
> OS Version:      Mac OS X 10.8.4 (12E55)
> Report Version:  7
>
> Call graph:
>    2784 Thread_866305   DispatchQueue_1: com.apple.main-thread  (serial)
>    + 2784 start  (in dx86cl) + 40  [0x33ef4]
>    +   2784 _start  (in dx86cl) + 207  [0x33fc4]
>    +     2784 main  (in dx86cl) + 1310  [0x1c52e]  pmcl-kernel.c:2125
>    +       2784 func_start  (in dx86cl) + 94  [0x19f79]
>    +         2784 ???  (in <unknown binary>)  [0x387fbc]
>    +           2784 ???  (in <unknown binary>)  [0x387fbc]
>    +             2784 SPffcall  (in dx86cl) + 85  [0x1991d]  x86-spentry32.s:4326
>    +               2784 nanosleep  (in libsystem_c.dylib) + 375  [0x96a37c10]
>    +                 2784 mach_wait_until  (in libsystem_kernel.dylib) + 10  [0x9906c8e6]
>    2784 Thread_866312
>    + 2784 thread_start  (in libsystem_c.dylib) + 34  [0x96990d4e]
>    +   2784 _pthread_start  (in libsystem_c.dylib) + 344  [0x969a65b7]
>    +     2784 lisp_thread_entry  (in dx86cl) + 300  [0x2137c]  thread_manager.c:1676
>    +       2784 func_start  (in dx86cl) + 94  [0x19f79]
>    +         2784 SPffcall  (in dx86cl) + 85  [0x1991d]  x86-spentry32.s:4326
>    +           2784 __read  (in libsystem_kernel.dylib) + 10  [0x9906fdba]
>    2784 Thread_867787   DispatchQueue_2: com.apple.libdispatch-manager  (serial)
>      2784 _dispatch_mgr_thread  (in libdispatch.dylib) + 53  [0x92b477a9]
>        2784 _dispatch_mgr_invoke  (in libdispatch.dylib) + 993  [0x92b47c71]
>          2784 kevent  (in libsystem_kernel.dylib) + 10  [0x9906f9ae]
>
> Total number in stack (recursive counted multiple, when >=5):
>
> Sort by top of stack, same collapsed (when >= 5):
>        __read  (in libsystem_kernel.dylib)        2784
>        kevent  (in libsystem_kernel.dylib)        2784
>        mach_wait_until  (in libsystem_kernel.dylib)        2784
>
>
>
>> On Fri, 6 Sep 2013, Carlos Ungil wrote:
>>
>>> Hello,
>>> to ensure that finalizers are run from the main thread, I'm doing the
>>> following:
>>> (setq ccl:*enable-automatic-termination* nil)
>>> (ccl::add-gc-hook
>>> ?(let ((process ccl:*current-process*))
>>> ? ?(lambda ()
>>> ? ? ?(ccl:process-interrupt process #'ccl:drain-termination-queue)))
>>> ?:post-gc))
>>> Is this a good idea? The fact that add-gc-hook is not exported makes me
>>> doubt...
>>> The context is the following: RCL embeds R in Common Lisp (using CFFI, the
>>> rf-protect and rf-unprotect-ptr functions below are foreign functions). When
>>> I need to hold a foreign pointer in the lisp side I wrap it in an object: I
>>> "protect" it to prevent it from being garbage-collected in the R side and I
>>> set up a finalizer to "unprotect" it when it's no longer in use.?
>>> (defun make-r-pointer (ptr)
>>> ? (let ((r-pointer (make-instance 'r-pointer :pointer ptr)))
>>> ? ? (rf-protect ptr)
>>> ? ? (trivial-garbage:finalize r-pointer?(lambda () (rf-unprotect-ptr ptr)))
>>> ? ? r-pointer))
>>> However, a function triggering garbage collection like the following one
>>> crashes the program (because the protect and unprotect functions will run
>>> concurrently, I think).
>>> (ql:quickload :rcl)
>>> (rcl:r-init)
>>> (reduce #'+ (loop repeat 10000 append (rcl:r "runif" 1)))
>>> The workaround mentioned above seems to solve the problem, but I wonder if
>>> there might be some undesirable effects.?
>>> I also tried to let the automatic termination enabled, forcing the finalizer
>>> to run?each time?from the main thread
>>> ? ? (trivial-garbage:finalize r-pointer?(let
>>> ((process??ccl:*current-process*)) (lambda () (process-interrupt process
>>> #'rf-unprotect-ptr ptr))))
>>> but I couldn't make it work (the program hangs).
>>> Cheers,
>>> Carlos
>>>
>
>



More information about the Openmcl-devel mailing list