[Openmcl-devel] Using suspend and resume (out of necessity)

Fri Jul 5 19:53:34 PDT 2013

The way that PROCESS-SUSPEND is implemented is something like:

;;; each thread has two private semaphores; let's call them RESUME and ACK.

(progn
   (process-interrupt
    target
    (lambda ()
      (without-interrupts
        (signal-semaphore ack)  ; acknowledge request to suspend ourselves
        (wait-on-semaphore resume))))
   (wait-on-semaphore ack))

An important difference between the real code and this sketch is that
the actual mechanism doesn't involve PROCESS-INTERRUPT; with the real
mechanism, it's very hard for a thread to ignore or significantly defer
a request to suspend itself.

There are some other differences, but both the actual code and the code
above are similar in that they increase the risk of deadlock.

A simple scenario that can lead to deadlock involves two threads (A and B)
and two locks (X and Y).  Thread A obtains the lock X and waits for the
lock Y; thread B obtains Y and waits for X, and both threads block forever.
More generally, a thread that owns something (a lock) that other threads
would have to wait for can't safely wait for something whose availability
other threads determine.  PROCESS-SUSPEND introduces exactly that kind of
situation, which is why it increases the risk of deadlock.
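
To make the A/B, X/Y scenario concrete, here is a minimal sketch of it.
DEMO-LOCK-CYCLE is a made-up name, and the second acquisition in each
thread uses TRY-LOCK rather than GRAB-LOCK purely so that the example
reports the cycle instead of hanging; with GRAB-LOCK (or
WITH-LOCK-GRABBED) in those two places, both threads would block forever.

```lisp
;;; Illustrative only: DEMO-LOCK-CYCLE is a hypothetical name, not part of CCL.
(defun demo-lock-cycle ()
  "Return a list (A-GOT-Y B-GOT-X) for the two-thread, two-lock scenario."
  (let ((x (ccl:make-lock "X"))
        (y (ccl:make-lock "Y"))
        (a-has-x (ccl:make-semaphore))
        (b-has-y (ccl:make-semaphore))
        (a-done (ccl:make-semaphore))
        (a-got-y nil)
        (b-got-x nil))
    ;; Thread A: obtain X, wait until B owns Y, then ask for Y.
    (ccl:process-run-function
     "A"
     (lambda ()
       (ccl:with-lock-grabbed (x)
         (ccl:signal-semaphore a-has-x)
         (ccl:wait-on-semaphore b-has-y)
         ;; GRAB-LOCK here would block forever; TRY-LOCK just reports failure.
         (setq a-got-y (ccl:try-lock y))
         (when a-got-y (ccl:release-lock y)))
       (ccl:signal-semaphore a-done)))
    ;; Thread B (the calling thread): obtain Y, then ask for X.
    (ccl:with-lock-grabbed (y)
      (ccl:wait-on-semaphore a-has-x)
      (setq b-got-x (ccl:try-lock x))
      (when b-got-x (ccl:release-lock x))
      (ccl:signal-semaphore b-has-y)
      ;; Keep owning Y until A has made its attempt.
      (ccl:wait-on-semaphore a-done))
    (list a-got-y b-got-x)))
```

Neither TRY-LOCK can succeed here, since each lock is still owned by the
other thread at the moment it is requested.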

If a thread (your manager thread) suspends some other thread and is
responsible for eventually resuming it, then the suspending/manager
thread can't safely wait for a lock or similar resource since that
might be owned by the suspended/target thread.  This greatly reduces
the number of things that the "manager" thread can safely do.
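
One way to live within that restriction, sketched under the assumption
that the manager only ever needs a single lock while the target is
suspended, is to never block: use TRY-LOCK and back off if the lock is
busy.  CALL-WHILE-SUSPENDED is a hypothetical helper, not something CCL
provides, and it does nothing about locks that the supplied function
might wait for internally (#_malloc's, for instance).

```lisp
;;; Hypothetical helper; the name and return convention are assumptions.
(defun call-while-suspended (target lock thunk)
  "Suspend TARGET, call THUNK while owning LOCK if LOCK is free, and
always resume TARGET.  Returns (values result t) if THUNK ran, or
(values nil nil) if LOCK was busy (possibly owned by TARGET)."
  (ccl:process-suspend target)
  (unwind-protect
       (if (ccl:try-lock lock)          ; never wait while TARGET is suspended
           (unwind-protect
                (values (funcall thunk) t)
             (ccl:release-lock lock))
           (values nil nil))            ; back off; the caller can retry later
    ;; Resume even if THUNK signals an error or throws.
    (ccl:process-resume target)))
```

The caller has to be prepared for the (values nil nil) case, typically by
resuming the target (which this does) and trying again later.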

I'm somewhat amazed that PROCESS-SUSPEND is exported/documented; I
think that happened before the risks involved in using it were clear.
Those risks have been clear for a long time, but the documentation
doesn't do a good job of scaring people out of using it.  (CCL's GC -
whatever thread the GC runs in - uses something like PROCESS-SUSPEND
to suspend other threads while the GC is running.  #_malloc and #_free
often use a lock to serialize access to their heap; the GC sometimes
frees foreign memory, and there have been some subtle and some not-so-subtle
bugs where the GC would try to free something when the #_malloc/#_free lock
was owned by a suspended thread.)

You asked whether locks were automatically released on suspension.
They aren't.  It's probably not practical to implement that (besides
locks made via MAKE-LOCK, the same issues can apply to foreign
locks/mutexes such as the one used by #_malloc, etc.), and I'm not
sure that that sort of thing is desirable anyway.  A lock and a
protocol based on a lock offer strong guarantees: if everything that
modifies a data structure does so while owning a lock, and a lock can
only be released when the owning thread voluntarily does so, code is
likely a lot easier to reason about than it would be if the lock could
be released by anything that's afraid of deadlock.

Although it shares many of the same issues as the actual implementation
of PROCESS-SUSPEND, the code sketched above avoids some of the problems
that you describe: a critical section of code can't be interrupted by
PROCESS-INTERRUPT if the section is protected by WITHOUT-INTERRUPTS (as it
should be), and calls to foreign code are effectively surrounded by
WITHOUT-INTERRUPTS.  (I don't understand your concern about PROCESS-KILL,
which uses PROCESS-INTERRUPT to tell the target thread to effectively
reset itself and run all pending UNWIND-PROTECT cleanup forms in the process.)

The code above has an additional problem, in that it's possible for
the target thread to receive an interrupt just before the
WITHOUT-INTERRUPTS takes effect.  CCL only uses PROCESS-INTERRUPT
internally in QUIT and SAVE-APPLICATION (to try to get other threads
to shut down), so if your application doesn't use it for other reasons
this shouldn't be a problem in practice.

The other problems - potential deadlock - remain.  If you can find a way
of avoiding the use of PROCESS-SUSPEND, it'd likely be worthwhile to do so.
If you can't find an alternative - if there's no other way to do what you
need to do - then I think that you just have to be aware of the need to
be very, very careful to avoid deadlock.  You certainly have enough rope
to hang yourself with; that's arguably better than not having enough rope.
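
If the building blocks can be made to cooperate, one possible
alternative is a "gate" that each worker checks at points where it
owns no locks.  This is only a sketch: GATE, GATE-CHECKPOINT, GATE-PAUSE
and GATE-RESUME are made-up names, it assumes one gate per worker
thread, and it assumes that a plain slot write becomes visible to the
other thread promptly.  It also does nothing for a worker that loops
forever without reaching a checkpoint, so it doesn't replace
PROCESS-KILL for runaway code.

```lisp
;;; Hypothetical cooperative pause; none of these names are part of CCL.
(defstruct gate
  (paused nil)                          ; true while the manager wants a pause
  (wakeup (ccl:make-semaphore)))        ; signalled by GATE-RESUME

(defun gate-checkpoint (gate)
  "Called by the worker at points where it owns no locks."
  ;; Re-check the flag after every wakeup, so that a stale signal left
  ;; over from an earlier resume can't let the worker slip through.
  (loop while (gate-paused gate)
        do (ccl:wait-on-semaphore (gate-wakeup gate))))

(defun gate-pause (gate)
  "Ask the worker to stop at its next checkpoint; does not wait for it."
  (setf (gate-paused gate) t))

(defun gate-resume (gate)
  "Let a paused worker continue."
  (setf (gate-paused gate) nil)
  (ccl:signal-semaphore (gate-wakeup gate)))
```

A worker would call (gate-checkpoint gate) between building blocks, never
inside a critical section, so a paused worker is never left owning a lock.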

On Fri, 5 Jul 2013, Florian Dietz wrote:

> Hello,
>
> I am writing a multithreaded program that uses genetic programming techniques 
> to construct new functions from existing building blocks. The purpose of this 
> is to try to find heuristics to automatically generate code to solve simple 
> problems. Due to the nature of this task, I do not have complete control over 
> the code that some experimenting processes are going to run. It is even 
> possible that they may start going into an infinite loop and there is nothing 
> I can do to prevent that without making the technique less powerful (as per 
> the halting-problem).
>
> I want to be able to pause these experimenting processes from the outside 
> (for prioritizing), but because I can't control what functions they are going 
> to run I need to use process-suspend for this. I also use process-kill to 
> stop them if I think they have hung up.
>
> Unfortunately, there is a bug, but I think I know why:
> Some of the building blocks of the constructed processes are critical 
> sections that should not be interrupted. To prevent this, I put locks around 
> them and make sure that before process-suspend or process-kill is called, all 
> locks are acquired first. The problem with this is that, as I just realized, 
> the functions process-suspend and process-kill may not wait until they have 
> taken effect before returning. Is this correct? Because if it is, it is 
> possible for my program to get into a deadlock in the following way:
>
> The manager process acquires all locks, calls process-suspend and releases 
> all locks again, but before the suspension actually takes effect. The target 
> process now enters a critical section, acquiring a lock, but before it can 
> leave the critical section again, the previously called suspension takes 
> effect. I now have a suspended process holding a lock (is this actually 
> possible or are locks automatically released on suspension?). The next time 
> the manager process wants to suspend or resume a process, it will get stuck 
> trying to acquire the lock that is being held by the suspended process.
>
> A similar problem also applies to process-kill, because a process should not 
> enter a critical section and then die so it never finishes all parts of the 
> critical section.
>
>
> Do you know if my suspicions where the bug is coming from are correct?
> If so, how can I do this instead?
>
>
> Best regards,
> Florian Dietz