[Openmcl-devel] Semaphore troubles

Wed May 9 17:44:49 PDT 2012

On Wed, 9 May 2012, James M. Lawrence wrote:

> I thought my example was straightforward enough, though as I mentioned
> I wish it were smaller. Following your suggestion, I have replaced the
> queue with a stack. I have also taken out the condition-wait function
> copied from bordeaux-threads. My pop function now resembles your
> consume function.
>
> The same assertion failure occurs.
>
> I am unable to reproduce it with high debug settings, or with tracing,
> or with logging.
>
> The test consists of a pair of worker threads pulling from a task
> queue. We push two tasks: one task returns immediately, the other task
> sleeps for 0.2 seconds (it can be 0.5 seconds or whatever, it just
> takes longer to fail). Since we have two workers, we should always
> obtain the result of the sleeping task second. A signal is getting
> missed, or something.

You're assuming that whichever thread pulls the lambda that returns
'SOONER off of TASKS will push 'SOONER onto RECEIVER before the thread
that pulls the other lambda (the one that sleeps for .2 seconds before
returning 'LATER) pushes 'LATER onto RECEIVER.  That assumption is likely
to hold a high percentage of the time, but I can't think of anything
that guarantees it.  (The OS scheduler may have decided that it should
let Emacs re-fontify some buffers for a while, or let the kernel
process all of those network packets that've been gumming up the
works, and when it gets back to CCL it finds that it's time for the
sleeping thread to wake up, so that thread gets scheduled and pushes
'LATER onto RECEIVER before the other thread even wakes up.)  This kind
of scenario isn't as likely as one where 'SOONER is pushed first, but
it's not wildly improbable, either.  It's "likely" that 'SOONER will
be pushed first - maybe even "highly likely".  It's more likely (more
highly likely ?) if the sleeping thread sleeps longer, but non-realtime
OSes (like most flavors of Linux, like OSX, like ...) don't make the
scheduling guarantees that you seem to be assuming.

While you're thinking "this thread should run before the other one because
it's ready to run and the other one is sleeping", the scheduler's thinking
"that CPU has been really active lately; better shut it down for a little
while so that it doesn't get too hot or consume too much power", or something
equally obscure and unintuitive.  If you change compiler options, or
do printing or logging (or otherwise change how threads use the CPU cycles
they're given), your code looks different to the scheduler and behaves
differently (in subtle and not-always-predictable ways.)

Of all the thread-related bugs that've ever existed in CCL, the most
common cause has probably been "code wasn't prepared to deal with
concurrency"; a close second is probably "code is making unwarranted
assumptions about scheduler behavior."  After many years of getting beaten
by those things, I think and hope that I'm more inclined to question some
assumptions that I used to make automatically and implicitly, and my first
reaction is to question the assumption that you're making.  It's more likely
that the thread that doesn't sleep will push 'SOONER before the thread that
sleeps pushes 'LATER, but nothing guarantees this, lots of factors affect
what happens, and all that I can see is that things that're statistically
unlikely happen occasionally.
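
If the order really matters, it has to be enforced with explicit
synchronization rather than inferred from a sleep.  The following is only a
minimal sketch of that idea, not a fix to the quoted test: ORDERED-TEST,
RECORD, SOONER-DONE and ALL-DONE are hypothetical names made up for
illustration, and the only CCL operators assumed are CCL:MAKE-LOCK,
CCL:WITH-LOCK-GRABBED, CCL:MAKE-SEMAPHORE, CCL:SIGNAL-SEMAPHORE,
CCL:WAIT-ON-SEMAPHORE and CCL:PROCESS-RUN-FUNCTION.

```lisp
;; Sketch: the "later" thread waits for an explicit signal from the
;; "sooner" thread, so the order no longer depends on the scheduler.
(defun ordered-test ()
  (let ((results '())
        (results-lock (ccl:make-lock "results"))
        (sooner-done (ccl:make-semaphore))
        (all-done (ccl:make-semaphore)))
    (flet ((record (value)
             ;; Serialize access to RESULTS across threads.
             (ccl:with-lock-grabbed (results-lock)
               (push value results))))
      (ccl:process-run-function
       "later"
       (lambda ()
         ;; Block until 'SOONER has actually been recorded.
         (ccl:wait-on-semaphore sooner-done)
         (record 'later)
         (ccl:signal-semaphore all-done)))
      (ccl:process-run-function
       "sooner"
       (lambda ()
         (record 'sooner)
         (ccl:signal-semaphore sooner-done))))
    ;; Wait until the "later" thread has finished.
    (ccl:wait-on-semaphore all-done)
    ;; RESULTS was built with PUSH, so reverse it to get arrival order.
    (ccl:with-lock-grabbed (results-lock)
      (reverse results))))
```

Here the "later" thread cannot record its value until the "sooner" thread has
signaled, whichever of the two the scheduler happens to run first.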

Scheduling behavior is likely beyond the grasp of mere mortals; we can have
a reasonable, largely accurate model of how things will behave, but we have
to bear in mind that that's all we have.

Semaphores in CCL are very thin wrappers around whatever the OS provides (POSIX
semaphores, Mach semaphores, something-or-other on Windows.)  If you say "a
[semaphore] must be getting dropped", you're either saying that there's a problem
in that very thin wrapper or that we're all doomed (because what the OS provides
doesn't work), and you're also saying that your code demonstrates this problem
and that no one else's code notices it.  Some or all of those things could be
true, but you're claiming that they must be because you think that you know
which thread will run before which other thread.  You don't know that; all you
really know is that that's probably true.
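
For reference, the definitions of MAKE-STACK, PUSH-STACK and POP-STACK used by
the quoted test were not included in this message.  A stack of that kind,
built directly on a CCL lock and a CCL semaphore, might look roughly like the
sketch below.  It is a guess at the shape of such code, not the original: the
slot names ITEMS, LOCK and AVAILABLE are hypothetical.

```lisp
;; Hypothetical reconstruction; the original definitions were not shown.
(defstruct (stack (:constructor make-stack ()))
  (items '())
  (lock (ccl:make-lock "stack"))
  ;; Counts pushed-but-not-yet-popped items.
  (available (ccl:make-semaphore)))

(defun push-stack (item stack)
  (ccl:with-lock-grabbed ((stack-lock stack))
    (push item (stack-items stack)))
  ;; One signal per pushed item, sent after the item is visible.
  (ccl:signal-semaphore (stack-available stack))
  item)

(defun pop-stack (stack)
  ;; Blocks until at least one item has been pushed.
  (ccl:wait-on-semaphore (stack-available stack))
  (ccl:with-lock-grabbed ((stack-lock stack))
    (pop (stack-items stack))))
```

With definitions like these, every successful wait corresponds to exactly one
earlier signal, so POP-STACK never finds the list empty.  Nothing in them
says anything about which of two concurrent pushers gets there first.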

> (defun test ()
>  (let ((tasks (make-stack)))
>    (loop
>       :repeat 2
>       :do (ccl:process-run-function
>            "test"
>            (lambda ()
>              (loop (funcall (or (pop-stack tasks)
>                                 (return)))))))
>    (let ((receiver (make-stack)))
>      (push-stack (lambda ()
>                    (push-stack (progn (sleep 0.2) 'later)
>                                receiver))
>                  tasks)
>      (push-stack (lambda ()
>                    (push-stack 'sooner receiver))
>                  tasks)
>      (let ((result (pop-stack receiver)))
>        (assert (eq 'sooner result)))
>      (let ((result (pop-stack receiver)))
>        (assert (eq 'later result))))
>    (push-stack nil tasks)
>    (push-stack nil tasks))
>  (format t "."))
>
> (defun run ()
>  (loop (test)))