[Openmcl-devel] Semaphore troubles

Wed May 9 23:10:31 PDT 2012

On Wed, 9 May 2012, James M. Lawrence wrote:

> I appreciate the time you've taken to respond.
>
> When I wrote "0.5 seconds or whatever", I meant that it fails for
> apparently any amount of time. As I mentioned in response to another
> reply, a 10 second sleep produces the failure as well. Is that
> consonant with your explanation?
>
> It is also unclear why
>
> * it does not fail when the loop is removed (along with the push nils)
>
> * it does not fail when (format t ".") is removed
>
> Perhaps these are just curiosities due to entropy in the underlying
> system calls.
>
> The upshot of what you're saying is that Clozure cannot reliably
> distribute work across threads, while other CL implementations can.

First of all, "Clozure" is a company; "Clozure CL" (or CCL) is a Common
Lisp implementation.  When talking to someone who works for the company
and distinguishes between those terms more than you do, you might want
to keep that in mind.  (I'm not just nitpicking; it actually takes a 
little longer to parse a sentence like that than you might think.)

Second of all, CCL doesn't "distribute work between threads"; neither
does SBCL, nor do recent versions of LispWorks, nor do other lisp
implementations that offer native threads.  The operating system does
this; a lisp implementation might or might not get in the OS's way to
some degree or in some way, but it can't exert the kind of
fine-grained control over how its threads are scheduled relative to
each other that a cooperative/userspace scheduler can and that you
still seem to be expecting.

A thread that's ready to run will probably run within 0.2 seconds, but
this isn't guaranteed (and, repeat after me, "CCL can't influence this
directly.")  It's more likely to run within 0.5 seconds (though that's
still not guaranteed), and if it doesn't run withing 10 seconds that's
probably so unlikely as to indicate a problem (though it ultimately
depends on what the overall system load is.)  If your code depends on
thread A running before thread B because thread B is sleeping at around
the time in question, you can either:

    - cross your fingers, hope that this expectation is met, and talk
      about how "the implementation can't reliably distribute work across
      multiple threads" as if that made sense, or:

    - use synchronization primitives (locks, semaphores) to enforce these
      constraints, uncross your fingers, and understand things that you
      didn't understand before.

Whether this (scheduler non-determinacy) is the cause of the problem you're
seeing or not, I'd think that you want to do something that doesn't involve
as much finger-crossing.

As I said, I think that it's surprising that you see the problem when one
thread sleeps for 10 seconds, but it occurred to me that if I could reproduce
that I might be able to poke around during that 10-second interval and see
what was going on.  I tried running your code (the first version you sent
today) and couldn't get it to fail at all (several thousand dots and no
assertion failures.)

I don't know what that means.  The machine (also a Core i7 running
Linux) was very lightly loaded; I was running a very recent trunk CCL,
but the differences between the trunk and 1.8 aren't supposed to be that
great at this point.  I couldn't get either the 32-bit or 64-bit version
to fail.

I'm curious about that and I'l try to look into it as time permits; if I
don't say something in the next couple of days, nag.

> On Wed, May 9, 2012 at 8:44 PM, Gary Byers <gb at clozure.com> wrote:
>>
>>
>> On Wed, 9 May 2012, James M. Lawrence wrote:
>>
>>> I thought my example was straightforward enough, though as I mentioned
>>> I wish it were smaller. Following your suggestion, I have replaced the
>>> queue with a stack. I have also taken out the condition-wait function
>>> copied from bordeaux-threads. My pop function now resembles your
>>> consume function.
>>>
>>> The same assertion failure occurs.
>>>
>>> I am unable to reproduce it with high debug settings, or with tracing,
>>> or with logging.
>>
>>
>>>
>>> The test consists of a pair of worker threads pulling from a task
>>> queue. We push two tasks: one task returns immediately, the other task
>>> sleeps for 0.2 seconds (it can be 0.5 seconds or whatever, it just
>>> takes longer to fail). Since we have two workers, we should always
>>> obtain the result of the sleeping task second. A signal is getting
>>> missed, or something.
>>
>>
>> You're assuming that whatever thread pulls the lambda that returns
>> 'SOONER will off of TASKS will push 'SOONER onto RECEIVER before
>> another thread pulls another lambda that sleeps for .2 seconds before
>> returning 'LATER pushes 'LATER on RECEIVER. ?That assumption is likely
>> to hold a high percentage of the time, but I can't think of anything
>> that guarantees it. (The OS scheduler may have decided that it should
>> let Emacs re-fontify some buffers for a while, or let the kernel
>> process all of those network packets that've been gumming up the
>> works, and when it gets back to CCL it finds that it's time for the
>> sleeping thread to wake up and it gets scheduled and pushes LATER
>> on RECEIVER before the other thread even wakes up. ?This kind of scenario
>> isn't as likely as one where 'SOONER is pushed first, but
>> it's not wildly improbable, either. ?It's "likely" that 'SOONER will
>> be pushed first - maybe even "highly likely". ?It's more likely (more
>> highly likely ?) if the sleeping thread sleeps longer, but non-realtime
>> OSes (like most flavors of Linux, like OSX, like ...) don't make the
>> scheduling guarantees that you seem to be assuming.
>>
>> While you're thinking "this thread should run before the other one because
>> it's ready to run and the other one is sleeping", the scheduler's thinking
>> "that CPU has been really active lately; better shut it down for a little
>> while so that it doesn't get too hot or consume too much power", or
>> something
>> equally obscure and unintuitive. ?If you change compiler options, or
>> do printing or logging (or otherwise change how threads use the CPU cycles
>> they're given), your code looks different to the scheduler and behaves
>> differently (in subtle and not-always-predictable ways.)
>>
>> Of all the thread-related bugs that've ever existed in CCL, the most
>> common cause has probably been "code wasn't prepared to deal with
>> concurrency"; a close second is probably "code is making unwarranted
>> assumptions about scheduler behavior." ?After many years of getting beaten
>> by those things, I think and hope that I'm more inclined to question some
>> assumptions that I used to make automatically and implicitly, and my first
>> reaction is to question the assumption that you're making. ?It's more likely
>> that the thread that doesn't sleep will push 'SOONER before the thread that
>> sleeps pushes 'LATER, but nothing guarantees this, lots of factors affect
>> what happens, and all that I can see is that things that're statistically
>> unlikely happen occasionally.
>>
>> Scheduling behavior is likely beyond the grasp of mere mortals; we can have
>> a reasonable, largely accurate model of how things will behave, but we have
>> to bear in mind that that's all we have.
>>
>> Semaphores in CCL are very thin wrappers around whatever the OS provides
>> (POSIX
>> semaphores, Mach semaphores, something-or-other on Windows.) ?If you say "a
>> [semaphore] must be getting dropped", you're either saying that there's a
>> problem
>> in that very thin wrapper or that we're all doomed (because what the OS
>> provides
>> doesn't work), and you're also saying that your code demonstrates this
>> problem
>> and no one else's notices. ?Some or all of those things could be true, but
>> you're
>> claiming that they must be because you think that you know which thread will
>> run before which other thread. ?You don't know that; all you really know is
>> that's
>> probably true.
>>
>>
>> ?(defun test ()
>>>
>>> ?(let ((tasks (make-stack)))
>>> ? (loop
>>> ? ? ?:repeat 2
>>> ? ? ?:do (ccl:process-run-function
>>> ? ? ? ? ? "test"
>>> ? ? ? ? ? (lambda ()
>>> ? ? ? ? ? ? (loop (funcall (or (pop-stack tasks)
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?(return)))))))
>>> ? (let ((receiver (make-stack)))
>>> ? ? (push-stack (lambda ()
>>> ? ? ? ? ? ? ? ? ? (push-stack (progn (sleep 0.2) 'later)
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? receiver))
>>> ? ? ? ? ? ? ? ? tasks)
>>> ? ? (push-stack (lambda ()
>>> ? ? ? ? ? ? ? ? ? (push-stack 'sooner receiver))
>>> ? ? ? ? ? ? ? ? tasks)
>>> ? ? (let ((result (pop-stack receiver)))
>>> ? ? ? (assert (eq 'sooner result)))
>>> ? ? (let ((result (pop-stack receiver)))
>>> ? ? ? (assert (eq 'later result))))
>>> ? (push-stack nil tasks)
>>> ? (push-stack nil tasks))
>>> ?(format t "."))
>>>
>>> (defun run ()
>>> ?(loop (test)))
>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>>
>>>
>>
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>