[Openmcl-devel] Semaphore troubles

Erik Pearson erik at defunweb.com
Thu May 10 05:11:24 UTC 2012


Hi James,

Those results are certainly counter to my theory, and also counter to what
I found.  So ...

A little more poking around ... it looks like the call to sleep is being
interrupted.

(run 1)
Failed after 822: LATER: 2 (0.004005)
Failed after 84: LATER: 2 (0.00422)
Failed after 364: LATER: 2 (0.003999)
Failed after 77: LATER: 2 (0.003568)
Failed after 21: LATER: 2 (0.003631)

The number at the end is the elapsed time in seconds (measured with
get-internal-real-time) from the creation of the task to when its result is
popped off at the end. Could it be that sleep is being interrupted and not
continuing as it should?
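
For reference, a measurement of this kind can be sketched in portable
Common Lisp as follows. The name elapsed-seconds is hypothetical, chosen
only for illustration; this is not the exact code that produced the numbers
above.

```lisp
;; Hypothetical helper: seconds elapsed since START, where START is a
;; value previously returned by GET-INTERNAL-REAL-TIME.
(defun elapsed-seconds (start)
  (/ (- (get-internal-real-time) start)
     (float internal-time-units-per-second)))
```

The start value would be recorded when the task is created, and
elapsed-seconds called once its result has been popped.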

Looking at the ccl code, it does handle interruption of nanosleep, but
there is a bit of code in there for OS X Leopard that looks suspect. When
the sleep is interrupted, it compares the remaining time with the initial
time. The straightforward algorithm is to just keep sleeping for the
remaining time no matter what its value is, and it is possible that this
workaround for OS X is not reliable (it looks okay, but I'm not sure that
comparing the remaining time to the original is really guaranteed to work).
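
To illustrate the "just keep sleeping" idea without touching ccl itself,
here is a minimal sketch in portable Common Lisp. It is not the ccl
implementation (which works with the remaining time reported by nanosleep);
instead it assumes a wall-clock deadline and re-issues sleep until that
deadline has passed. The name sleep-at-least is hypothetical.

```lisp
;; Hypothetical helper, not CCL's code: sleep until a deadline computed
;; from GET-INTERNAL-REAL-TIME, calling SLEEP again if it returns early
;; for any reason (for example, because it was interrupted).
(defun sleep-at-least (seconds)
  (let* ((units internal-time-units-per-second)
         (deadline (+ (get-internal-real-time)
                      (ceiling (* seconds units)))))
    (loop
      (let ((remaining (- deadline (get-internal-real-time))))
        ;; Stop once the deadline has been reached or passed.
        (when (<= remaining 0)
          (return))
        ;; Otherwise sleep for whatever time is still left.
        (sleep (float (/ remaining units)))))))
```

Replacing (sleep 0.2) in the test with (sleep-at-least 0.2) would be one way
to check whether early returns from sleep explain the failures.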

In any case, after removing that bit of code and recompiling ccl, here is
what I have so far:

(run 0.001)
Failed after 3944: LATER: 2 (0.008116)
Failed after 28: LATER: 2 (0.0073)
Failed after 3955: LATER: 2 (0.005493)
Failed after 301: LATER: 2 (0.010319)
Failed after 1869: LATER: 2 (0.008788)

This may be the case where the sleep time is small enough for both tasks
to complete before the main thread gets down to popping the receiver
stack.

For sleep of 0.01 and 1 sec, after 10 minutes no cases have occurred yet.

Returning to the computer after dinner -- the 0.01sec sleep test produced
the "later before sooner" result after 520K iterations.

Erik.

On Wed, May 9, 2012 at 6:46 PM, James M. Lawrence <llmjjmll at gmail.com> wrote:

> I appreciate the time you've taken to respond.
>
> When I wrote "0.5 seconds or whatever", I meant that it fails for
> apparently any amount of time. As I mentioned in response to another
> reply, a 10 second sleep produces the failure as well. Is that
> consonant with your explanation?
>
> It is also unclear why
>
> * it does not fail when the loop is removed (along with the push nils)
>
> * it does not fail when (format t ".") is removed
>
> Perhaps these are just curiosities due to entropy in the underlying
> system calls.
>
> The upshot of what you're saying is that Clozure cannot reliably
> distribute work across threads, while other CL implementations can. I
> would not call it a bug, but it's at least unfortunate. In fact
> Clozure scales better than SBCL for parallel mapping and other
> functions (stats available upon request), barring these peculiar
> hiccups.
>
>
> On Wed, May 9, 2012 at 8:44 PM, Gary Byers <gb at clozure.com> wrote:
> >
> >
> > On Wed, 9 May 2012, James M. Lawrence wrote:
> >
> >> I thought my example was straightforward enough, though as I mentioned
> >> I wish it were smaller. Following your suggestion, I have replaced the
> >> queue with a stack. I have also taken out the condition-wait function
> >> copied from bordeaux-threads. My pop function now resembles your
> >> consume function.
> >>
> >> The same assertion failure occurs.
> >>
> >> I am unable to reproduce it with high debug settings, or with tracing,
> >> or with logging.
> >
> >
> >>
> >> The test consists of a pair of worker threads pulling from a task
> >> queue. We push two tasks: one task returns immediately, the other task
> >> sleeps for 0.2 seconds (it can be 0.5 seconds or whatever, it just
> >> takes longer to fail). Since we have two workers, we should always
> >> obtain the result of the sleeping task second. A signal is getting
> >> missed, or something.
> >
> >
> > You're assuming that whatever thread pulls the lambda that returns
> > 'SOONER off of TASKS will push 'SOONER onto RECEIVER before the
> > thread that pulls the other lambda (which sleeps for .2 seconds before
> > returning 'LATER) pushes 'LATER on RECEIVER.  That assumption is likely
> > to hold a high percentage of the time, but I can't think of anything
> > that guarantees it. (The OS scheduler may have decided that it should
> > let Emacs re-fontify some buffers for a while, or let the kernel
> > process all of those network packets that've been gumming up the
> > works, and when it gets back to CCL it finds that it's time for the
> > sleeping thread to wake up and it gets scheduled and pushes 'LATER
> > on RECEIVER before the other thread even wakes up.)  This kind of scenario
> > isn't as likely as one where 'SOONER is pushed first, but
> > it's not wildly improbable, either.  It's "likely" that 'SOONER will
> > be pushed first - maybe even "highly likely".  It's more likely (more
> > highly likely ?) if the sleeping thread sleeps longer, but non-realtime
> > OSes (like most flavors of Linux, like OSX, like ...) don't make the
> > scheduling guarantees that you seem to be assuming.
> >
> > While you're thinking "this thread should run before the other one
> > because it's ready to run and the other one is sleeping", the
> > scheduler's thinking "that CPU has been really active lately; better
> > shut it down for a little while so that it doesn't get too hot or
> > consume too much power", or something equally obscure and
> > unintuitive.  If you change compiler options, or do printing or
> > logging (or otherwise change how threads use the CPU cycles they're
> > given), your code looks different to the scheduler and behaves
> > differently (in subtle and not-always-predictable ways.)
> >
> > Of all the thread-related bugs that've ever existed in CCL, the most
> > common cause has probably been "code wasn't prepared to deal with
> > concurrency"; a close second is probably "code is making unwarranted
> > assumptions about scheduler behavior."  After many years of getting
> > beaten by those things, I think and hope that I'm more inclined to
> > question some assumptions that I used to make automatically and
> > implicitly, and my first reaction is to question the assumption that
> > you're making.  It's more likely that the thread that doesn't sleep
> > will push 'SOONER before the thread that sleeps pushes 'LATER, but
> > nothing guarantees this, lots of factors affect what happens, and all
> > that I can see is that things that're statistically unlikely happen
> > occasionally.
> >
> > Scheduling behavior is likely beyond the grasp of mere mortals; we can
> > have a reasonable, largely accurate model of how things will behave,
> > but we have to bear in mind that that's all we have.
> >
> > Semaphores in CCL are very thin wrappers around whatever the OS
> > provides (POSIX semaphores, Mach semaphores, something-or-other on
> > Windows.)  If you say "a [semaphore] must be getting dropped", you're
> > either saying that there's a problem in that very thin wrapper or that
> > we're all doomed (because what the OS provides doesn't work), and
> > you're also saying that your code demonstrates this problem and no one
> > else's notices.  Some or all of those things could be true, but you're
> > claiming that they must be because you think that you know which
> > thread will run before which other thread.  You don't know that; all
> > you really know is that's probably true.
> >
> >
> >> (defun test ()
> >>  (let ((tasks (make-stack)))
> >>   (loop
> >>      :repeat 2
> >>      :do (ccl:process-run-function
> >>           "test"
> >>           (lambda ()
> >>             (loop (funcall (or (pop-stack tasks)
> >>                                (return)))))))
> >>   (let ((receiver (make-stack)))
> >>     (push-stack (lambda ()
> >>                   (push-stack (progn (sleep 0.2) 'later)
> >>                               receiver))
> >>                 tasks)
> >>     (push-stack (lambda ()
> >>                   (push-stack 'sooner receiver))
> >>                 tasks)
> >>     (let ((result (pop-stack receiver)))
> >>       (assert (eq 'sooner result)))
> >>     (let ((result (pop-stack receiver)))
> >>       (assert (eq 'later result))))
> >>   (push-stack nil tasks)
> >>   (push-stack nil tasks))
> >>  (format t "."))
> >>
> >> (defun run ()
> >>  (loop (test)))
> >> _______________________________________________
> >> Openmcl-devel mailing list
> >> Openmcl-devel at clozure.com
> >> http://clozure.com/mailman/listinfo/openmcl-devel
> >>
> >>
> >