[Openmcl-devel] Semaphore troubles

Thu May 10 00:17:12 PDT 2012

If I remember correctly, the Leopard bug in nanosleep had two components:

   - if the call was interrupted right around the time it would have completed,
     errno was set to EINTR and the "remaining time" was set to a very small
     negative value (timespec.tv_sec was -1 in the "remaining time" structure)
   - even though timespec.tv_sec was negative, nanosleep() interpreted it as
     a very large integer and effectively slept forever (or until interrupted).

The workaround was to try to sanity-check the remaining time value; we don't
want to go back to sleep for a negative number of seconds, and it doesn't
make sense to do so if the remaining time is greater than the original time.
If it's exactly equal (if a signal was received before we slept at all and
nanosleep just returns the original time as remaining time), then the sanity
check rejects that and we won't sleep at all, and that certainly looks suspect.
(And by "suspect", I of course mean "wrong" ...)
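
To make the equal-case problem concrete, here is a minimal sketch of that
kind of sanity check in plain Common Lisp. It is not the actual
CCL::%NANOSLEEP code: the function names and the (seconds . nanoseconds)
representation are assumptions made up for illustration.

```lisp
;; Sketch only: these names are hypothetical, not CCL internals.
;; A time is a (seconds . nanoseconds) cons, mirroring a timespec.

(defun time< (a b)
  "True if time A is strictly earlier than time B."
  (or (< (car a) (car b))
      (and (= (car a) (car b))
           (< (cdr a) (cdr b)))))

(defun resume-sleep-p/strict (original remaining)
  "The suspect check: resume only if REMAINING is non-negative
and strictly less than ORIGINAL."
  (and (>= (car remaining) 0)
       (time< remaining original)))

(defun resume-sleep-p/inclusive (original remaining)
  "Variant that also resumes when REMAINING equals ORIGINAL, as
happens when the signal arrives before any sleeping was done."
  (and (>= (car remaining) 0)
       (not (time< original remaining))))
```

With a 0.2 second request that is interrupted before it sleeps at all,
(resume-sleep-p/strict '(0 . 200000000) '(0 . 200000000)) returns NIL, so
the sleep is abandoned, while the inclusive variant returns T. Both variants
still reject a negative tv_sec such as '(-1 . 999999000).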

If you remove that bug workaround (and bug introduction) and James'
test can still fail (though very infrequently), then theories like
those that you and I were proposing still look good to me: you can
count on a runnable thread running before a sleeping thread, unless
the sleep interval is small and you get email or a cron job runs or
something else happens to kick the thread off of whatever CPU it's on.
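
If the order of the two results really matters, it has to be enforced
explicitly rather than inferred from a sleep. A minimal sketch, assuming
CCL's lock and semaphore primitives (ccl:make-lock, ccl:with-lock-grabbed,
ccl:make-semaphore, ccl:signal-semaphore, ccl:wait-on-semaphore);
ORDERED-TEST and its local names are hypothetical and are not part of
James' test:

```lisp
;; Sketch only: ORDERED-TEST is a hypothetical name.
;; Requires Clozure CL (uses symbols from the CCL package).
(defun ordered-test ()
  (let ((results '())
        (lock (ccl:make-lock "results"))
        (sooner-done (ccl:make-semaphore))
        (all-done (ccl:make-semaphore)))
    (flet ((record (value)
             ;; Push under the lock, then tell the main thread
             ;; that one more result is available.
             (ccl:with-lock-grabbed (lock)
               (push value results))
             (ccl:signal-semaphore all-done)))
      (ccl:process-run-function
       "later"
       (lambda ()
         (sleep 0.2)
         ;; Do not rely on the sleep for ordering: wait until the
         ;; other thread says it has recorded its result.
         (ccl:wait-on-semaphore sooner-done)
         (record 'later)))
      (ccl:process-run-function
       "sooner"
       (lambda ()
         (record 'sooner)
         (ccl:signal-semaphore sooner-done))))
    ;; Wait for both workers before reading RESULTS.
    (ccl:wait-on-semaphore all-done)
    (ccl:wait-on-semaphore all-done)
    (reverse results)))
```

Here the 'LATER thread cannot record its result until the 'SOONER thread has
signalled, so the result is (SOONER LATER) regardless of what the scheduler
decides to do in between.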

I think that you're absolutely right in viewing the workaround in CCL::%NANOSLEEP
as suspect (see above), and that goes a very long way towards explaining this.

On Wed, 9 May 2012, Erik Pearson wrote:

> Hi James,
> Those results are certainly counter to my theory, and also counter to what I
> found. So ...
> 
> A little more poking around ... it looks like the call to sleep is being
> interrupted.
> 
> (run 1)
> Failed after 822: LATER: 2 (0.004005)
> Failed after 84: LATER: 2 (0.00422)
> Failed after 364: LATER: 2 (0.003999)
> Failed after 77: LATER: 2 (0.003568)
> Failed after 21: LATER: 2 (0.003631)
> 
> The number at the end is the seconds (elapsed, using get-internal-real-time)
> from the creation of the task to when it is popped off at the end. Could it
> be that sleep is being interrupted and not continuing as it should?
> 
> Looking at the ccl code, it does handle interruption from nanosleep, but
> there is a bit of code in there for OS X Leopard which is suspect. In the
> case of interruption it compares the values in the remaining time with the
> initial time. The honest algorithm is to just continue sleeping no matter
> what the remaining time values are, and it is possible that this workaround
> for OS X is not reliable (it looks okay, but I'm not sure that comparing the
> remaining to original is really guaranteed to work.)
> 
> In any case, removing that bit of code, recompiling ccl, and so far here is
> what I have:
> 
> (run 0.001)
> Failed after 3944: LATER: 2 (0.008116)
> Failed after 28: LATER: 2 (0.0073)
> Failed after 3955: LATER: 2 (0.005493)
> Failed after 301: LATER: 2 (0.010319)
> Failed after 1869: LATER: 2 (0.008788)
> 
> This may be the case where the sleep time is small enough to cause both
> tasks to complete before the main thread gets down to popping the receive
> stack.
> 
> For sleep of 0.01 and 1 sec, after 10 minutes no cases have occurred yet.
> 
> Returning to the computer after dinner -- the 0.01sec sleep test produced
> the "later before sooner" result after 520K iterations.
> 
> Erik.
> 
> On Wed, May 9, 2012 at 6:46 PM, James M. Lawrence <llmjjmll at gmail.com>
> wrote:
>       I appreciate the time you've taken to respond.
>
>       When I wrote "0.5 seconds or whatever", I meant that it fails for
>       apparently any amount of time. As I mentioned in response to another
>       reply, a 10 second sleep produces the failure as well. Is that
>       consonant with your explanation?
>
>       It is also unclear why
>
>       * it does not fail when the loop is removed (along with the push nils)
>
>       * it does not fail when (format t ".") is removed
>
>       Perhaps these are just curiosities due to entropy in the underlying
>       system calls.
>
>       The upshot of what you're saying is that Clozure cannot reliably
>       distribute work across threads, while other CL implementations can. I
>       would not call it a bug, but it's at least unfortunate. In fact
>       Clozure scales better than SBCL for parallel mapping and other
>       functions (stats available upon request), barring these peculiar
>       hiccups.
> 
>
>       On Wed, May 9, 2012 at 8:44 PM, Gary Byers <gb at clozure.com>
>       wrote:
>       >
>       >
>       > On Wed, 9 May 2012, James M. Lawrence wrote:
>       >
>       >> I thought my example was straightforward enough, though as I
>       >> mentioned I wish it were smaller. Following your suggestion, I have
>       >> replaced the queue with a stack. I have also taken out the
>       >> condition-wait function copied from bordeaux-threads. My pop
>       >> function now resembles your consume function.
>       >>
>       >> The same assertion failure occurs.
>       >>
>       >> I am unable to reproduce it with high debug settings, or with
>       >> tracing, or with logging.
>       >>
>       >> The test consists of a pair of worker threads pulling from a task
>       >> queue. We push two tasks: one task returns immediately, the other
>       >> task sleeps for 0.2 seconds (it can be 0.5 seconds or whatever, it
>       >> just takes longer to fail). Since we have two workers, we should
>       >> always obtain the result of the sleeping task second. A signal is
>       >> getting missed, or something.
>       >
>       >
>       > You're assuming that whatever thread pulls the lambda that returns
>       > 'SOONER off of TASKS will push 'SOONER onto RECEIVER before
>       > another thread that pulls another lambda that sleeps for .2 seconds
>       > before returning 'LATER pushes 'LATER on RECEIVER.  That assumption
>       > is likely to hold a high percentage of the time, but I can't think
>       > of anything that guarantees it. (The OS scheduler may have decided
>       > that it should let Emacs re-fontify some buffers for a while, or
>       > let the kernel process all of those network packets that've been
>       > gumming up the works, and when it gets back to CCL it finds that
>       > it's time for the sleeping thread to wake up and it gets scheduled
>       > and pushes 'LATER on RECEIVER before the other thread even wakes
>       > up.)  This kind of scenario isn't as likely as one where 'SOONER is
>       > pushed first, but it's not wildly improbable, either.  It's "likely"
>       > that 'SOONER will be pushed first - maybe even "highly likely".
>       > It's more likely (more highly likely ?) if the sleeping thread
>       > sleeps longer, but non-realtime OSes (like most flavors of Linux,
>       > like OSX, like ...) don't make the scheduling guarantees that you
>       > seem to be assuming.
>       >
>       > While you're thinking "this thread should run before the other one
>       > because it's ready to run and the other one is sleeping", the
>       > scheduler's thinking "that CPU has been really active lately; better
>       > shut it down for a little while so that it doesn't get too hot or
>       > consume too much power", or something equally obscure and
>       > unintuitive.  If you change compiler options, or do printing or
>       > logging (or otherwise change how threads use the CPU cycles they're
>       > given), your code looks different to the scheduler and behaves
>       > differently (in subtle and not-always-predictable ways.)
>       >
>       > Of all the thread-related bugs that've ever existed in CCL, the most
>       > common cause has probably been "code wasn't prepared to deal with
>       > concurrency"; a close second is probably "code is making unwarranted
>       > assumptions about scheduler behavior."  After many years of getting
>       > beaten by those things, I think and hope that I'm more inclined to
>       > question some assumptions that I used to make automatically and
>       > implicitly, and my first reaction is to question the assumption that
>       > you're making.  It's more likely that the thread that doesn't sleep
>       > will push 'SOONER before the thread that sleeps pushes 'LATER, but
>       > nothing guarantees this, lots of factors affect what happens, and
>       > all that I can see is that things that're statistically unlikely
>       > happen occasionally.
>       >
>       > Scheduling behavior is likely beyond the grasp of mere mortals; we
>       > can have a reasonable, largely accurate model of how things will
>       > behave, but we have to bear in mind that that's all we have.
>       >
>       > Semaphores in CCL are very thin wrappers around whatever the OS
>       > provides (POSIX semaphores, Mach semaphores, something-or-other on
>       > Windows.)  If you say "a [semaphore] must be getting dropped",
>       > you're either saying that there's a problem in that very thin
>       > wrapper or that we're all doomed (because what the OS provides
>       > doesn't work), and you're also saying that your code demonstrates
>       > this problem and no one else's notices.  Some or all of those things
>       > could be true, but you're claiming that they must be because you
>       > think that you know which thread will run before which other
>       > thread.  You don't know that; all you really know is that's
>       > probably true.
>       >
>       >
>       >> (defun test ()
>       >>   (let ((tasks (make-stack)))
>       >>     (loop
>       >>        :repeat 2
>       >>        :do (ccl:process-run-function
>       >>             "test"
>       >>             (lambda ()
>       >>               (loop (funcall (or (pop-stack tasks)
>       >>                                  (return)))))))
>       >>     (let ((receiver (make-stack)))
>       >>       (push-stack (lambda ()
>       >>                     (push-stack (progn (sleep 0.2) 'later)
>       >>                                 receiver))
>       >>                   tasks)
>       >>       (push-stack (lambda ()
>       >>                     (push-stack 'sooner receiver))
>       >>                   tasks)
>       >>       (let ((result (pop-stack receiver)))
>       >>         (assert (eq 'sooner result)))
>       >>       (let ((result (pop-stack receiver)))
>       >>         (assert (eq 'later result))))
>       >>     (push-stack nil tasks)
>       >>     (push-stack nil tasks))
>       >>   (format t "."))
>       >>
>       >> (defun run ()
>       >>   (loop (test)))
>       >> _______________________________________________
>       >> Openmcl-devel mailing list
>       >> Openmcl-devel at clozure.com
>       >> http://clozure.com/mailman/listinfo/openmcl-devel
>       >>
>       >>
>       >