[Openmcl-devel] Semaphore troubles

Erik Pearson erik at defunweb.com
Thu May 10 14:39:26 PDT 2012


Hi Gary,

Note this passage in the definition of nanosleep:

If the interval specified in req is not an exact multiple of the
granularity underlying clock (see time(7), http://linux.die.net/man/7/time),
then the interval will be rounded up to the next multiple.
http://linux.die.net/man/2/nanosleep

or from glibc docs:

The actual elapsed time of the sleep interval might be longer since the
system rounds the elapsed time you request up to the next integer multiple
of the actual resolution the system can deliver.
http://www.gnu.org/software/libc/manual/html_node/Sleeping.html
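
As a rough illustration of that round-up rule (not the kernel's actual
code), the sketch below rounds a requested interval up to the next multiple
of a clock granularity. The function name and the 50,000 ns granularity are
assumptions chosen only for the example.

```lisp
;; Hypothetical helper: models "rounded up to the next multiple".
;; Nothing here is part of CCL or of any libc.
(defun round-up-to-granularity (interval-ns granularity-ns)
  "Round INTERVAL-NS up to the next integer multiple of GRANULARITY-NS."
  ;; CEILING on two integers returns, as its first value, the quotient
  ;; rounded toward positive infinity.
  (* granularity-ns (ceiling interval-ns granularity-ns)))

;; With an assumed granularity of 50,000 ns:
(round-up-to-granularity 10000000 50000) ; => 10000000 (already a multiple)
(round-up-to-granularity 10000001 50000) ; => 10050000
```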

So here is the crux: The time remaining after an interrupted nanosleep may
actually be greater than the requested time if the interrupt happens right
after the timer starts, within the first increment of the resolution of the
timer. (Printing debugging text from within the workaround code of
%nanosleep proves that this is indeed the cause of our problems.)
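
To make the failure concrete, here is a minimal sketch of the sanity check
from the quoted %nanosleep, written over plain integers instead of
:timespec records so it runs in any Common Lisp. The name
REMAINING-PLAUSIBLE-P is hypothetical; the comparison itself mirrors the
workaround code.

```lisp
;; Portable model of the "x86-64 Leopard bug" check. REQ-* is the requested
;; time (a), REM-* is the remaining time reported by nanosleep (b).
;; REMAINING-PLAUSIBLE-P is an illustrative name, not a CCL function.
(defun remaining-plausible-p (req-sec req-nsec rem-sec rem-nsec)
  "True when the remaining time is non-negative and strictly less than the request."
  (and (>= rem-sec 0)
       (or (< rem-sec req-sec)
           (and (= rem-sec req-sec)
                (< rem-nsec req-nsec)))))

;; Ordinary interruption: 4 ms left of a 10 ms sleep. The check passes,
;; so the loop swaps the pointers and sleeps again.
(remaining-plausible-p 0 10000000 0 4000000)

;; Interrupted right after the timer started: 10,000,551 ns reported as
;; remaining for a 10,000,000 ns request. The check fails, so the loop
;; takes the (return) branch and the sleep ends early.
(remaining-plausible-p 0 10000000 0 10000551)
```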

In 100 captures where the remaining time was greater than the requested time
(0.01 sec, or 10,000,000 ns), the remaining time ranged from 10,000,551 ns
to 10,045,251 ns (and in 750 captures for 0.001 sec sleeps, from 1,000,272 ns
to 1,044,971 ns). So on my computer, assuming that the timer is set to the
requested time plus at most one increment of the timer resolution, my timer
has a resolution of about 50,000 ns. If the interrupt happens within
50,000 ns of the timer being set, the workaround code will cause the sleep
to end prematurely.
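
The arithmetic behind that estimate can be checked with a small sketch.
OVERSHOOT-RANGE is a hypothetical helper; the inputs are the figures
reported above, and the 50,000 ns resolution remains an estimate inferred
from the largest overshoot, not a measured clock property.

```lisp
;; Hypothetical helper: how far the reported remaining time exceeded
;; the requested time, for the smallest and largest captures.
(defun overshoot-range (requested-ns min-remaining-ns max-remaining-ns)
  "Return a list (MIN MAX) of overshoots in nanoseconds."
  (list (- min-remaining-ns requested-ns)
        (- max-remaining-ns requested-ns)))

(overshoot-range 10000000 10000551 10045251) ; => (551 45251)
(overshoot-range 1000000 1000272 1044971)    ; => (272 44971)
```

Both sleep lengths top out at roughly 45,000 ns of overshoot, which is
consistent with a timer increment of about 50,000 ns.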

So I'd vote for conditionalizing the code for that version of OS X (where
maybe having timers fail early is better than some other disaster), although
I'm sure the workaround code can be tweaked to work better in that
situation.

Erik.

On Thu, May 10, 2012 at 12:28 PM, Gary Byers <gb at clozure.com> wrote:

> If anyone feels like testing a slightly different version of the patch ...
>
> The current definition of CCL::%NANOSLEEP in
> ccl/level-1/l1-lisp-threads.lisp looks like:
>
>
> #-windows-target
> (defun %nanosleep (seconds nanoseconds)
>  (with-process-whostate ("Sleep")
>    (rlet ((a :timespec)
>           (b :timespec))
>      (setf (pref a :timespec.tv_sec) seconds
>            (pref a :timespec.tv_nsec) nanoseconds)
>      (let* ((aptr a)
>             (bptr b))
>        (loop
>          (let* ((result
>
>                  (external-call #+darwin-target "_nanosleep"
>                                 #-darwin-target "nanosleep"
>                                 :address aptr
>                                 :address bptr
>                                 :signed-fullword)))
>            (declare (type (signed-byte 32) result))
>            (if (and (< result 0)
>                     (eql (%get-errno) (- #$EINTR)))
>              ;; x86-64 Leopard bug.
>              (let* ((asec (pref aptr :timespec.tv_sec))
>                     (bsec (pref bptr :timespec.tv_sec)))
>                (if (and (>= bsec 0)
>                         (or (< bsec asec)
>                             (and (= bsec asec)
>                                  (< (pref bptr :timespec.tv_nsec)
>                                     (pref aptr :timespec.tv_nsec)))))
>
>                  (psetq aptr bptr bptr aptr)
>                  (return)))
>              (return))))))))
>
> (It should look like that in all relevant recent versions of CCL; the code
> hasn't changed in years.)  Erik suggested replacing the LET* which follows
> the comment ";; x86-64 Leopard bug" with just the PSETQ (so that we do the
> PSETQ and try to sleep a little longer, unconditionally); I'm curious about
> whether it would also work if we did the sanity-checking a little more
> rigorously, by replacing the:
>
>                                  (< (pref bptr :timespec.tv_nsec)
>                                     (pref aptr :timespec.tv_nsec)))))
>
> with
>
>                                  (<= (pref bptr :timespec.tv_nsec)
>                                      (pref aptr :timespec.tv_nsec)))))
>
> As things have stood, if the "seconds" and "nanoseconds" fields in both
> "a" and "b" are exactly equal, we won't go back to sleep at all (and this
> could conceivably happen if we get interrupted before nanosleep goes to
> sleep.)
>
> If that change fixes the problem that James reported, I'm marginally more
> comfortable with it than I am with removing the sanity checking altogether,
> simply because:
>
>  - I don't know if the bug that the sanity-checking was intended to defend
>   against is still present in some supported version of OSX
>  - if it is, it's really nasty.  IIRC, it was present in pre-releases of
>   10.5, I reported it to Apple (and I think that my bug report was marked
>   as a duplicate), it wasn't fixed in the final 10.5, and ... that was
>   5 years ago and I don't know what's happened since.
>
> Thanks.
>
>
>
>
>
> On Thu, 10 May 2012, James M. Lawrence wrote:
>
>> On Thu, May 10, 2012 at 12:16 PM, Erik Pearson <erik at defunweb.com> wrote:
>>
>>> Hi James,
>>>
>>> I'm sure Gary et al. will have a fix soon -- today if past performance is
>>> any measure -- but for now try this. In your ccl directory (/opt/ccl/ccl in
>>> my system, because I install my ccl from svn in /opt/ccl), in the level-1
>>> directory, in the file l1-lisp-threads.lisp, hunt down and replace the
>>> %nanosleep function with this:
>>>
>>> #-windows-target
>>> (defun %nanosleep (seconds nanoseconds)
>>>   (with-process-whostate ("Sleep")
>>>     (rlet ((a :timespec)
>>>            (b :timespec))
>>>       (setf (pref a :timespec.tv_sec) seconds
>>>             (pref a :timespec.tv_nsec) nanoseconds)
>>>       (let ((aptr a)
>>>             (bptr b))
>>>         (loop
>>>           (let ((result
>>>                  (external-call #+darwin-target "_nanosleep"
>>>                                 #-darwin-target "nanosleep"
>>>                                 :address aptr
>>>                                 :address bptr
>>>                                 :signed-fullword)))
>>>             (declare (type (signed-byte 32) result))
>>>             (if (and (< result 0)
>>>                      (eql (%get-errno) (- #$EINTR)))
>>>               (psetq aptr bptr bptr aptr)
>>>               (return))))))))
>>>
>>>
>>> All I did was remove the OS X workaround code. I'm working with the
>>> up-to-date trunk, v 1.9.
>>>
>>
>> That appears to have fixed it. I went back and forth between the old
>> and new %nanosleep for good measure. Congrats to all.
>>
>> Using latest lx86cl in trunk with 2 second sleeps.
>>
>> With old %nanosleep:
>>
>> fail at 18 iterations
>> fail at 32
>> fail at 46
>> fail at 11
>> fail at 74
>>
>> With new %nanosleep:
>>
>> no fail after 166 iterations
>> restart CCL
>> no fail after 189
>> restart CCL
>> no fail after 221
>> restart CCL
>> no fail after 159
>> restart CCL
>> no fail after 653 and still running
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel