[Openmcl-devel] Semaphore troubles

Gary Byers gb at clozure.com
Thu May 10 15:08:23 PDT 2012


Thanks (for finding this and for r'ing tfm); I hadn't realized.

I committed the change (without trying any kind of sanity-checking) to the
trunk and the 1.8 branch.

On Thu, 10 May 2012, Erik Pearson wrote:

> Hi Gary,
> Note this passage in the definition of nanosleep:
> 
> If the interval specified in req is not an exact multiple of the granularity
> underlying clock (see time(7)), then the interval will be rounded up to the
> next multiple.
> http://linux.die.net/man/2/nanosleep
> 
> or from glibc docs:
> 
> The actual elapsed time of the sleep interval might be longer since the
> system rounds the elapsed time you request up to the next integer multiple
> of the actual resolution the system can deliver.
> http://www.gnu.org/software/libc/manual/html_node/Sleeping.html
> 
> So here is the crux: The time remaining after an interrupted nanosleep may
> actually be greater than the requested time if the interrupt happens right
> after the timer starts, within the first increment of the resolution of the
> timer. (Printing debugging text from within the workaround code of
> %nanosleep proves that this is indeed the cause of our problems.)
> 
> With 100 captures of the remaining time being greater than requested time
> (of 0.01 sec, or 10,000,000 ns), the range was from 10,000,551
> to 10,045,251 ns (and for 750 captures of 0.001 sec sleeps, from 1,000,272 to
> 1,044,971 ns). So on my computer, and assuming that the timer is being set to
> requested plus at most one increment of the timer resolution, my timer is
> about 50,000ns resolution. So if the interrupt happens within 50,000ns of
> the timer being set, the workaround code will cause the timer to exit
> prematurely.
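
A quick check of the arithmetic above, as a minimal sketch in portable
Common Lisp. The names *observations* and overshoot-range are made up for
illustration and are not part of CCL; the figures are the ones reported above.

```lisp
;; Figures quoted above, in nanoseconds:
;; (requested min-remaining max-remaining)
(defparameter *observations*
  '((10000000 10000551 10045251)    ; 0.01 sec sleeps, 100 captures
    (1000000 1000272 1044971)))     ; 0.001 sec sleeps, 750 captures

(defun overshoot-range (requested min-remaining max-remaining)
  "Return, as two values, by how much the smallest and largest reported
remaining times exceeded REQUESTED."
  (values (- min-remaining requested)
          (- max-remaining requested)))

(dolist (obs *observations*)
  (multiple-value-bind (lo hi) (apply #'overshoot-range obs)
    (format t "requested ~:D ns: overshoot ~:D to ~:D ns~%"
            (first obs) lo hi)))
```

The overshoot comes out as 551 to 45,251 ns for the 0.01 sec sleeps and
272 to 44,971 ns for the 0.001 sec sleeps, which is consistent with the
estimate of a resolution of about 50,000 ns.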
> 
> So I'd vote for either conditionalizing the code for that version of OS X
> (where maybe having timers fail early is better than some other disaster,
> although I'm sure the workaround code can be tweaked to work better in that
> situation).
> 
> Erik.
> 
> On Thu, May 10, 2012 at 12:28 PM, Gary Byers <gb at clozure.com> wrote:
>       If anyone feels like testing a slightly different version of the
>       patch ...
>
>       The current definition of CCL::%NANOSLEEP in
>       ccl/level-1/l1-lisp-threads.lisp
>       looks like:
>
>       #-windows-target
>       (defun %nanosleep (seconds nanoseconds)
>         (with-process-whostate ("Sleep")
>           (rlet ((a :timespec)
>                  (b :timespec))
>             (setf (pref a :timespec.tv_sec) seconds
>                   (pref a :timespec.tv_nsec) nanoseconds)
>             (let* ((aptr a)
>                    (bptr b))
>               (loop
>                 (let* ((result
>                         (external-call #+darwin-target "_nanosleep"
>                                        #-darwin-target "nanosleep"
>                                        :address aptr
>                                        :address bptr
>                                        :signed-fullword)))
>                   (declare (type (signed-byte 32) result))
>                   (if (and (< result 0)
>                            (eql (%get-errno) (- #$EINTR)))
>                     ;; x86-64 Leopard bug.
>                     (let* ((asec (pref aptr :timespec.tv_sec))
>                            (bsec (pref bptr :timespec.tv_sec)))
>                       (if (and (>= bsec 0)
>                                (or (< bsec asec)
>                                    (and (= bsec asec)
>                                         (< (pref bptr :timespec.tv_nsec)
>                                            (pref aptr :timespec.tv_nsec)))))
>                         (psetq aptr bptr bptr aptr)
>                         (return)))
>                     (return))))))))
> 
> (It should look like that in all relevant recent versions of CCL; the code
> hasn't changed in years.)  Erik suggested replacing the LET* which follows
> the comment ";; x86-64 Leopard bug" with just the PSETQ (so that we do the
> PSETQ and try to sleep a little longer, unconditionally); I'm curious about
> whether it would also work if we did the sanity-checking a little more
> rigorously, by replacing the:
>
>     (< (pref bptr :timespec.tv_nsec)
>        (pref aptr :timespec.tv_nsec)))))
>
> with
>
>     (<= (pref bptr :timespec.tv_nsec)
>         (pref aptr :timespec.tv_nsec)))))
> 
> As things have stood, if the "seconds" and "nanoseconds" fields in both
> "a" and "b" are exactly equal, we won't go back to sleep at all (and this
> could conceivably happen if we get interrupted before nanosleep goes to
> sleep.)
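
To make the difference between the two comparisons concrete, here is a
minimal sketch in portable Common Lisp. It is only a model of the check in
the ";; x86-64 Leopard bug" branch: remaining-plausible-p and its nsec-test
keyword are hypothetical names for illustration, not part of CCL, and the
arguments stand in for the tv_sec and tv_nsec fields of "a" (requested) and
"b" (remaining).

```lisp
;; Hypothetical helper, not part of CCL: a pure model of the comparison
;; that decides whether %nanosleep swaps the pointers and sleeps again.
(defun remaining-plausible-p (asec ansec bsec bnsec &key (nsec-test #'<))
  "True if the remaining time (BSEC, BNSEC) is non-negative and passes the
comparison against the requested time (ASEC, ANSEC). NSEC-TEST is the
predicate applied to the nanosecond fields when the seconds are equal."
  (and (>= bsec 0)
       (or (< bsec asec)
           (and (= bsec asec)
                (funcall nsec-test bnsec ansec)))))

;; Requested time is 0.01 sec (10,000,000 ns) in every call below.
(remaining-plausible-p 0 10000000 0 4000000)                   ; => T
(remaining-plausible-p 0 10000000 0 10000000)                  ; => NIL
(remaining-plausible-p 0 10000000 0 10000000 :nsec-test #'<=)  ; => T
(remaining-plausible-p 0 10000000 0 10000551 :nsec-test #'<=)  ; => NIL
```

In this model the <= variant accepts the exactly-equal case, while a
remaining time that is larger than the request (the last call) is rejected
by either comparison, so the loop would still return early in that case.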
> 
> If that change fixes the problem that James reported, I'm marginally more
> comfortable with it than I am with removing the sanity checking altogether,
> simply because:
>
>  - I don't know if the bug that the sanity-checking was intended to defend
>    against is still present in some supported version of OSX
>  - if it is, it's really nasty.  IIRC, it was present in pre-releases of
>    10.5, I reported it to Apple (and I think that my bug report was marked
>    as a duplicate), it wasn't fixed in the final 10.5, and ... that was
>    5 years ago and I don't know what's happened since.
> 
> Thanks.
> 
> 
> 
> 
> On Thu, 10 May 2012, James M. Lawrence wrote:
>
>       On Thu, May 10, 2012 at 12:16 PM, Erik Pearson
>       <erik at defunweb.com> wrote:
>       Hi James,
>
>       I'm sure Gary et al. will have a fix soon -- today if past
>       performance is any measure -- but for now try this. In your ccl
>       directory (/opt/ccl/ccl in my system, because I install my ccl from
>       svn in /opt/ccl), in the level-1 directory, in the file
>       l1-lisp-threads.lisp, hunt down and replace the %nanosleep function
>       with this:
>
>       #-windows-target
>       (defun %nanosleep (seconds nanoseconds)
>         (with-process-whostate ("Sleep")
>           (rlet ((a :timespec)
>                  (b :timespec))
>             (setf (pref a :timespec.tv_sec) seconds
>                   (pref a :timespec.tv_nsec) nanoseconds)
>             (let ((aptr a)
>                   (bptr b))
>               (loop
>                 (let ((result
>                        (external-call #+darwin-target "_nanosleep"
>                                       #-darwin-target "nanosleep"
>                                       :address aptr
>                                       :address bptr
>                                       :signed-fullword)))
>                   (declare (type (signed-byte 32) result))
>                   (if (and (< result 0)
>                            (eql (%get-errno) (- #$EINTR)))
>                     (psetq aptr bptr bptr aptr)
>                     (return))))))))
>
>       All I did was remove the OS X workaround code. I'm working with the
>       up-to-date trunk, v 1.9.
> 
> 
> That appears to have fixed it. I went back and forth between the old
> and new %nanosleep for good measure. Congrats to all.
> 
> Using latest lx86cl in trunk with 2 second sleeps.
> 
> With old %nanosleep:
> 
> fail at 18 iterations
> fail at 32
> fail at 46
> fail at 11
> fail at 74
> 
> With new %nanosleep:
> 
> no fail after 166 iterations
> restart CCL
> no fail after 189
> restart CCL
> no fail after 221
> restart CCL
> no fail after 159
> restart CCL
> no fail after 653 and still running
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel