[Openmcl-devel] Semaphore troubles

Erik Pearson erik at defunweb.com
Thu May 10 15:13:54 PDT 2012


Ordinarily, extra time for sleeping is a good thing :)

On Thu, May 10, 2012 at 3:08 PM, Gary Byers <gb at clozure.com> wrote:

> Thanks (for finding this and for r'ing tfm); I hadn't realized.
>
> I committed the change (without trying any kind of sanity-checking) to the
> trunk and 1.8 branch.
>
>
> On Thu, 10 May 2012, Erik Pearson wrote:
>
>> Hi Gary,
>> Note this passage in the definition of nanosleep:
>>
>> If the interval specified in req is not an exact multiple of the
>> granularity underlying clock (see time(7)), then the interval will be
>> rounded up to the next multiple.
>> http://linux.die.net/man/2/nanosleep
>>
>> or from glibc docs:
>>
>> The actual elapsed time of the sleep interval might be longer since the
>> system rounds the elapsed time you request up to the next integer multiple
>> of the actual resolution the system can deliver.
>>
>> http://www.gnu.org/software/libc/manual/html_node/Sleeping.html
>>
>> So here is the crux: The time remaining after an interrupted nanosleep may
>> actually be greater than the requested time if the interrupt happens right
>> after the timer starts, within the first increment of the resolution of
>> the timer. (Printing debugging text from within the workaround code of
>> %nanosleep proves that this is indeed the cause of our problems.)
>>
>> With 100 captures of the remaining time being greater than requested time
>> (of 0.01 sec, or 10,000,000 ns), the range was from 10,000,551 to
>> 10,045,251 ns (and for 750 captures of 0.001 sec sleeps, from 1,000,272
>> to 1,044,971 ns). So on my computer, and assuming that the timer is being
>> set to requested plus at most one increment of the timer resolution, my
>> timer is about 50,000 ns resolution. So if the interrupt happens within
>> 50,000 ns of the timer being set, the workaround code will cause the timer
>> to exit prematurely.
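
To make the failure concrete, here is a minimal sketch of the comparison
performed by the workaround (the test under the ";; x86-64 Leopard bug"
comment in %nanosleep), written as portable Common Lisp over plain integers
rather than CCL :timespec records. The function name REMAINING-LOOKS-SANE-P
is hypothetical and exists only for this illustration; the nanosecond values
are the ones captured above, plus one made-up "interrupted late" case.

```lisp
;; Hypothetical helper (not part of CCL): mirrors the sanity check in
;; %nanosleep, using integers instead of :timespec fields.
;; (ASEC, ANSEC) is the requested time; (BSEC, BNSEC) is the remaining
;; time reported by the interrupted nanosleep call.
(defun remaining-looks-sane-p (asec ansec bsec bnsec)
  "True when the remaining time is non-negative and strictly less than
the requested time, i.e. when the loop would go back to sleep."
  (and (>= bsec 0)
       (or (< bsec asec)
           (and (= bsec asec)
                (< bnsec ansec)))))

;; Requested 0.01 s, interrupted late with 4 ms left (illustrative
;; value): the check passes and the loop sleeps again.
(remaining-looks-sane-p 0 10000000 0 4000000)   ; => T

;; Requested 0.01 s, interrupted right after the timer started, so the
;; remaining time was rounded up past the request (smallest captured
;; value): the check fails and the loop returns early.
(remaining-looks-sane-p 0 10000000 0 10000551)  ; => NIL
```

The second call is the case described above: the remaining time is not
bogus, just rounded up, but the check cannot tell the difference.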
>>
>> So I'd vote for either conditionalizing the code for that version of OS X
>> (where maybe having timers fail early is better than some other disaster,
>> although I'm sure the workaround code can be tweaked to work better in
>> that situation.)
>>
>>
>> Erik.
>>
>> On Thu, May 10, 2012 at 12:28 PM, Gary Byers <gb at clozure.com> wrote:
>>      If anyone feels like testing a slightly different version of the
>>      patch ...
>>
>>      The current definition of CCL::%NANOSLEEP in
>>      ccl/level-1/l1-lisp-threads.lisp looks like:
>>
>>      #-windows-target
>>      (defun %nanosleep (seconds nanoseconds)
>>        (with-process-whostate ("Sleep")
>>          (rlet ((a :timespec)
>>                 (b :timespec))
>>            (setf (pref a :timespec.tv_sec) seconds
>>                  (pref a :timespec.tv_nsec) nanoseconds)
>>            (let* ((aptr a)
>>                   (bptr b))
>>              (loop
>>                (let* ((result
>>                        (external-call #+darwin-target "_nanosleep"
>>                                       #-darwin-target "nanosleep"
>>                                       :address aptr
>>                                       :address bptr
>>                                       :signed-fullword)))
>>                  (declare (type (signed-byte 32) result))
>>                  (if (and (< result 0)
>>                           (eql (%get-errno) (- #$EINTR)))
>>                    ;; x86-64 Leopard bug.
>>                    (let* ((asec (pref aptr :timespec.tv_sec))
>>                           (bsec (pref bptr :timespec.tv_sec)))
>>                      (if (and (>= bsec 0)
>>                               (or (< bsec asec)
>>                                   (and (= bsec asec)
>>                                        (< (pref bptr :timespec.tv_nsec)
>>                                           (pref aptr :timespec.tv_nsec)))))
>>                        (psetq aptr bptr bptr aptr)
>>                        (return)))
>>                    (return))))))))
>>
>>
>> (It should look like that in all relevant recent versions of CCL; the
>> code hasn't changed in years.)  Erik suggested replacing the LET* which
>> follows the comment ";; x86-64 Leopard bug" with just the PSETQ (so that
>> we do the PSETQ and try to sleep a little longer, unconditionally); I'm
>> curious about whether it would also work if we did the sanity-checking a
>> little more rigorously, by replacing the:
>>
>>     (< (pref bptr :timespec.tv_nsec)
>>        (pref aptr :timespec.tv_nsec)))))
>>
>> with
>>
>>     (<= (pref bptr :timespec.tv_nsec)
>>         (pref aptr :timespec.tv_nsec)))))
>>
>>
>> As things have stood, if the "seconds" and "nanoseconds" fields in both
>> "a" and "b" are exactly equal, we won't go back to sleep at all (and this
>> could conceivably happen if we get interrupted before nanosleep goes to
>> sleep.)
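
As a rough illustration of the difference between the two tests, the sketch
below models the comparison on plain integers instead of :timespec fields.
The function SLEEP-AGAIN-P and its NSEC-TEST keyword are hypothetical names
introduced only for this example; they are not part of CCL.

```lisp
;; Hypothetical model of the sanity check in %nanosleep (not CCL code).
;; (ASEC, ANSEC) is the requested time, (BSEC, BNSEC) the remaining time
;; reported by an interrupted nanosleep.  NSEC-TEST is the predicate
;; applied to the nanosecond fields when the seconds fields are equal.
(defun sleep-again-p (asec ansec bsec bnsec &key (nsec-test #'<))
  (and (>= bsec 0)
       (or (< bsec asec)
           (and (= bsec asec)
                (funcall nsec-test bnsec ansec)))))

;; Interrupted before any time elapsed: remaining equals requested.
(sleep-again-p 2 0 2 0)                  ; => NIL, loop returns early
(sleep-again-p 2 0 2 0 :nsec-test #'<=)  ; => T, loop sleeps again

;; Remaining larger than requested (0.01 s requested, 10,000,551 ns
;; reported, a value from the captures elsewhere in this thread):
;; neither predicate lets the loop sleep again.
(sleep-again-p 0 10000000 0 10000551 :nsec-test #'<=)  ; => NIL
```

Under this model, the <= variant covers the exactly-equal case but not a
remaining time that exceeds the request.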
>>
>> If that change fixes the problem that James reported, I'm marginally more
>> comfortable with it than I am with removing the sanity checking
>> altogether, simply because:
>>
>>  - I don't know if the bug that the sanity-checking was intended to
>>    defend against is still present in some supported version of OSX
>>  - if it is, it's really nasty.  IIRC, it was present in pre-releases
>>    of 10.5, I reported it to Apple (and I think that my bug report was
>>    marked as a duplicate), it wasn't fixed in the final 10.5, and ...
>>    that was 5 years ago and I don't know what's happened since.
>>
>>
>> Thanks.
>>
>>
>>
>>
>> On Thu, 10 May 2012, James M. Lawrence wrote:
>>
>>      On Thu, May 10, 2012 at 12:16 PM, Erik Pearson
>>      <erik at defunweb.com> wrote:
>>      Hi James,
>>
>>      I'm sure Gary et al. will have a fix soon -- today if past
>>      performance is any measure -- but for now try this. In your ccl
>>      directory (/opt/ccl/ccl in my system, because I install my ccl from
>>      svn in /opt/ccl), in the level-1 directory, in the file
>>      l1-lisp-threads.lisp, hunt down and replace the %nanosleep function
>>      with this:
>>
>> #-windows-target
>> (defun %nanosleep (seconds nanoseconds)
>>   (with-process-whostate ("Sleep")
>>     (rlet ((a :timespec)
>>            (b :timespec))
>>       (setf (pref a :timespec.tv_sec) seconds
>>             (pref a :timespec.tv_nsec) nanoseconds)
>>       (let ((aptr a)
>>             (bptr b))
>>         (loop
>>           (let ((result
>>                  (external-call #+darwin-target "_nanosleep"
>>                                 #-darwin-target "nanosleep"
>>                                 :address aptr
>>                                 :address bptr
>>                                 :signed-fullword)))
>>             (declare (type (signed-byte 32) result))
>>             (if (and (< result 0)
>>                      (eql (%get-errno) (- #$EINTR)))
>>                 (psetq aptr bptr bptr aptr)
>>                 (return))))))))
>>
>> All I did was remove the OS X workaround code. I'm working with the
>> up-to-date trunk, v 1.9.
>>
>>
>> That appears to have fixed it. I went back and forth between the old
>> and new %nanosleep for good measure. Congrats to all.
>>
>> Using latest lx86cl in trunk with 2 second sleeps.
>>
>> With old %nanosleep:
>>
>> fail at 18 iterations
>> fail at 32
>> fail at 46
>> fail at 11
>> fail at 74
>>
>> With new %nanosleep:
>>
>> no fail after 166 iterations
>> restart CCL
>> no fail after 189
>> restart CCL
>> no fail after 221
>> restart CCL
>> no fail after 159
>> restart CCL
>> no fail after 653 and still running
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel