[Openmcl-devel] process-enable issue
David Rager
ragerdl at cs.utexas.edu
Mon Aug 4 13:43:19 PDT 2008
I didn't test it extensively, but the updated sources+build seem to be
working. Thanks~
On Fri, Jul 25, 2008 at 3:46 PM, Gary Byers <gb at clozure.com> wrote:
> My mistake: if you just do:
>
> ? (make-process "foo")
>
> the process will run a little bit of code, add itself to the list of
> all processes, then signal a semaphore and wait to be preset and
> enabled.
>
> In David's case, the creating thread's wait had timed out, but by the
> time he did :proc, interrupted the waiting thread, and printed a
> backtrace, the thread was initialized and ready to go, and its
> whostate was "Active". That's a change in how whostates are
> implemented; in 1.1, the newly-reset thread would have reported itself
> as "Reset" instead of "Active", and the former's more accurate. The
> thread isn't really "Active" - it's still waiting to be preset and
> enabled - and I started postulating that the thread had somehow been
> enabled due to very low-level wires getting crossed somewhere.
>
> So, there are two bugs here:
>
> 1) the whole idea of a timeout in PROCESS-ENABLE is wrong (since we
> don't generally know how long it'll take for the target thread to
> get ready to run), and we should just wait indefinitely.
>
> 2) a newly-created or newly-reset thread should not have a whostate of
> "Active"; that's an unintentional change which can cause at least one
> person (the person who made the change) to get very confused.
>
> Sorry; will fix.
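
For reference, the create/preset/enable sequence described above can be sketched as follows. This is only a minimal sketch, assuming a Clozure CL image in which the thread functions are exported from the CCL package; the name "foo" comes from the message above, and the SLEEP body is a placeholder rather than anything from the original report.

```lisp
;; Sketch only: assumes Clozure CL (OpenMCL); not taken from the CCL sources.
(let ((p (ccl:make-process "foo")))              ; created, waiting to be preset and enabled
  ;; Print the whostate of the freshly created process.
  (format t "~a: ~a~%" (ccl:process-name p) (ccl:process-whostate p))
  (ccl:process-preset p (lambda () (sleep 5)))   ; placeholder function for the thread to run
  (ccl:process-enable p)                         ; allow the thread to start running it
  ;; Print the whostate again once the process has been enabled.
  (format t "~a: ~a~%" (ccl:process-name p) (ccl:process-whostate p)))
```

Which whostate string is printed at each step depends on the CCL version, as discussed above, so no expected output is shown.
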
>
> On Fri, 25 Jul 2008, David Rager wrote:
>
>> In the case described, when I do (:y 35) and type :go (or whatever made the
>> lisp system ignore the warning), IIRC, it all worked. Therefore, IIRC,
>> it's probably the latter, where one second isn't enough (or something new
>> is occurring to make threads not swap in as much).
>>
>> The thing that may be indicative that it's not an OS problem is that this
>> just started happening when I upgraded to the RC version of CCL (RC 1.2?).
>> I can inquire of our IT department if you would find whether there was an
>> OS change during this period to be relevant information. RC 1.2 fixed
>> another OpenMCL problem (which I was quite pleased about), so it wasn't
>> like I could just keep using the old OpenMCL.
>>
>> At least now our group is no longer the only group seeing and reporting
>> this behavior.
>>
>> On Fri, Jul 25, 2008 at 1:25 PM, Gary Byers <gb at clozure.com> wrote:
>>
>>> In the original bug report, the backtrace for what was thread #35 showed
>>>
>>> (2AAAAD619B18) : 0 (PROCESS-ENABLE #<PROCESS Worker thread(38) [Active]
>>>     #x300043A1C8ED> [...]) 405
>>> (2AAAAD619B68) : 1 (%PROCESS-RUN-FUNCTION '(:NAME "Worker thread")
>>>     #<COMPILED-LEXICAL-CLOSURE (:INTERNAL ACL2::RUN-THREAD) #x300043A1CD7F>
>>>     NIL) 1373
>>> (2AAAAD619C58) : 2 (PROCESS-RUN-FUNCTION "Worker thread"
>>>     #<COMPILED-LEXICAL-CLOSURE (:INTERNAL ACL2::RUN-THREAD) #x300043A1CD7F>
>>>     [...]) 213
>>>
>>> and :proc showed
>>>
>>> 38 : Worker thread [Active]
>>> 35 : Worker thread [semaphore wait] (Requesting terminal input)
>>> 14 : Worker thread [semaphore wait]
>>> 1 : -> listener [Active]
>>> 0 : Initial [Active]
>>>
>>> In other words, thread 35 created thread 38 and was waiting for it
>>> to signal a semaphore that would indicate that it's reset itself
>>> and is ready to be enabled (given a function to run). :PROC shows
>>> that thread 38 is already running, which doesn't make much sense.
>>> The Linux kernel that David Rager was running was one that allegedly
>>> had just fixed a bug which could cause the wrong thread to be
>>> awakened via FUTEX_WAIT, and it seemed plausible that that bug hadn't
>>> really been fixed there. The case that failed reliably for David
>>> on the machine that David was using worked reliably for me, similar
>>> cases seemed to work for others, and blaming this on something at
>>> the OS level makes more sense than anything else that I can think
>>> of. (Another fuzzy explanation is that malloc() - when called
>>> from two threads at the same time - returned the same block of
>>> memory to both callers because of a locking problem, so two
>>> threads wound up sharing the same "pointer to semaphore".)
>>>
>>> There's a separate issue in that PROCESS-ENABLE waits for the target
>>> thread to indicate that it's "ready" with a timeout of 1 second.
>>> That's usually long enough, but it's entirely arbitrary (how long
>>> it actually takes depends on the load on and the whims of the
>>> scheduler.) Taking longer than a second might indicate that the
>>> newly-created thread isn't getting enough CPU time to signal its
>>> readiness to run. The whole notion of having a timeout for
>>> something that can take an indeterminate amount of time is
>>> questionable, so it probably makes sense to not use a one-second
>>> timeout in PROCESS-ENABLE by default, at the very least.
>>>
>>> Can you tell whether it was the first case (where PROCESS-ENABLE
>>> was waiting to enable a thread that - somehow - seems to have
>>> already been enabled) or the second (the one-second timeout is
>>> too short, and quite possibly the entire idea of a timeout is
>>> misguided)?
>>>
>>> In the former case, the thread being enabled would be on the
>>> list returned by (ALL-PROCESSES) or in the output displayed
>>> by :PROC, and in the latter case it wouldn't.
>>>
>>> On Fri, 25 Jul 2008, Milan Jovanovic wrote:
>>>
>>>> Hi, I have problems with multi-threading on Linux; I think it's the same
>>>> as "http://trac.clozure.com/openmcl/ticket/297".
>>>> First it was "Unable to enable process #<PROCESS ...have been trying for
>>>> 1 seconds" and an inferior-list segmentation fault after 2-3 hours of
>>>> running (this was on SUSE LINUX 10.0 X86-64 2.6.13-15-smp).
>>>>
>>>> After Gary Byers' suggestion that it may be a Linux kernel bug, I tried
>>>> on SUSE Server 10 (x86_64) - kernel 2.6.24. After more than a day of
>>>> running with no errors I saw one more "Unable to enable process #<PROCESS
>>>> ...have been trying for 1 seconds", but this time no segmentation fault.
>>>> So I'm asking: is it a problem/bug if this happens, or only if it
>>>> happens with a segmentation fault following?
>>>>
>>>> By the way, I tried the code on SBCL to be sure that it's not something
>>>> there, and it has been running for a couple of days with no problems.
>>>>
>>>> Thanks
>>>> Best, Milan
>>>>
>>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>>
>>>
>>
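
The readiness handshake discussed in the quoted messages (the creating thread waits on a semaphore that the new thread signals) can be modelled at user level. The following is a minimal sketch, assuming Clozure CL's documented semaphore functions; START-AND-WAIT and its TIMEOUT keyword are hypothetical names used for illustration, and this is not the code inside PROCESS-ENABLE.

```lisp
;; Sketch only: a user-level model of the handshake, assuming Clozure CL.
;; START-AND-WAIT is a hypothetical helper, not part of CCL.
(defun start-and-wait (thunk &key timeout)
  "Run THUNK in a new thread and wait until that thread signals readiness.
With TIMEOUT (in seconds), return NIL if the signal does not arrive in time;
without it, wait indefinitely and return T."
  (let ((ready (ccl:make-semaphore)))
    (ccl:process-run-function "worker"
                              (lambda ()
                                (ccl:signal-semaphore ready) ; announce readiness
                                (funcall thunk)))
    (if timeout
        ;; Timed wait: can fail under load even though nothing is wrong.
        (ccl:timed-wait-on-semaphore ready timeout)
        ;; Indefinite wait: returns only once the worker has signalled.
        (progn (ccl:wait-on-semaphore ready) t))))
```

Calling (start-and-wait (lambda () (sleep 1)) :timeout 1) may return NIL on a heavily loaded machine simply because the worker was not scheduled in time, which is the objection to the one-second default raised above; calling it without :timeout waits as long as the scheduler needs.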