[Openmcl-devel] Help with error: [Stacks reset due to overflow.]

Sun May 18 17:06:49 PDT 2003

On Sun, 18 May 2003, Barry Perryman wrote:

> Sorry, I forgot to mention this in the previous email: to start this off,
> bring up two terminals and load this code into both. In one type:
>
> (vending-machine-demo 4000)
>
> and in the other type:
>
> (finger-stress-test "localhost" :queries (list "t1" "t2" "t3" "c1" "c2"
> "h1") :port 4000)
>
> B
>

I didn't see anything funny happen when I ran the server under 0.14
and ran clients in both 0.13.5 and 0.14.  (In both cases, some of the
client threads got "connection reset by peer" errors; the client tries
to establish 10 roughly simultaneous connections and the server's
:BACKLOG defaults to allowing 5 pending (not-yet-accepted) connections
before the OS starts refusing them.  The 0.14 client had this happen
a little more often: 0.14 can create threads and wait for I/O a little
more efficiently, so is more likely to have too many simultaneous
pending connection requests.)
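
To make the backlog point concrete, here is a minimal sketch.  It is
written in Python rather than Lisp, purely as a language-neutral
illustration, and the names make_listener and connect_burst are made up
for this example; they are not part of OpenMCL or of the posted code.
It opens a listening socket with a backlog larger than the burst size
and then makes a burst of connections that the server never accepts,
so all of them sit in the pending queue.

```python
import socket


def make_listener(port: int = 0, backlog: int = 16) -> socket.socket:
    """Return a TCP socket listening on localhost.

    `backlog` is the number of pending (not-yet-accepted) connections
    the OS is asked to queue; port 0 lets the OS pick a free port.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(backlog)
    return srv


def connect_burst(port: int, n: int, timeout: float = 2.0) -> int:
    """Open n client connections without closing any until the end.

    Returns how many connects succeeded.  The server side never calls
    accept() here, so every successful connection stays pending.
    """
    conns = []
    ok = 0
    try:
        for _ in range(n):
            c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            c.settimeout(timeout)
            conns.append(c)
            try:
                c.connect(("127.0.0.1", port))
                ok += 1
            except OSError:
                # Refused, reset, or timed out: the pending queue was full.
                pass
    finally:
        for c in conns:
            c.close()
    return ok
```

With a backlog of 16, a burst of 10 connections fits in the pending
queue.  What the clients see when the queue is too small (a reset, a
refusal, or a timeout) depends on the OS, so the sketch only counts
failures rather than distinguishing them.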

I was about to dismiss this as irreproducible when I decided to try
running the server in 0.13.5 (it wasn't clear to me from my reading
of your first message that this was where the problem was).  I got
the "stacks reset" message a few times in the 0.13.5 server, and
things started to ring a bell.

About a year ago, some people who were working on porting Portable
AllegroServe to OpenMCL reported similar problems: they were creating
"watchdog" threads to enforce timeouts on server requests; the problems
seemed to be triggered by too many of these watchdog threads exiting
at around the same time.  In the process of looking into that, we
determined that (even if it worked reliably) creating and killing
short-lived threads was an expensive way of enforcing timeouts and
a cheaper mechanism was developed.  The problem that was causing
the stack overflows was never resolved.

In the cooperative scheduler, a thread does part of the work of
shutting itself down and asks the initial thread (which is guaranteed
to exist, more or less ...) to do the rest of it; the request is made
via some combination of PROCESS-INTERRUPT and the lower-level function
STACK-GROUP-INTERRUPT.

I never found out exactly why, but the stack overflow seemed to be
caused by confusion in the code that handles these interrupt requests
in the initial thread.  If several threads started making interrupt
requests at about the same time, STACK-GROUP-INTERRUPT would start
losing it: it would enter some sort of infinite recursion, whereas
one would expect the recursive interrupt handling to be bounded by
the number of pending requests.  The bug seemed to be timing-related;
early versions of PAServe could provoke it pretty reliably, but it
took a while to develop simpler test cases that did so.

PAServe started using another mechanism, and after a few days of poking
around in the debugger I made a few changes in STACK-GROUP-INTERRUPT
and wasn't able to provoke the bug anymore.  I wasn't very confident
that I really understood the problem or had really fixed it, and the
experience helped convince me that the "right" fix was to get rid
of the cooperative scheduler.  I don't think that anyone's reported
this since; I don't know whether that means that the bug is really
hard to provoke or if people just don't push that hard on the
cooperative scheduler in 0.13.  The bug seems to be triggered by
having a number of threads try to exit at about the same time; I'd
guess that if you're able to pool threads in the server (so that
they're rarely created or destroyed) the problem will go away.  As
annoying and confusing as it is, the initial thread seems to eventually
get out of its confused state, stop overflowing its stack, and
finish killing off the threads that it had been asked to kill.
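
The pooling idea can be sketched as follows.  This is again Python used
as a language-neutral illustration, not OpenMCL code; WorkerPool and its
methods are hypothetical names invented for the example.  A fixed set of
worker threads is created once and pulls jobs from a queue, so handling
a request never creates or destroys a thread.

```python
import queue
import threading


class WorkerPool:
    """A fixed set of long-lived worker threads fed from a job queue."""

    def __init__(self, size, handler):
        self._jobs = queue.Queue()
        self._handler = handler
        self._threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(size)
        ]
        for t in self._threads:
            t.start()

    def _run(self):
        while True:
            job = self._jobs.get()
            try:
                if job is None:
                    # Shutdown marker: this worker exits.
                    return
                self._handler(job)
            except Exception:
                # A failing job must not kill the worker thread.
                pass
            finally:
                self._jobs.task_done()

    def submit(self, job):
        """Queue a job; None is reserved as the shutdown marker."""
        self._jobs.put(job)

    def wait(self):
        """Block until every queued job has been processed."""
        self._jobs.join()

    def shutdown(self):
        """Ask each worker to exit, then wait for all of them."""
        for _ in self._threads:
            self._jobs.put(None)
        for t in self._threads:
            t.join()
```

In a server, the accept loop would call submit with each new connection
instead of starting a thread for it.  Threads then exit only at
shutdown, rather than in bursts while requests are being served.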

I wasn't able to see any misbehavior in the server when running it
under 0.14, and I wouldn't have seen this misbehavior in any case:
STACK-GROUP-INTERRUPT is long gone, as are STACK-GROUPs for that
matter.  0.14 threads can exit and clean up after themselves without
help from other threads (the issue there is that it's sometimes
difficult to get a thread's attention and tell it to shut itself
down, but I think that this is getting better).

I was able to get into the kernel debugger while the initial
thread was in a confused, hysterical state.  A backtrace showed:

(#xBFFFF650) #x010EF434 : #<Function SCHEDULER #x050cbe96> +0040
(#xBFFFF660) #x010E866C : #<Function HANDLE-STACK-GROUP-INTERRUPTS #x050c45ae> +0150
(#xBFFFF670) #x010E8604 : #<Function HANDLE-STACK-GROUP-INTERRUPTS #x050c45ae> +00e8
(#xBFFFF680) #x001047B4 : (subprimitive)
(#xBFFFF690) #x010E84EC : #<Function STACK-GROUP-RESUME #x050c4476> +0198
(#xBFFFF6A0) #x010EEA48 : #<Function %ACTIVATE-PROCESS #x050cb5de> +026c
(#xBFFFF6B0) #x001047B4 : (subprimitive)
(#xBFFFF6C0) #x010EF52C : #<Function SCHEDULER #x050cbe96> +0138
(#xBFFFF6D0) #x010EF434 : #<Function SCHEDULER #x050cbe96> +0040
(#xBFFFF6E0) #x010E866C : #<Function HANDLE-STACK-GROUP-INTERRUPTS #x050c45ae> +0150
(#xBFFFF6F0) #x010E8604 : #<Function HANDLE-STACK-GROUP-INTERRUPTS #x050c45ae> +00e8
(#xBFFFF700) #x001047B4 : (subprimitive)
(#xBFFFF710) #x010E84EC : #<Function STACK-GROUP-RESUME #x050c4476> +0198
(#xBFFFF720) #x010EEA48 : #<Function %ACTIVATE-PROCESS #x050cb5de> +026c
(#xBFFFF730) #x001047B4 : (subprimitive)
(#xBFFFF740) #x010EF52C : #<Function SCHEDULER #x050cbe96> +0138
(#xBFFFF750) #x010EF434 : #<Function SCHEDULER #x050cbe96> +0040
(#xBFFFF760) #x010E866C : #<Function HANDLE-STACK-GROUP-INTERRUPTS #x050c45ae> +0150

ad infinitum.

That's at least part of the problem: it's not supposed to be possible
to reenter the scheduler, and it appears that the scheduler's constantly
resuming the initial stack group, which notices that it's got some
interrupt requests to handle, then gets preempted by the scheduler ...

I'll look at it again; I may have missed something obvious last year.
I tend to view this bug as being a good argument for making it more
attractive for people to switch to 0.14.
