[Openmcl-devel] CCL crash on windows (with multithreading and ffi callbacks)

Tue May 3 15:27:39 PDT 2011

On Tue, 3 May 2011, David Lichteblau wrote:

> Hi Anton,
>
> Quoting Anton Vodonosov (avodonosov at yandex.ru):
>>> So, the problem's either the callback, or it isn't ...
>>
>> Yes, precisely )
>
> if we have stack or heap corruption due to bugs in CL+SSL, it wouldn't
> be out of the ordinary for some Lisp/OS/ISA combination to be affected
> with a much larger likelihood than others.  It's easy to think that
> "this happens only on CCL/Win32, so it must be a CCL bug, right?" when
> in reality SBCL and CCL/Linux might just happen to not notice the
> corruption.
>
> [...]
>>> Is the symptom consistently a crash into the kernel debugger with a
>>> complaint that an exception occurred in foreign code (it may say
>>> "on foreign stack") ? ?Or does it die in a variety of ways ?
>>> If it does die in foreign code, please send me the output of the
>>> kernel debugger's 'r' command. ?I probably won't be able to tell
>>> what foreign code it's in, but might be able to tell if it's in
>>> CCL's GC or elsewhere.
>>
>> It doesn't fall into the kernel debugger. Windows just shows a message
>> box like "The program wx86cl64.exe is crashed" (it's an approximate translation,
>> the actual message is in Russian). and just removes the process.
>> Nothing appears in the CCL console.
>>
>> It is always the same way.
>
> After very, very brief testing here my results with 32bit CCL on Windows
> 7, in the vague hope that they might be useful to someone:
>
>  - With CCL 1.6 and a per-thread-per-connection hunchentoot taskmaster
>    it is dying with an out-of-memory problem soon after the first
>    1000something threads have been created.
>
>    This is obviously not the problem we are looking for, and with 1 MB
>    stack space per thread, and 1000 threads started, it's to be
>    expected that the address room and/or actual memory in my VM would
>    be getting crowded on a 32bit system.

CCL on 32-bit x86 uses three stacks (whose total size defaults to ~2.5MB) per
thread.  I'd be a little surprised if you could actually create 1000 threads
and stacks on a 32-bit machine; some of the 1000 threads that you see may
be in various stages of creation (and it's not likely that many of them are
getting much in the way of CPU time.)

My understanding is that hunchentoot will create a new thread any time that
a request comes in and there's no available worker thread to process it, and
that nothing tries to limit thread creation in this context.  (So, if worker
threads block waiting on a lock that's not being freed - for instance - more
threads will be created and presumably wait on the same lock, and things
will just continue in this vein until some sort of resource exhaustion occurs.)
That understanding is second-hand and anectdotal, but it seems consistent
with what I've seen and heard about when looking at hunchentoot core dumps.
(In other words, a large number of threads may be a symptom of a problem
which prevents a more reasonable number of threads from making progress and
of hunchentoot's behavior in that situation.)

>
>    I don't know why those threads aren't being cleaned up properly, but
>    there was recently talk on the list about a similar problem, so I
>    upgraded to trunk.  For whatever it's worth, this error didn't
>    happen again.

The problem that was discussed here a few months ago was very OSX-specific
(it had to do with Mach message ports - OS kernel resources - that were associated
with a thread not being freed when the thread exited.)

There -was- a similar problem in CCL on Windows a few years ago: a
Windows thread handle wasn't being freed when the associated thread
exited, and eventually thread creation would fail because the OS ran
out of thread handles.  As far as I know, that's been fixed (a user
reported it and diagnosed it pretty compleltely.)

>
>    To rule out other threading issues, I also switched to
>    single-threaded testing:
>
>  - With CCL trunk: I'm running the benchmark script with -c1, i.e. only
>    a single thread.  And on the server side, the thread id callback only
>    registers a single hunchentoot thread, which makes sense.
>
>    But it dies very quickly with a guard page violation (does that mean
>    stack overflow?):

As far as I know, yes.

In x86 CCL, lisp code an C code run on sepearate stacks.  The lisp stack
contains nothing but tagged lisp objects (even return addresses are tagged)
and the C stack contains - whatever it contains, but none of those things
are tagged lisp objects.  We switch between stacks on foreign function
calls and callbacks (among other things, this makes backtraces hard to
understand) and we also do some bizarre things when exceptions occur
(if the exception mechanism dumps a bunch of context info on the lisp
stack, we copy it to the foreign stack so that the lisp stack remains
easily parsable by the GC.)

I'm not sure that that copying checks to ensure that there's room on
the foreign stack; if that stack's about to overflow, things might
be better if we just tried to say "you're screwed" and exited than
if we demonstrated exactly how and why you were in that state.

As far as I know, when a thread's lisp stack overflows in CCL on
Windows, what happens is pretty well-defined and generally pretty
reliable.  (There's supposed to be and generally is enough stack
space between the point where the overflow is detected/reported and
the actual physical limit of the stack to report the condition and
poke around in a break loop to try to determine why the overflow occurred.)
If the foreign stack overflows, I'm not sure that we try to do anything
to recover from that (and I'm not even sure what recovery would mean),
and that may very well lead to a double-fault (a memory fault occurring
because the handler for a stack overflow condition overflows the stack.)
If that occurs, it wouldn't be surprising if the backtrace looks like
what you saw: we can see a call history back to the point where the
thread reentered user mode after an exception, but can't really see
what the thread was doing when the exception occurred.

I certainly may be mistaken, but circumstantial evidence suggests
that you're seeing an overflow on the foreign stack; that evidence
doesn't say anything at all about the cause of that overflow.  (CCL's
certainly been responsible for that sort of thing at various times:
an exception occurs, the handler calls out to lisp, a bug in the
lisp side of things causes causes an exception and we enter a
death spiral.  About all that we can say in cases like this is that
it's pretty amazing how quickly a modern processor can overflow a
fairly large stack ...

The fact that you can get misbehavior to occur when there should
be only one worker thread running is encouraging: that may at least
make things easier to debug.  I also agree that it's very hard to
know what the culprit or culprits are at this point.  (If this were
a movie, I'd probably want to call CCL in for questioning, try
to spot inconsistencies in its story, and tell it not to leave town
while reluctantly letting it go.  I'd probably repeat that process
with a half dozen or so other suspects, and be absolutely certain
that at least one of them was guilty in this case and that they
were all likely guilty of something ...)

The Windows environment complicates the debugging process somewhat:
"foreign" code may include a mixture of things compiled with MS tools
and things compiled with GNU tools; MS debuggers generally don't
understand symbolic information produced by GNU compilers and vice
versa, and neither debugger can make sense of compiled lisp code.
I don't know exactly  what to suggest to make things easier or how
to achieve this, but I think it's likely that both the case where
a large number of threads was getting created and this apparent
stack overflow are just symptoms of some underlying problem(s) (and
of CCL's/hunchentoot's/other suspects' response to those problems.)
The symptoms - severe as they are - are only interesting to the
extent that they shed light on the original problem (and perhaps
to the responses to them.)

That sounds reasonable, but I don't have a good idea of how to
find those underlying problems.

>
>  Microsoft (R) Windows Debugger Version 6.12.0002.633 X86
>  Copyright (c) Microsoft Corporation. All rights reserved.
>
>  *** wait with pending attach
>  Symbol search path is: *** Invalid ***
>  ****************************************************************************
>  * Symbol loading may be unreliable without a symbol search path.           *
>  * Use .symfix to have the debugger choose a symbol path.                   *
>  * After setting your symbol path, use .reload to refresh symbol locations. *
>  ****************************************************************************
>  Executable search path is:
>  ModLoad: 00010000 000f0000   c:\Program Files\ccl\wx86cl.exe
>
> [... many lines of DLL info elided ...]
>
>  ModLoad: 779e0000 77b1d000   C:\Windows\SYSTEM32\ntdll.dll
>
> [...]
>
>  ntdll!DbgBreakPoint:
>  77a13370 cc              int     3
>
> [let's tell windbg to ignore AccessViolation]
>
>  0:005> sx i av
>  0:005> sx e gp
>  0:005> g
>
> [at this point, ApacheBench makes its connection]
>
>  (b40.b50): Guard page violation - code 80000001 (first chance)
>  First chance exceptions are reported before any exception handling.
>  This exception may be expected and handled.
>  eax=00000000 ebx=03180024 ecx=03180040 edx=00000000 esi=77ab823c edi=0036fd08
>  eip=77a08ab2 esp=0317ffa8 ebp=0318000c iopl=0         nv up ei pl nz na po nc
>  cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202
>  ntdll!TpSetTimer+0x1ac:
>  77a08ab2 56              push    esi
>
> [we're now in what windbg calls thread 7, and it certainly doesn't like
> the stack]
>
>  0:007> k
>  ChildEBP RetAddr
>  WARNING: Stack unwind information not available. Following frames may be wrong.
>  0318000c 00000000 ntdll!TpSetTimer+0x1ac
>
> For comparison, if I set a breakpoint on TpSetTimer before running the
> code, the only call I'm seeing has a good stacktrace (and is not in the
> thread that later fails).
>
>  Breakpoint 0 hit
>  eax=004dc728 ebx=77a0f4f4 ecx=024ffde8 edx=00000004 esi=76a47378 edi=00000000
>  eip=77a08906 esp=024ffde4 ebp=024ffe04 iopl=0         nv up ei pl nz na pe nc
>  cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000206
>  ntdll!TpSetTimer:
>  77a08906 8bff            mov     edi,edi
>  0:002> k
>  *** ERROR: Symbol file could not be found.  Defaulted to export symbols for C:\Windows\system32\RPCRT4.dll -
>  ChildEBP RetAddr
>  WARNING: Stack unwind information not available. Following frames may be wrong.
>  024ffde0 769cce6a ntdll!TpSetTimer
>  024ffe04 77a0f4cb RPCRT4!NdrUserMarshalMemorySize+0x71b
>  024ffe28 77a0e6f9 ntdll!TpDisassociateCallback+0xf4
>  024fff88 76891194 ntdll!RtlIsCriticalSectionLockedByThread+0x474
>  024fff94 77a3b429 kernel32!BaseThreadInitThunk+0x12
>  024fffd4 77a3b3fc ntdll!RtlInitializeExceptionChain+0x63
>  024fffec 00000000 ntdll!RtlInitializeExceptionChain+0x36
>
> d.
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>