[Openmcl-devel] CCL crash on windows (with multithreading and ffi callbacks)
David Lichteblau
david at lichteblau.com
Tue May 3 09:23:24 PDT 2011
Hi Anton,
Quoting Anton Vodonosov (avodonosov at yandex.ru):
> > So, the problem's either the callback, or it isn't ...
>
> Yes, precisely )
If we have stack or heap corruption due to bugs in CL+SSL, it wouldn't
be out of the ordinary for some Lisp/OS/ISA combination to be affected
far more often than others. It's easy to think that
"this happens only on CCL/Win32, so it must be a CCL bug, right?" when
in reality SBCL and CCL/Linux might just happen not to notice the
corruption.
[...]
> > Is the symptom consistently a crash into the kernel debugger with a
> > complaint that an exception occurred in foreign code (it may say
> > "on foreign stack") ? Or does it die in a variety of ways ?
> > If it does die in foreign code, please send me the output of the
> > kernel debugger's 'r' command. I probably won't be able to tell
> > what foreign code it's in, but might be able to tell if it's in
> > CCL's GC or elsewhere.
>
> It doesn't fall into the kernel debugger. Windows just shows a message
> box like "The program wx86cl64.exe has crashed" (it's an approximate translation,
> the actual message is in Russian) and just removes the process.
> Nothing appears in the CCL console.
>
> It is always the same way.
After very, very brief testing, here are my results with 32bit CCL on
Windows 7, in the vague hope that they might be useful to someone:
- With CCL 1.6 and a one-thread-per-connection hunchentoot taskmaster,
it dies with an out-of-memory problem soon after the first
1000-something threads have been created.
This is obviously not the problem we are looking for: with 1 MB of
stack space per thread and 1000 threads started, it's to be
expected that the address space and/or actual memory in my VM would
be getting crowded on a 32bit system.
I don't know why those threads aren't being cleaned up properly, but
there was recently talk on the list about a similar problem, so I
upgraded to trunk. For whatever it's worth, this error didn't
happen again.
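As a rough back-of-the-envelope check of the estimate above, here is a
minimal Python sketch. It assumes the 1 MB stack figure from this mail
and the default 2 GiB of user-mode address space for a 32bit Windows
process; the function names are made up for illustration, and the
sketch ignores heap, DLLs, and any additional per-thread stacks the
Lisp runtime may reserve.

```python
MIB = 1024 * 1024

def stack_reservation_bytes(threads, stack_mib_per_thread=1):
    """Address space reserved for thread stacks alone (assumed 1 MiB each)."""
    return threads * stack_mib_per_thread * MIB

def fraction_of_address_space(threads,
                              user_space_bytes=2 * 1024 * MIB,
                              stack_mib_per_thread=1):
    """Share of the (assumed 2 GiB) user address space taken by stacks."""
    reserved = stack_reservation_bytes(threads, stack_mib_per_thread)
    return reserved / user_space_bytes

if __name__ == "__main__":
    for n in (100, 1000, 2000):
        share = fraction_of_address_space(n)
        print(f"{n:5d} threads: {stack_reservation_bytes(n) // MIB} MiB "
              f"of stack, {share:.0%} of user address space")
```

Under these assumptions, 1000 leaked threads already reserve close to
half of the address space for stacks alone, so running out of room
shortly afterwards is plausible.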
To rule out other threading issues, I also switched to
single-threaded testing:
- With CCL trunk: I'm running the benchmark script with -c1, i.e. only
a single thread. And on the server side, the thread id callback only
registers a single hunchentoot thread, which makes sense.
But it dies very quickly with a guard page violation (does that mean
stack overflow?):
Microsoft (R) Windows Debugger Version 6.12.0002.633 X86
Copyright (c) Microsoft Corporation. All rights reserved.
*** wait with pending attach
Symbol search path is: *** Invalid ***
****************************************************************************
* Symbol loading may be unreliable without a symbol search path. *
* Use .symfix to have the debugger choose a symbol path. *
* After setting your symbol path, use .reload to refresh symbol locations. *
****************************************************************************
Executable search path is:
ModLoad: 00010000 000f0000 c:\Program Files\ccl\wx86cl.exe
[... many lines of DLL info elided ...]
ModLoad: 779e0000 77b1d000 C:\Windows\SYSTEM32\ntdll.dll
[...]
ntdll!DbgBreakPoint:
77a13370 cc int 3
[let's tell windbg to ignore AccessViolation and to break on guard page
violations]
0:005> sx i av
0:005> sx e gp
0:005> g
[at this point, ApacheBench makes its connection]
(b40.b50): Guard page violation - code 80000001 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=03180024 ecx=03180040 edx=00000000 esi=77ab823c edi=0036fd08
eip=77a08ab2 esp=0317ffa8 ebp=0318000c iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010202
ntdll!TpSetTimer+0x1ac:
77a08ab2 56 push esi
[we're now in what windbg calls thread 7, and it certainly doesn't like
the stack]
0:007> k
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
0318000c 00000000 ntdll!TpSetTimer+0x1ac
For comparison, if I set a breakpoint on TpSetTimer before running the
code, the only call I'm seeing has a good stack trace (and is not in the
thread that later fails):
Breakpoint 0 hit
eax=004dc728 ebx=77a0f4f4 ecx=024ffde8 edx=00000004 esi=76a47378 edi=00000000
eip=77a08906 esp=024ffde4 ebp=024ffe04 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206
ntdll!TpSetTimer:
77a08906 8bff mov edi,edi
0:002> k
*** ERROR: Symbol file could not be found. Defaulted to export symbols for C:\Windows\system32\RPCRT4.dll -
ChildEBP RetAddr
WARNING: Stack unwind information not available. Following frames may be wrong.
024ffde0 769cce6a ntdll!TpSetTimer
024ffe04 77a0f4cb RPCRT4!NdrUserMarshalMemorySize+0x71b
024ffe28 77a0e6f9 ntdll!TpDisassociateCallback+0xf4
024fff88 76891194 ntdll!RtlIsCriticalSectionLockedByThread+0x474
024fff94 77a3b429 kernel32!BaseThreadInitThunk+0x12
024fffd4 77a3b3fc ntdll!RtlInitializeExceptionChain+0x63
024fffec 00000000 ntdll!RtlInitializeExceptionChain+0x36
d.