[Openmcl-devel] CCL crash on windows (with multithreading and ffi callbacks)

Tue May 3 09:23:24 PDT 2011

Hi Anton,

Quoting Anton Vodonosov (avodonosov at yandex.ru):
> > So, the problem's either the callback, or it isn't ...
> 
> Yes, precisely )

If we have stack or heap corruption due to bugs in CL+SSL, it wouldn't
be out of the ordinary for some Lisp/OS/ISA combination to be affected
much more often than others.  It's easy to think "this happens only on
CCL/Win32, so it must be a CCL bug, right?" when in reality SBCL and
CCL/Linux might just happen not to notice the corruption.

[...]
> > Is the symptom consistently a crash into the kernel debugger with a
> > complaint that an exception occurred in foreign code (it may say
> > "on foreign stack") ?  Or does it die in a variety of ways ?
> > If it does die in foreign code, please send me the output of the
> > kernel debugger's 'r' command.  I probably won't be able to tell
> > what foreign code it's in, but might be able to tell if it's in
> > CCL's GC or elsewhere.
> 
> It doesn't fall into the kernel debugger. Windows just shows a message
> box like "The program wx86cl64.exe is crashed" (it's an approximate translation,
> the actual message is in Russian). and just removes the process. 
> Nothing appears in the CCL console.
> 
> It is always the same way.

After very, very brief testing, here are my results with 32-bit CCL on
Windows 7, in the vague hope that they might be useful to someone:

  - With CCL 1.6 and a one-thread-per-connection hunchentoot taskmaster
    it dies with an out-of-memory error soon after the first
    1000-something threads have been created.

    This is obviously not the problem we are looking for, and with 1 MB
    stack space per thread and 1000 threads started, it's to be
    expected that the address space and/or actual memory in my VM would
    be getting crowded on a 32-bit system.

    I don't know why those threads aren't being cleaned up properly, but
    there was recently talk on the list about a similar problem, so I
    upgraded to trunk.  For whatever it's worth, this error didn't
    happen again.

    To rule out other threading issues, I also switched to
    single-threaded testing:

  - With CCL trunk: I'm running the benchmark script with -c1, i.e. only
    a single thread.  And on the server side, the thread id callback only
    registers a single hunchentoot thread, which makes sense.

    But it dies very quickly with a guard page violation (does that mean
    stack overflow?):

  *** wait with pending attach
  Symbol search path is: *** Invalid ***
  ****************************************************************************
  * Symbol loading may be unreliable without a symbol search path.           *
  * Use .symfix to have the debugger choose a symbol path.                   *
  * After setting your symbol path, use .reload to refresh symbol locations. *
  ****************************************************************************
  Executable search path is: 
  ModLoad: 00010000 000f0000   c:\Program Files\ccl\wx86cl.exe

[... many lines of DLL info elided ...] 

  ModLoad: 779e0000 77b1d000   C:\Windows\SYSTEM32\ntdll.dll

[...]

  ntdll!DbgBreakPoint:
  77a13370 cc              int     3

[let's tell windbg to ignore AccessViolation and to break on guard page
violations]

  0:005> sx i av
  0:005> sx e gp
  0:005> g

[at this point, ApacheBench makes its connection]

  (b40.b50): Guard page violation - code 80000001 (first chance)
  First chance exceptions are reported before any exception handling.
  This exception may be expected and handled.
  eax=00000000 ebx=03180024 ecx=03180040 edx=00000000 esi=77ab823c edi=0036fd08
  eip=77a08ab2 esp=0317ffa8 ebp=0318000c iopl=0         nv up ei pl nz na po nc
  cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202
  ntdll!TpSetTimer+0x1ac:
  77a08ab2 56              push    esi

[we're now in what windbg calls thread 7, and it certainly doesn't like
the stack]

  0:007> k
  ChildEBP RetAddr  
  WARNING: Stack unwind information not available. Following frames may be wrong.
  0318000c 00000000 ntdll!TpSetTimer+0x1ac
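
[A small arithmetic sketch of where that faulting push lands, using only
the esp and ebp values from the register dump above.  It assumes the
usual 4 KiB x86 page size, and it says nothing about why that page is
marked as a guard page.  For reference, code 80000001 is
STATUS_GUARD_PAGE_VIOLATION, whereas Windows normally reports an
exhausted stack as c00000fd (STATUS_STACK_OVERFLOW), so the code alone
doesn't settle the stack overflow question.]

```python
PAGE_SIZE = 0x1000  # assumption: standard 4 KiB pages on 32-bit x86


def page_base(addr):
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


# Values copied from the register dump at the guard page violation.
esp = 0x0317FFA8
ebp = 0x0318000C

# On 32-bit x86, `push esi` decrements esp by 4 and stores at the new esp.
write_addr = esp - 4

print(f"push writes to {write_addr:#010x}, page {page_base(write_addr):#010x}")
print(f"ebp points into page {page_base(ebp):#010x}")
print(f"write is {page_base(ebp) - write_addr:#x} bytes below that page")
```

So the push touches the page immediately below the one ebp points into,
0x5c bytes under the 0x03180000 boundary.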

For comparison, if I set a breakpoint on TpSetTimer before running the
code, the only call I'm seeing has a good stacktrace (and is not in the
thread that later fails).

  Breakpoint 0 hit
  eax=004dc728 ebx=77a0f4f4 ecx=024ffde8 edx=00000004 esi=76a47378 edi=00000000
  eip=77a08906 esp=024ffde4 ebp=024ffe04 iopl=0         nv up ei pl nz na pe nc
  cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000206
  ntdll!TpSetTimer:
  77a08906 8bff            mov     edi,edi
  0:002> k
  *** ERROR: Symbol file could not be found.  Defaulted to export symbols for C:\Windows\system32\RPCRT4.dll - 
  ChildEBP RetAddr  
  WARNING: Stack unwind information not available. Following frames may be wrong.
  024ffde0 769cce6a ntdll!TpSetTimer
  024ffe04 77a0f4cb RPCRT4!NdrUserMarshalMemorySize+0x71b
  024ffe28 77a0e6f9 ntdll!TpDisassociateCallback+0xf4
  024fff88 76891194 ntdll!RtlIsCriticalSectionLockedByThread+0x474
  024fff94 77a3b429 kernel32!BaseThreadInitThunk+0x12
  024fffd4 77a3b3fc ntdll!RtlInitializeExceptionChain+0x63
  024fffec 00000000 ntdll!RtlInitializeExceptionChain+0x36
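
PS: As a rough check of the out-of-memory result in the first item
above, here is the arithmetic as a sketch.  The 1 MB per thread and the
1000 threads are the figures from my test; the 2 GiB of user address
space is an assumption (the default for a 32-bit Windows process), and
the function name is made up for illustration.  It only counts one stack
per thread and ignores the Lisp heap, DLLs and anything else CCL
reserves, so the real picture is tighter than this.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def stack_reservation(threads, stack_bytes_per_thread, user_space_bytes):
    """Return (bytes reserved for stacks, fraction of user address space)."""
    reserved = threads * stack_bytes_per_thread
    return reserved, reserved / user_space_bytes


# Assumption: 2 GiB of user-mode address space for a 32-bit process.
reserved, fraction = stack_reservation(1000, 1 * MIB, 2 * GIB)
print(f"{reserved // MIB} MiB reserved, {fraction:.0%} of user address space")
```

That prints "1000 MiB reserved, 49% of user address space", i.e. about
half of the address space is gone for thread stacks alone.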

d.