[Openmcl-devel] Crash on x64 ccl when loading weblocks

Fri Dec 3 05:42:24 PST 2010

On Fri, 3 Dec 2010, Ralph Moritz wrote:

> Thanks for explaining Gary. I've tried debugging with gdb & results are
> similar to VS, but now we have a function name:
> RtlGenerate8dot3Name. Unfortunately, I'm stuck at this point since I don't
> know what foreign code is calling RtlGenerate8dot3Name since the addresses
> further up the stack seem to be bogus. FYI I'm running the latest ccl trunk.

While trying to figure out where the address #x2b3ca (from the backtrace)
was, it didn't occur to me to look at the trunk wx86cl64.exe.  When I do,
I see:

(gdb) x/9i freeGCptrs
0x2b3b0 <freeGCptrs>:	push   %rsi
0x2b3b1 <freeGCptrs+1>:	push   %rbx
0x2b3b2 <freeGCptrs+2>:	sub    $0x28,%rsp
0x2b3b6 <freeGCptrs+6>:	mov    0x17dd3(%rip),%rcx        # 0x43190 <postGCptrs>
0x2b3bd <freeGCptrs+13>:	test   %rcx,%rcx
0x2b3c0 <freeGCptrs+16>:	je     0x2b3d2 <freeGCptrs+34>
0x2b3c2 <freeGCptrs+18>:	mov    (%rcx),%rbx
0x2b3c5 <freeGCptrs+21>:	callq  0x3ea30 <free>
0x2b3ca <freeGCptrs+26>:	test   %rbx,%rbx

So, that address looks a lot less suspect than it had; we're crashing
in a call to #_free in code that's routinely called after the GC runs.
The C source to this function is in ccl/lisp-kernel/gc-common.c, and
the relevant part of it looks like:

   [...]
   for (p = postGCptrs; p; p = next) {
     next = *((void **)p);
     free(p);
   }

postGCptrs is a linked list of foreign addresses of things allocated
by #_malloc; the code that we're crashing in is just walking that
list and calling #_free on each element.
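As a rough illustration of the list walk described above, here is a minimal, self-contained sketch in C. The names sketch_post_gc_ptrs, sketch_defer_free and sketch_free_deferred are hypothetical and are not the lisp kernel's; the way blocks get pushed onto the list is an assumption, and only the shape of the freeing loop follows the excerpt from gc-common.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's postGCptrs list head. */
static void *sketch_post_gc_ptrs = NULL;

/* Queue a malloc'ed block (at least sizeof(void *) bytes) for later
   freeing.  The first word of the block is reused as the "next" link,
   which is what the loop in the excerpt reads back.  How the real
   kernel pushes entries is an assumption here. */
static void sketch_defer_free(void *block)
{
    assert(block != NULL);
    *((void **)block) = sketch_post_gc_ptrs;
    sketch_post_gc_ptrs = block;
}

/* Walk the list and free every block; returns the number freed. */
static size_t sketch_free_deferred(void)
{
    size_t count = 0;
    void *p, *next;

    for (p = sketch_post_gc_ptrs; p != NULL; p = next) {
        next = *((void **)p);   /* read the link before releasing the block */
        free(p);
        count++;
    }
    sketch_post_gc_ptrs = NULL; /* the list is now empty */
    return count;
}
```

Note that if any block on such a list has already been passed to free elsewhere, the walk both reads a link out of freed memory and frees the block a second time, which is the kind of failure discussed next.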

That does narrow things down a bit, but the search space is still
pretty large.  Crashing here generally means one of:

1) we're trying to free something that's already been freed
2) the malloc heap is corrupt: something (quite possibly something
    unrelated to this code) has done a double free, or freed something
    not allocated by some variant of #_malloc, or written beyond the
    bounds of an allocated pointer, or otherwise scrambled things.
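To make the first case (and the "freed something not allocated by malloc" part of the second) concrete, here is a small sketch of a checking wrapper in C. It is not how any particular malloc debugging facility works, and the names sketch_checked_malloc and sketch_checked_free are hypothetical; it only shows the idea of reporting a bad free at the time it happens instead of crashing later.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MAX_LIVE 1024

/* Registry of pointers handed out by sketch_checked_malloc and not
   yet freed.  A fixed-size array keeps the sketch short. */
static void *sketch_live[SKETCH_MAX_LIVE];
static size_t sketch_live_count = 0;

static void *sketch_checked_malloc(size_t size)
{
    void *p;

    if (sketch_live_count == SKETCH_MAX_LIVE)
        return NULL;            /* registry full: refuse rather than lose track */
    p = malloc(size);
    if (p != NULL)
        sketch_live[sketch_live_count++] = p;
    return p;
}

/* Returns 1 if p was live and has now been freed.  Returns 0, without
   calling free, if p is not in the registry: a double free, a pointer
   that never came from sketch_checked_malloc, or NULL. */
static int sketch_checked_free(void *p)
{
    size_t i;

    for (i = 0; i < sketch_live_count; i++) {
        if (sketch_live[i] == p) {
            /* Remove the entry by moving the last one into its slot. */
            sketch_live[i] = sketch_live[--sketch_live_count];
            free(p);
            return 1;
        }
    }
    return 0;
}
```

A wrapper like this cannot see writes beyond the bounds of a block, and because malloc may hand out a freed address again, a stale pointer can look live once more; real heap checkers deal with both of those.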

The good news is that there are ways of persuading #_malloc to check
for these things and report inconsistencies near the time that they
occur (which may be significantly before the time that they cause
memory faults or other symptoms.)  The bad news is that I don't
remember how to do this on Windows and need to take a nap.  If no one
else remembers in the next ... several hours, I'll try to Google for
the information.

> There has to be some way to figure out what code is responsible for the
> crash! If you know any advanced debugging techniques that you could share
> with me that would be great.
> gdb session paste: http://paste.lisp.org/+2ICD/2
> 
> On 3 December 2010 12:08, Gary Byers <gb at clozure.com> wrote:
> 
>
>       On Fri, 3 Dec 2010, Ralph Moritz wrote:
>
>             I followed the instructions
>             in http://trac.clozure.com/ccl/wiki/CclUnderGdb &
>             adapted them for Visual
>             Studio. %rip is inside ntdll.dll (Windows kernel
>             functions).
>
>             The stacktrace is as follows:
>
>                 ntdll.dll!0000000077c01da0()
>                 msvcrt.dll!000007feff5010c4()
>                 wx86cl64.exe!000000000002b3ca()
>
>             Would this indicate that code within CCL is calling
>             a Windows function
>             incorrectly?
> 
> 
> I'm not sure that I trust the address 0x2b3ca; in every copy of
> wx86cl64.exe that I've looked at, that address is in the middle of
> some instruction.  The ability of MS tools to generate meaningful
> backtraces for GCC-compiled code (and vice versa) is pretty limited.
> If the call that led to the crash was indeed from that address, it
> -may- be in code that tries to free "GCable pointers" after the GC
> runs; a scenario like:
> 
> (let* ((p (make-gcable-record ...)))
>   (use p) ; somehow
>   (#_free p))
> 
> can cause problems where the GC tries to free the pointer again.
> This may be a red herring, since again I'm not sure that the
> reported address #x2b3ca is meaningful.
> 
> All foreign function calls go through a little bit of code in
> the lisp kernel, so seeing that something in wx86cl64.exe called
> into some other libraries -could- mean that some foreign function
> call is responsible.  If so, we can't tell which foreign function
> call, or whether it's a part of CCL itself or some lisp library or
> ...
> 
> CCL 1.5 on Windows had some general stability problems (related to
> threads and GC) and the Win64 FFI had problems passing arguments
> in some cases.  (IIRC, the cases in question had to do with functions
> whose first few args were a mixture of floating-point and non-FP
> values, and with functions whose first several arguments were floats.)
> We fixed these in the 1.5 svn tree; if you haven't already done so,
> it'd be a good idea to do an "svn update" of a 1.5 installation
> followed by a (rebuild-ccl :clean t), or to switch to the 1.6 prerelease.
> 
>
>       Regards,
>       Ralph