[Openmcl-devel] mystery SEGV starting 64bit ccl on linux

Thu Nov 18 15:02:45 PST 2010

When running under GDB, it's necessary to tell GDB to ignore the
signals that CCL's own exception-handling mechanisms handle.  See

<http://trac.clozure.com/ccl/wiki/CclUnderGdb>

(You basically need to tell GDB to "source" a platform-specific
.gdbinit file that tells it which signals should be quietly passed
to the application.)
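
As a rough sketch of what such a file does (this is an assumed example
for the SIGSEGV case discussed here, not the contents of the files on
that wiki page), GDB's "handle" command can mark a signal as one to be
passed through without stopping or printing anything:

```gdb
# Assumed example, not copied from CCL's own .gdbinit files:
# pass SIGSEGV straight to the program, without stopping or announcing it.
handle SIGSEGV pass nostop noprint
```

The platform-specific files referenced on the wiki page remain the
authoritative list of signals.  Note that with "nostop" in effect, GDB
will not stop on a genuine segfault either.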

What you're seeing here is very likely expected behavior: the 
initial lisp thread has started running, tries to allocate some
lisp object, finds that it doesn't have any memory to cons in,
and executes a software interrupt that (on Linux) maps to SIGSEGV.
GDB's proudly announcing that it's noticed this, but hasn't yet
passed the signal on to CCL's handler, which should try to give
the thread a chunk of memory to cons in, skip over the interrupt
instruction, and allow the thread to resume execution.

A relatively recent change in the Linux kernel (described in

<http://trac.clozure.com/ccl/ticket/731>)

causes one of the calls to mmap() that CCL makes at startup (the first one
that tries to map memory for a stack) to fail, since the address returned by
mmap() in that case isn't actually mapped.  This usually causes a very hard
crash on startup; I -think- that that happens pretty deterministically, but
I'm not sure of that (and if not, it'd be a likely explanation for sporadic
crashes on startup.)

That problem seems to be triggered by use of the MAP_GROWSDOWN option in
the call to mmap that allocates stacks.  MAP_GROWSDOWN doesn't do what we
thought it does, and we've removed it from the sources in svn in 1.5 and
later.

I don't know if that's the cause of the problem that you're (sometimes)
seeing; if you haven't already done so, it'd probably be a good idea to
remove MAP_GROWSDOWN from that mmap call (if only to remove a variable
from the equation.)

The output below indicates that you got past that, started to run
some lisp code, that lisp code consed, and GDB (mistakenly) thought
that that was notable.  It isn't.  We don't know whether you would
have run into some other problem after consing or whether what you've
been seeing is just the problem described in ticket 731; if you have
been seeing that problem, the workaround (removing MAP_GROWSDOWN)
seems to work reliably.

On Thu, 18 Nov 2010, Bit Twiddler wrote:

> I'm getting sporadic crashes starting ccl on various linuxes
> (CentOS 5.5, Scientific Linux 5.5, and Open Suse 11.3)
> 
> Current directory is ~/p/cl/ccl/1.5/release/ccl/
> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.1)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from
> /mnt/data1/home1/dsm/p/cl/ccl/1.5/release/ccl/lx86cl64...done.
> (gdb) run
> Starting program: /mnt/data1/home1/xxx/p/cl/ccl/1.5/release/ccl/lx86cl64
> [Thread debugging using libthread_db enabled]
> Reserving heap at 0x300000000000, size 0x8000000000
> Committing memory at 0x302000000000, size 0x540000
> Committing memory at 0x307c00000000, size 0xb000
> Committing memory at 0x307e3f800000, size 0xb000
> Committing memory at 0x302000540000, size 0x2000000
> Committing memory at 0x307c0000b000, size 0x40000
> Unprotecting memory at 0x307c0000b000, size 0x40000
> Committing memory at 0x307e3f80b000, size 0x40000
> Unprotecting memory at 0x307e3f80b000, size 0x40000
> Mapping stack of size 0x24d000
> Protecting memory at 0x2aaaaaacb000, size 0x1000
> Protecting memory at 0x2aaaaaacc000, size 0x19000
> Mapping stack of size 0x51000
> Protecting memory at 0x2aaaaad18000, size 0x10000
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000300000ab133f in ?? ()
> (gdb) bt
> #0  0x0000300000ab133f in ?? ()
> #1  0x000030200052f8af in ?? ()
> #2  0x00002aaaaad16ff0 in ?? ()
> #3? 0x00000000004122c4 in toplevel_loop () at ../x86-subprims64.s:60
> Backtrace stopped: frame did not save the PC
> (gdb)
> 
> I don't understand why gdb can't display the backtrace, I have
> the optimizer turned off, and -g turned on.
> 
> Previously, I was able to get a crash after a mmap call to allocate
> a stack segment, but now the program runs after that, and I wind
> up with the above situation.
> 
> Does anybody know of any debugging code that I can enable?
> I turned on DEBUG_MEMORY by setting it to 1 in memory.c