[Openmcl-devel] Problem rebuilding on Linux x8632 w gcc 4.6.2
Gary Byers
gb at clozure.com
Fri Feb 10 07:39:26 PST 2012
Something prints as a #<BOGUS OBJECT> because it fails some simple
consistency checks that the printer applies before printing things.
Those checks include tests like "if something appears to be in the
lisp heap, make sure that it's in an allocated part of the heap
and not in free space", "if it's not in the heap, it had better be
on a stack", "if its tag bits indicate that the first word at the
address it's referencing should be a header word which further
describes the object's type and size, make sure that that header
word is reasonable" and "if the tag bits say that the first word
shouldn't be a header, make sure that it isn't. These tests aren't
perfect (it's hard to know exactly where another thread's stack's
bounds are in general, because in general they're changing while
you look at them), but they catch a lot of things and it's generally
better to print something as a #<BOGUS OBJECT> than it would be to
try to print its contents if those tests fail.
There are generally a couple of ways in which something that'd print
that way can be produced.
1) By violating a DYNAMIC-EXTENT declaration.
(defun foo ()
(let* ((x (cons nil nil)))
(declare (dynamic-extent x))
x))
A DYNAMIC-EXTENT declaration says that the extent (lifetime) of the
initial value of the binding to which it applies is coterminous with
the scope of that binding, so the compiler could stack-allocate the
value (if it knows how) or otherwise try to exploit that information.
By returning that value in the code above, we're violating that claim.
The value returned might print as a BOGUS-OBJECT; it'd generally be
worse if it didn't (if whatever was later allocated on the stack where
X used to point happened to look plausible as a CONS cell.)
2) By introducing some form of memory corruption.
You can do this yourself by doing something incorrect (like storing
past the end of a vector) in unsafe code; a compiler bug might do this
for you. The GC does a lot of reading from and writing to heap memory
and if it doesn't do this correctly it can introduce memory corruption
(and in some cases can introduce a lot of it before anything notices
and starts displaying visible symptoms, so this can be very difficult
to track down.)
The GC's about the only part of CCL that's written in C and where we
spend much time; since we generally want you to spend as little time
there as we can get away with, it (and the rest of the kernel) are
compiled with -O2 to put a fairly high level of optimization in effect.
This actually makes a difference: without that, the GC can be measurably
slower. If -O2 makes it faster, it might be reasonable to assume that
higher optimization levels might make it faster still, but those higher
optimization levels can make code even harder to debug (in GDB) than -O2
does and they -may- in some cases introduce bugs (some rarely-exercised
optimizations interfere in some unpredictable way) and -O2 seems to be
a reasonable compromise. (Again, this pretty much only affects GC time
and - depending on what Lisp code is doing - GC time is usually going
to be a relatively small percentage of total execution time; we're just
talking about making that relatively small percentage a bit smaller.)
If I was dealing with a major new release of GCC, I might (in some
combination of fore- and hindsight) be less willing to trust the code
it produces at -O2; if it was a prerelease (and my hindsight was as
good as it seems to be at the moment) I might be even less likely
to trust it.
In the lisp kernel Makefile (I assume that that'd be
ccl/lisp-kernel/linuxx8632/Makefile in your case), you'll see (maybe a
screenful or so from the top) a line that says:
COPT = -O2
If you change that to
COPT = # -O2
and then do:
$ cd ccl/lisp-kernel/linuxx8632
$ make clean
$ make
you'll get a new ccl/lx86cl where the C code is less aggessively optimized
(and where the GC might be a little slower as a result); whether or not
your tests fail in that kernel will tell us a lot about the problem.
I should say (just for the record) that if the presence or absence of
-O2 seems to be the determining factor, that doesn't necessarily mean
that the bug's in GCC 4.6; it's entirely possible that there's some
code in CCL's GC that depends on implementation artifacts (like
evaluation order) in some way, and GCC 4.6 may just do things in a
different way than earlier versions did (and some CCL code may be
implicitly expecting the old behavior.) There's been a few times over
the years where GCC optimization levels seemed to affect semantics,
and I think that in all (or very nearly all) of those cases the
problem was in our code and not the C compiler. That said, most Linux
distributions and other OSes tend to ship with more nominally mature C
compilers, and the fact that the one that you're using identifies itself
as a "prerelease" version suggests that outright bugs at -O2 may not
be totally unexpected.
On Thu, 9 Feb 2012, Erik Pearson wrote:
> HI Fellow CCLispers,
>
> To cut to the chase, it looks like rebuilding CCL 1.7 in my specific
> environment creates a bad boy:
>
> The platform is:
>
> Linux archie 3.2.4-1-ARCH #1 SMP PREEMPT Sat Feb 4 11:21:15 UTC 2012
> i686 Intel(R) Atom(TM) CPU D525 @ 1.80GHz GenuineIntel GNU/Linux
>
> I'm using the latest CCL 1.7 from SVN: 1.7-r14925M
>
> My current gcc is: gcc (GCC) 4.6.2 20120120 (prerelease)
>
> What happens is that after a (rebuild :full t), I get the symptoms
> described (way) below -- which, to summarize, is predominantly value
> stack overflow and bogus object in a call to remote-socket-info.
>
> The testing that allowed me to come to this conclusion is that I do
> not encounter this error at all when using the binary from svn. The
> binary from svn has survived two tests of over 1million page hits with
> 50 concurrent siege requests (which does not result in 50 concurrent
> threads -- there are 10, which reach a concurrency in terms of thread
> execution somewhat below that.) The rebuilt 1.7 and 1.8 fail after
> several thousand to several hundred thousand.
>
> One little problem is that I need to rebuild at least one time because
> I have a little patch to with-encoded-cstrs (a new version called
> with-encoded-strings that also returns the length) that I need to have
> available for production.
>
> I compiled the lisp kernel for Linux x86 32 with all warnings, but it
> has been a long time since I've used gcc ...
>
> [epearson at archie linuxx8632]$ vi Makefile
> [epearson at archie linuxx8632]$ make
> m4 -DLINUX -DX86 -DX8632 -DHAVE_TLS -I../ ../pad.s | as --32 -o pad.o
> m4 -DLINUX -DX86 -DX8632 -DHAVE_TLS -I../ ../x86-spjump32.s | as --32
> -o x86-spjump32.o
> m4 -DLINUX -DX86 -DX8632 -DHAVE_TLS -I../ ../x86-spentry32.s | as
> --32 -o x86-spentry32.o
> m4 -DLINUX -DX86 -DX8632 -DHAVE_TLS -I../ ../x86-subprims32.s | as
> --32 -o x86-subprims32.o
> cc -include ../platform-linuxx8632.h -c ../pmcl-kernel.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o pmcl-kernel.o
> ../pmcl-kernel.c: In function ?allocate_lisp_stack?:
> ../pmcl-kernel.c:232:11: warning: variable ?overhead? set but not used
> [-Wunused-but-set-variable]
> ../pmcl-kernel.c: In function ?create_reserved_area?:
> ../pmcl-kernel.c:533:7: warning: variable ?h? set but not used
> [-Wunused-but-set-variable]
> ../pmcl-kernel.c: In function ?process_options?:
> ../pmcl-kernel.c:1171:19: warning: variable ?wval? set but not used
> [-Wunused-but-set-variable]
> cc -include ../platform-linuxx8632.h -c ../gc-common.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o gc-common.o
> ../gc-common.c: In function ?gc?:
> ../gc-common.c:1779:11: warning: format ?%d? expects argument of type
> ?int?, but argument 4 has type ?__time_t? [-Wformat]
> ../gc-common.c:1779:11: warning: format ?%d? expects argument of type
> ?int?, but argument 5 has type ?__suseconds_t? [-Wformat]
> ../gc-common.c:1784:19: warning: format ?%d? expects argument of type
> ?int?, but argument 5 has type ?__time_t? [-Wformat]
> ../gc-common.c:1784:19: warning: format ?%d? expects argument of type
> ?int?, but argument 6 has type ?__suseconds_t? [-Wformat]
> ../gc-common.c: In function ?reaphashv?:
> ../gc-common.c:265:14: warning: ?bitsp? may be used uninitialized in
> this function [-Wuninitialized]
> ../gc-common.c:260:34: warning: ?bitidx? may be used uninitialized in
> this function [-Wuninitialized]
> ../gc-common.c:259:13: warning: ?bits? may be used uninitialized in
> this function [-Wuninitialized]
> ../gc-common.c: In function ?mark_weak_hash_vector?:
> ../gc-common.c:409:14: warning: ?bitsp? may be used uninitialized in
> this function [-Wuninitialized]
> ../gc-common.c:404:34: warning: ?bitidx? may be used uninitialized in
> this function [-Wuninitialized]
> ../gc-common.c:403:13: warning: ?bits? may be used uninitialized in
> this function [-Wuninitialized]
> cc -include ../platform-linuxx8632.h -c ../x86-gc.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o x86-gc.o
> ../x86-gc.c: In function ?check_all_areas?:
> ../x86-gc.c:340:5: warning: enumeration value ?AREA_VOID? not handled
> in switch [-Wswitch]
> ../x86-gc.c:340:5: warning: enumeration value ?AREA_CSTACK? not
> handled in switch [-Wswitch]
> ../x86-gc.c:340:5: warning: enumeration value ?AREA_READONLY? not
> handled in switch [-Wswitch]
> ../x86-gc.c:340:5: warning: enumeration value ?AREA_STATIC_CONS? not
> handled in switch [-Wswitch]
> ../x86-gc.c: In function ?check_refmap_consistency?:
> ../x86-gc.c:1161:31: warning: variable ?prev? set but not used
> [-Wunused-but-set-variable]
> ../x86-gc.c: In function ?compact_dynamic_heap?:
> ../x86-gc.c:2020:89: warning: variable ?prev? set but not used
> [-Wunused-but-set-variable]
> ../x86-gc.c: In function ?update_managed_refs?:
> ../x86-gc.c:2754:21: warning: variable ?prev? set but not used
> [-Wunused-but-set-variable]
> ../x86-gc.c: In function ?purify?:
> ../x86-gc.c:2827:11: warning: variable ?new_pure_start? set but not
> used [-Wunused-but-set-variable]
> cc -include ../platform-linuxx8632.h -c ../bits.c -DLINUX -D_REENTRANT
> -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o bits.o
> cc -include ../platform-linuxx8632.h -c ../x86-exceptions.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o x86-exceptions.o
> ../x86-exceptions.c: In function ?handle_gc_trap?:
> ../x86-exceptions.c:192:3: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
> ../x86-exceptions.c: In function ?handle_exception?:
> ../x86-exceptions.c:1155:4: warning: implicit declaration of function
> ?handle_watch_trap? [-Wimplicit-function-declaration]
> ../x86-exceptions.c: In function ?signal_handler?:
> ../x86-exceptions.c:1422:5: warning: format ?%lx? expects argument of
> type ?long unsigned int?, but argument 5 has type ?natural? [-Wformat]
> ../x86-exceptions.c:1422:5: warning: format ?%lx? expects argument of
> type ?long unsigned int?, but argument 6 has type ?unsigned int?
> [-Wformat]
> ../x86-exceptions.c:1422:5: warning: format ?%lx? expects argument of
> type ?long unsigned int?, but argument 5 has type ?natural? [-Wformat]
> ../x86-exceptions.c:1422:5: warning: format ?%lx? expects argument of
> type ?long unsigned int?, but argument 6 has type ?unsigned int?
> [-Wformat]
> ../x86-exceptions.c: In function ?pc_luser_xp?:
> ../x86-exceptions.c:2560:9: warning: dereferencing type-punned pointer
> will break strict-aliasing rules [-Wstrict-aliasing]
> ../x86-exceptions.c:2607:13: warning: dereferencing type-punned
> pointer will break strict-aliasing rules [-Wstrict-aliasing]
> ../x86-exceptions.c: In function ?handle_exception?:
> ../x86-exceptions.c:1290:1: warning: control reaches end of non-void
> function [-Wreturn-type]
> cc -include ../platform-linuxx8632.h -c ../image.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o image.o
> ../image.c: In function ?load_openmcl_image?:
> ../image.c:372:2: warning: implicit declaration of function ?set_nil?
> [-Wimplicit-function-declaration]
> ../image.c:382:2: warning: implicit declaration of function
> ?make_dynamic_heap_executable? [-Wimplicit-function-declaration]
> ../image.c:410:9: warning: implicit declaration of function
> ?lower_heap_start? [-Wimplicit-function-declaration]
> ../image.c:415:2: warning: implicit declaration of function
> ?xMakeDataExecutable? [-Wimplicit-function-declaration]
> cc -include ../platform-linuxx8632.h -c ../thread_manager.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o thread_manager.o
> ../thread_manager.c: In function ?lock_futex?:
> ../thread_manager.c:271:5: warning: pointer targets in passing
> argument 1 of ?store_conditional? differ in signedness
> [-Wpointer-sign]
> ../thread_manager.c:37:1: note: expected ?natural *? but argument is
> of type ?signed_natural *?
> ../thread_manager.c: In function ?sem_wait_forever?:
> ../thread_manager.c:445:5: warning: implicit declaration of function
> ?gettimeofday? [-Wimplicit-function-declaration]
> ../thread_manager.c: In function ?cooperative_thread_startup?:
> ../thread_manager.c:1719:1: warning: control reaches end of non-void
> function [-Wreturn-type]
> ../thread_manager.c: In function ?create_stack?:
> ../thread_manager.c:1562:1: warning: control reaches end of non-void
> function [-Wreturn-type]
> cc -include ../platform-linuxx8632.h -c ../lisp-debug.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o lisp-debug.o
> ../lisp-debug.c: In function ?debug_get_string_value?:
> ../lisp-debug.c:808:5: warning: implicit declaration of function
> ?__fpurge? [-Wimplicit-function-declaration]
> ../lisp-debug.c: In function ?debug_thread_info?:
> ../lisp-debug.c:887:5: warning: format ?%lx? expects argument of type
> ?long unsigned int?, but argument 3 has type ?struct TCR *? [-Wformat]
> ../lisp-debug.c:889:13: warning: format ?%lx? expects argument of type
> ?long unsigned int?, but argument 3 has type ?char *? [-Wformat]
> ../lisp-debug.c:889:13: warning: format ?%lx? expects argument of type
> ?long unsigned int?, but argument 4 has type ?char *? [-Wformat]
> ../lisp-debug.c:891:13: warning: format ?%lx? expects argument of type
> ?long unsigned int?, but argument 3 has type ?unsigned int? [-Wformat]
> ../lisp-debug.c:891:13: warning: format ?%lx? expects argument of type
> ?long unsigned int?, but argument 4 has type ?unsigned int? [-Wformat]
> ../lisp-debug.c:903:15: warning: format ?%lx? expects argument of type
> ?long unsigned int?, but argument 3 has type ?natural? [-Wformat]
> ../lisp-debug.c: In function ?debug_show_fpu?:
> ../lisp-debug.c:1111:12: warning: unused variable ?i? [-Wunused-variable]
> ../lisp-debug.c:1111:8: warning: unused variable ?np? [-Wunused-variable]
> ../lisp-debug.c:1110:11: warning: unused variable ?dp? [-Wunused-variable]
> ../lisp-debug.c: In function ?bit_for_regnum?:
> ../lisp-debug.c:266:1: warning: control reaches end of non-void
> function [-Wreturn-type]
> cc -include ../platform-linuxx8632.h -c ../memory.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o memory.o
> ../memory.c: In function ?allocation_failure?:
> ../memory.c:45:3: warning: format ?%ld? expects argument of type ?long
> int?, but argument 4 has type ?natural? [-Wformat]
> ../memory.c: In function ?ReserveMemoryForHeap?:
> ../memory.c:154:3: warning: implicit declaration of function
> ?raise_limit? [-Wimplicit-function-declaration]
> cc -include ../platform-linuxx8632.h -c ../unix-calls.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o unix-calls.o
> m4 -DLINUX -DX86 -DX8632 -DHAVE_TLS -I../ ../x86-asmutils32.s | as
> --32 -o x86-asmutils32.o
> m4 -DLINUX -DX86 -DX8632 -DHAVE_TLS -I../ ../imports.s | as --32 -o imports.o
> cc -include ../platform-linuxx8632.h -c ../lispdcmd.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o lispdcmd.o
> cc -include ../platform-linuxx8632.h -c ../plprint.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o plprint.o
> cc -include ../platform-linuxx8632.h -c ../plsym.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o plsym.o
> cc -include ../platform-linuxx8632.h -c ../xlbt.c -DLINUX -D_REENTRANT
> -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o xlbt.o
> ../xlbt.c: In function ?walk_stack_frames?:
> ../xlbt.c:108:9: warning: format ?%lx? expects argument of type ?long
> unsigned int?, but argument 3 has type ?struct lisp_frame *?
> [-Wformat]
> ../xlbt.c:118:7: warning: format ?%x? expects argument of type
> ?unsigned int?, but argument 3 has type ?struct lisp_frame *?
> [-Wformat]
> ../xlbt.c:118:7: warning: format ?%x? expects argument of type
> ?unsigned int?, but argument 4 has type ?struct lisp_frame *?
> [-Wformat]
> ../xlbt.c: In function ?plbt_sp?:
> ../xlbt.c:157:5: warning: format ?%lx? expects argument of type ?long
> unsigned int?, but argument 3 has type ?struct TCR *? [-Wformat]
> ../xlbt.c:157:5: warning: format ?%lx? expects argument of type ?long
> unsigned int?, but argument 4 has type ?void *? [-Wformat]
> ../xlbt.c:143:19: warning: variable ?cs_area? set but not used
> [-Wunused-but-set-variable]
> cc -include ../platform-linuxx8632.h -c ../x86_print.c -DLINUX
> -D_REENTRANT -DX86 -DX8632 -D_GNU_SOURCE -DHAVE_TLS -DUSE_FUTEX
> -DSVN_REVISION=15201M -g -O2 -Wall -m32 -o x86_print.o
> cc -m32 -g -Wl,--export-dynamic "-Wl,--hash-style=sysv" -o
> ../../lx86cl pad.o x86-spjump32.o x86-spentry32.o x86-subprims32.o
> pmcl-kernel.o gc-common.o x86-gc.o bits.o x86-exceptions.o image.o
> thread_manager.o lisp-debug.o memory.o unix-calls.o x86-asmutils32.o
> imports.o lispdcmd.o plprint.o plsym.o xlbt.o x86_print.o -ldl -lm
> -lpthread
> [epearson at archie linuxx8632]$
>
>
> My originally symptoms whine is below, but may be irrelevant since
> this seems to be a rebuild issue.
>
> Erik.
>
>
>
>
> I'm having a really fun time trying track down an overflow error that
> keeps cropping up while I'm load testing my web server. It occurs in
> the latest 1.7 and 1.8 svn codebase.
> The load testing is via siege on the same machine (I know). The
> results are variable, but I've only had one test complete without
> achieving an overflow (1.2 million page hits with a siege concurrency
> level of 50), otherwise I can usually trigger it within a couple
> hundred thousand page hits. Higher concurrency results in faster
> crash.
> It is a threaded server (thread pooling) with the test set up for 10 threads.
> The stack overflow is normally not fatal to the server -- the thread
> catches the error and resets itself. However, I've also had memory
> corruption on in the call to remote-socket-info, with the results
> being things like:
>
> value #<BOGUS object @ #x195ABB5E> is not of the expected type
> (UNSIGNED-BYTE 32)
> value 1328550583 is not of the expected type (UNSIGNED-BYTE 32)
> and once
> Exception occurred while executing foreign code
> at create_exception_callback_frame + 400
> received signal 11; faulting address: 0x6d
> address not mapped to object
>
> Then again, maybe the seg fault was thrown because I turned debugging
> off (because it was exacerbating the stack overflow).
>
> The stack overflow is during some parsing code that is a bit
> recursive, but not terribly deep at the time. And it runs several
> hundred thousand times quite fine until one thread throws the error.
>
> The platform is:
>
> Linux archie 3.2.4-1-ARCH #1 SMP PREEMPT Sat Feb 4 11:21:15 UTC 2012
> i686 Intel(R) Atom(TM) CPU D525 @ 1.80GHz GenuineIntel GNU/Linux
>
> One clue is that a freshly checked out 1.7-r14925M was the only lisp
> that was able to handle the siege to completion (1.2 million hits).
> After the successful run I installed a little harmless patch (which
> was not invoked by the test code) and did a (rebuild-ccl :full t), and
> managed to only 300K+ page hits before a seg fault, which was
> preceded by several bogus object errors on remote-socket-info.
>
> I quite feeling the deep water at this point, but have tried my best
> to research this before hitting up the list.
>
> I've rewritten the server just about from scratch to eliminate old
> gnarly code (plenty of that).
>
> Inspected the code for remote-socket-info and friends, wondering if
> something allocated and passed out of a function was going out of
> scope but only sometimes being trapped on the wrong side of a gc
> (feeling the deep water).
>
> Had some theories, most of which were just good learning experiences
> (one I haven't resolved yet -- whether a pref of an rletted :signed
> object can be returned from a function?)
>
> Now I'm having strange ideas about the recompilation of ccl itself --
> perhaps the build from svn has a proclaim optimize that the vanilla
> rebuild-ccl usage doesn't? Is it the compiler version installed on my
> Arch Linux? gcc (GCC) 4.6.2 20120120 (prerelease). Oh, look, it says
> "prerelease". I hadn't noticed bfore. I can test with a precompiled
> 1.7, but 1.8 trunk doesn't work out of the box for me -- I need to
> rebuild to get asdf to be happy.
>
> Any ideas welcome.
>
> Thanks,
> Erik.
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>
More information about the Openmcl-devel
mailing list