[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Sun Nov 3 19:55:09 PST 2013

I tried running your example in an infinite loop on another Core-i7 machine.
After about an hour, it crashed in the way that you describe.  I poked around
a bit in GDB but wasn't sure what I was seeing; the C code in the CCL kernel
(including the GC) is usually compiled with the -O2 option, which makes it
run faster but makes debugging harder.

I figured that things would go faster if debugging was easier, so I rebuilt
the kernel without -O2 and tried again.  It's been running for over 24 hours
at this point without incident.

Aside from being yet another example of the famous Heisenbug Uncertainty
Principle, this suggests that how the C code is compiled (by what version
of what compiler and at what optimization settings) may have something to
do with the problem (or at least the frequency at which it occurs.)

I'm curious as to whether building the kernel without -O2 causes things to
behave differently for you.  To test this:

$ cd ccl/lisp-kernel/linuxx8664
Edit the Makefile in that directory, changing the line:

COPT = -O2

to

COPT = #-O2

$ make clean
$ make
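
If you'd rather script the Makefile edit than do it by hand, something like
the following should work.  It's only a sketch: it assumes the Makefile has a
line reading exactly "COPT = -O2", and the helper name disable_o2 is made up
for illustration (it isn't part of the CCL build).

```shell
# Hypothetical helper: comment out the -O2 setting in a kernel Makefile.
# Assumes the file contains a line reading exactly "COPT = -O2".
disable_o2() {
    makefile="$1"
    # Keep a copy of the original so the optimization setting is easy to restore.
    cp "$makefile" "$makefile.orig" &&
    sed 's/^COPT = -O2$/COPT = #-O2/' "$makefile.orig" > "$makefile"
}

# Usage, from the directory named above:
#   cd ccl/lisp-kernel/linuxx8664
#   disable_o2 Makefile && make clean && make
```

Copying Makefile.orig back over Makefile and rebuilding restores the
optimized kernel.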

If the problem still occurs for you with the same frequency that it's been occurring
on your Xeons, that'd tell us something (that the differences between the Xeon and
other x8664 machines have more to do with the frequency with which the problem
occurs than compiler issues do.)  If that change masks or avoids the problem, that'd
tell us a bit less.  In either case, if you can try this experiment it'd be good to
know the results.

If the processor difference remains a likely candidate, it'd be helpful to know
the exact model number of the (smaller, 4-core) Xeon machine where the problem
occurs (frequently) for you.  Doing

$ cat /proc/cpuinfo

may list this info under "model name" for each core.
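
Since that output repeats once per core, it may be easier to boil it down
to one line per distinct model.  A rough sketch, assuming the usual Linux
"model name : ..." layout; the function name cpu_models is just for
illustration:

```shell
# Print each distinct CPU model name, prefixed by the number of logical
# cores that report it.  Reads /proc/cpuinfo unless a file is given.
cpu_models() {
    grep '^model name' "${1:-/proc/cpuinfo}" |
        cut -d: -f2- |      # keep everything after the first colon
        sed 's/^ *//' |     # drop the leading space(s)
        sort | uniq -c
}

# Usage:
#   cpu_models
```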

I've been able to reproduce the problem twice on Core i7 machines in a few days
of trying, and it'd likely be easiest for me to understand and fix if it was easier
for me to reproduce.

On Thu, 31 Oct 2013, Paul Meurer wrote:

> 
> Am 31.10.2013 um 01:15 schrieb Gary Byers <gb at clozure.com>:
>
>       On Wed, 30 Oct 2013, Paul Meurer wrote:
>             I run it now with --no-init and in the shell, with no difference. Immediate failure with
>             :consing in *features*,
>             bogus objects etc. after several rounds without :consing.
> 
>
>       So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
>       anyone reading this won't be subjected to me doing so ?
>
>       Oh well.
>
>       I was able to reproduce the problem by running your test 100 times,
> 
> 
> I am not able to provoke it at all on the MacBook, and I tried a lot.
>
>       so apparently
>       I won't be able to blame this on some aspect of your machine.  (Also unfortunate,
>       since my ability to diagnose problems that only occur on 16-core machines depends
>       on my ability to borrow such machines for a few months.)
> 
> 
> I think you can do without a 16-core machine. I am able to reproduce the failure quite reliably on an older 4-core
> machine with Xeon CPUs and SuSE, with slightly different code (perhaps to get the timing right):
> 
> (dotimes (j 100)
>   (print (ccl::all-processes))
>   (dotimes (i 8)
>     (process-run-function
>      (format nil "getstring-~a-~a" j i)
>      (lambda (i)
>        (let ((list ()))
>          (dotimes (i 500000)
>            (push (getstring) list)))
>        (print i))
>      i))
>   (print (list :done j))
>   (sleep 1))
> 
> If you really need a 16-core machine to debug this I can give you access to mine. :-)
>
>             My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a recent CCL
>             1.9 (I did svn update +
>             rebuild of everything yesterday).
>             I also observed that the problem goes away when I replace the constant string in the
>             library by a freshly
>             allocated string:
>             char *getstring() {
>               int index;
>               char *buffer = (char *)calloc(100 + 1, sizeof(char));
>               for (index = 0; index < 100; index++) {
>                   buffer[index] = 'a';
>                 }
>               buffer[100] = '\0';
>               return buffer ;
>             }
>             One should expect the strings in the Postgres library to be freshly allocated, but
>             nevertheless they behave like
>             the constant string example.
> 
>
>       It's unlikely that this change directly avoids the bug (whatever it is); it's more
>       likely that it affects timing (exactly what happens when.)  I don't yet know what
>       the bug is, but I think that it's likely that it's fair to characterize the bug
>       as being "timing-sensitive".  (For example: from the GC's point of view, whether
>       a thread is running Lisp or foreign code when that thread is suspended by the GC.
>       The transition between Lisp and foreign code takes a few instructions, and if
>       a thread is suspended in the middle of that instruction sequence and the GC
>       misinterprets its state, very bad things like what you're seeing could occur.
>       That's not supposed to be possible, but something broadly similar seems to be
>       happening.)
> 
> 
> --
> Paul
> 
> 
>