[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Mon Nov 4 12:23:09 PST 2013

I did do the experiment you proposed.
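
For reference, the rebuild without -O2 can be scripted roughly as below. This is only a sketch: it assumes the Makefile contains exactly the line "COPT = -O2" quoted further down, and disable_o2 is a hypothetical helper name, not something that ships with CCL.

```shell
# Sketch only: assumes the Makefile contains the exact line "COPT = -O2".
# disable_o2 is a hypothetical helper name, not part of CCL.
disable_o2() {
    makefile="$1"
    # Comment out the optimization flag; sed keeps a backup as <file>.bak.
    sed -i.bak 's/^COPT = -O2[[:space:]]*$/COPT = #-O2/' "$makefile"
}

# Usage (from the directory that contains the ccl checkout):
#   cd ccl/lisp-kernel/linuxx8664 && disable_o2 Makefile && make clean && make
```

If the COPT line is written differently in a given checkout, the pattern matches nothing and the Makefile is left unchanged.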

On the older 4-core Xeon machine, the crashes seem to have become somewhat less frequent, but this may not be significant because I didn't run enough iterations. A crash now occurs at most once in every 50 iterations on average, which is perhaps not often enough for convenient debugging.

Here are the specs:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
stepping        : 6
cpu MHz         : 2327.528
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 4659.22
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

On the 16-core machine (16 including hyperthreading), nothing seems to have changed. It has two CPUs with these specs:

	Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/

or from /proc/cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
stepping        : 7
cpu MHz         : 3301.000
cache size      : 10240 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 6583.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
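
A shorter way to report just the model name and the number of logical processors is sketched below. cpu_summary is a hypothetical helper name, and the sketch assumes the usual Linux /proc/cpuinfo layout shown above.

```shell
# Sketch only: cpu_summary is a hypothetical helper, not a standard tool.
# It reads /proc/cpuinfo by default, or a file given as the first argument.
cpu_summary() {
    file="${1:-/proc/cpuinfo}"
    # Print the first "model name" line (it is the same for every core here).
    grep -m1 '^model name' "$file"
    # Each logical processor has its own "processor : N" entry.
    printf 'logical processors: %s\n' "$(grep -c '^processor' "$file")"
}

# Usage:
#   cpu_summary
```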

I also ran the code on an AMD Opteron machine, but no crashes occurred there, as far as I can see (after 50 iterations).

> I tried running your example in an infinite loop on another Core-i7 machine.
> After about an hour, it crashed in the way that you describe.  I poked around
> a bit in GDB but wasn't sure what I was seeing; the C code in the CCL kernel
> (including the GC) is usually compiled with the -O2 option, which makes it
> run faster but makes debugging harder.
> 
> I figured that things would go faster if debugging was easier, so I rebuilt
> the kernel without -O2 and tried again.  It's been running for over 24 hours
> at this point without incident.
> 
> Aside from being yet another example of the famous Heisenbug Uncertainty
> Principle, this suggests that how the C code is compiled (by what version
> of what compiler and at what optimization settings) may have something to
> do with the problem (or at least the frequency at which it occurs.)
> 
> I'm curious as to whether building the kernel without -O2 causes things to
> behave differently for you.  To test this:
> 
> $ cd ccl/lisp-kernel/linuxx8664
> Edit the Makefile in that directory, changing the line:
> 
> COPT = -O2
> 
> to
> 
> COPT = #-O2
> 
> $ make clean
> $ make
> 
> If the problem still occurs for you with the same frequency that it's been occurring
> on your Xeons, that'd tell us something (that the differences between the Xeon and
> other x8664 machines have more to do with the frequency with which the problem
> occurs than compiler issues do.)  If that change masks or avoids the problem, that'd
> tell us a bit less.  In either case, if you can try this experiment it'd be good to
> know the results.
> If the processor difference remains a likely candidate, it'd be helpful to know
> the exact model number of the (smaller, 4-core) Xeon machine where the problem
> occurs (frequently) for you.  Doing
> 
> $ cat /proc/cpuinfo
> 
> may list this info under "model name" for each core.
> 
> I've been able to reproduce the problem twice on Core i7 machines in a few days
> of trying, and it'd likely be easiest for me to understand and fix if it was easier
> for me to reproduce.
> 
> On Thu, 31 Oct 2013, Paul Meurer wrote:
> 
>> On 31.10.2013 at 01:15, Gary Byers <gb at clozure.com> wrote:
>> 
>>      On Wed, 30 Oct 2013, Paul Meurer wrote:
>>            I run it now with --no-init and in the shell, with no difference. Immediate failure with
>>            :consing in *features*,
>>            bogus objects etc. after several rounds without :consing.
>> 
>>      So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
>>      anyone reading this won't be subjected to me doing so ?
>> 
>>      Oh well.
>> 
>>      I was able to reproduce the problem by running your test 100 times,
>> I am not able to provoke it at all on the MacBook, and I tried a lot.
>> 
>>      so apparently
>>      I won't be able to blame this on some aspect of your machine.  (Also unfortunate,
>>      since my ability to diagnose problems that only occur on 16-core machines depends
>>      on my ability to borrow such machines for a few months.)
>> I think you can do without a 16-core machine. I am able to reproduce the failure quite reliably on an older 4-core
>> machine with Xeon CPUs and SuSE, with slightly different code (perhaps to get the timing right):
>> (dotimes (j 100)
>>   (print (ccl::all-processes))
>>   (dotimes (i 8)
>>     (process-run-function
>>      (format nil "getstring-~a-~a" j i)
>>      (lambda (i)
>>        (let ((list ()))
>>          (dotimes (i 500000)
>>            (push (getstring) list)))
>>        (print i))
>>      i))
>>   (print (list :done j))
>>   (sleep 1))
>> If you really need a 16-core machine to debug this I can give you access to mine. :-)
>> 
>>            My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a recent CCL
>>            1.9 (I did svn update +
>>            rebuild of everything yesterday).
>>            I also observed that the problem goes away when I replace the constant string in the
>>            library by a freshly
>>            allocated string:
>>            char *getstring() {
>>              int index;
>>              char *buffer = (char *)calloc(100 + 1, sizeof(char));
>>              for (index = 0; index < 100; index++) {
>>                buffer[index] = 'a';
>>              }
>>              buffer[100] = '\0';
>>              return buffer;
>>            }
>>            One should expect the strings in the Postgres library to be freshly allocated, but
>>            nevertheless they behave like
>>            the constant string example.
>> 
>>      It's unlikely that this change directly avoids the bug (whatever it is); it's more
>>      likely that it affects timing (exactly what happens when.)  I don't yet know what
>>      the bug is, but I think that it's likely that it's fair to characterize the bug
>>      as being "timing-sensitive".  (For example: from the GC's point of view, whether
>>      a thread is running Lisp or foreign code when that thread is suspended by the GC.
>>      The transition between Lisp and foreign code takes a few instructions, and if
>>      a thread is suspended in the middle of that instruction sequence and the GC
>>      misinterprets its state, very bad things like what you're seeing could occur.
>>      That's not supposed to be possible, but something broadly similar seems to be
>>      happening.)
>> -- 
>> Paul
>> 

-- 
Paul
