[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Mon Nov 4 19:50:42 PST 2013

Thanks.

Coincidentally, clozure.com crashed today; the hosting service moved it to
a new machine (an 8-core Xeon).  Hmmm ...

On Mon, 4 Nov 2013, Paul Meurer wrote:

> I did do the experiment you proposed.
> 
> On the older Xeon 4-core machine, the crashes become somewhat less frequent still, but this might be insignificant
> because I didn't run enough iterations. A crash occurs no more often than every 50th iteration on average. Perhaps not
> often enough for convenient debugging.
> 
> Here are the specs:
> 
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 15
> model name      : Intel(R) Xeon(R) CPU            5140  @ 2.33GHz
> stepping        : 6
> cpu MHz         : 2327.528
> cache size      : 4096 KB
> physical id     : 0
> siblings        : 2
> core id         : 0
> cpu cores       : 2
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
> fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
> bogomips        : 4659.22
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
> 
> 
> On the 16-core machine (16 inc. hyperthreading), nothing seems to have changed. The latter has two CPUs with these
> specs:
> 
> Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/
> 
> or from /proc/cpuinfo:
> 
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 45
> model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
> stepping        : 7
> cpu MHz         : 3301.000
> cache size      : 10240 KB
> physical id     : 0
> siblings        : 8
> core id         : 0
> cpu cores       : 4
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
> sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2
> x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
> bogomips        : 6583.99
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:
> 
> I also ran the code on an AMD Opteron machine, but no crashes occur there, as far as I can see (after 50
> iterations).
>
>       I tried running your example in an infinite loop on another Core-i7 machine.
>       After about an hour, it crashed in the way that you describe.  I poked around
>       a bit in GDB but wasn't sure what I was seeing; the C code in the CCL kernel
>       (including the GC) is usually compiled with the -O2 option, which makes it
>       run faster but makes debugging harder.
>
>       I figured that things would go faster if debugging was easier, so I rebuilt
>       the kernel without -O2 and tried again.  It's been running for over 24 hours
>       at this point without incident.
>
>       Aside from being yet another example of the famous Heisenbug Uncertainty
>       Principle, this suggests that how the C code is compiled (by what version
>       of what compiler and at what optimization settings) may have something to
>       do with the problem (or at least the frequency at which it occurs.)
>
>       I'm curious as to whether building the kernel without -O2 causes things to
>       behave differently for you.  To test this:
>
>       $ cd ccl/lisp-kernel/linuxx8664
>       Edit the Makefile in that directory, changing the line:
>
>       COPT = -O2
>
>       to
>
>       COPT = #-O2
>
>       $ make clean
>       $ make
>
>       If the problem still occurs for you with the same frequency that it's been occurring
>       on your Xeons, that'd tell us something (that the differences between the Xeon and
>       other x8664 machines have more to do with the frequency with which the problem
>       occurs than compiler issues do.)  If that change masks or avoids the problem, that'd
>       tell us a bit less.  In either case, if you can try this experiment it'd be good to
>       know the results.
>
>       If the processor difference remains a likely candidate, it'd be helpful to know
>       the exact model number of the (smaller, 4-core) Xeon machine where the problem
>       occurs (frequently) for you.  Doing
>
>       $ cat /proc/cpuinfo
>
>       may list this info under "model name" for each core.
>
>       I've been able to reproduce the problem twice on Core i7 machines in a few days
>       of trying, and it'd likely be easiest for me to understand and fix if it was easier
>       for me to reproduce.
>
>       On Thu, 31 Oct 2013, Paul Meurer wrote:
>
>             On 31.10.2013 at 01:15, Gary Byers <gb at clozure.com> wrote:
>
>                  On Wed, 30 Oct 2013, Paul Meurer wrote:
>                        I run it now with --no-init and in the shell, with no difference. Immediate failure with
>                        :consing in *features*, bogus objects etc. after several rounds without :consing.
>
>                  So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
>                  anyone reading this won't be subjected to me doing so.
>
>                  Oh well.
>
>                  I was able to reproduce the problem by running your test 100 times,
>             I am not able to provoke it at all on the MacBook, and I tried a lot.
>
>                  so apparently
>                  I won't be able to blame this on some aspect of your machine.  (Also unfortunate,
>                  since my ability to diagnose problems that only occur on 16-core machines depends
>                  on my ability to borrow such machines for a few months.)
>             I think you can do without a 16-core machine. I am able to reproduce the failure quite reliably
>             on an older 4-core machine with Xeon CPUs and SuSE, with slightly different code (perhaps to get
>             the timing right):
>             (dotimes (j 100)
>               (print (ccl::all-processes))
>               (dotimes (i 8)
>                 (process-run-function
>                  (format nil "getstring-~a-~a" j i)
>                  (lambda (i)
>                    (let ((list ()))
>                      (dotimes (i 500000)
>                        (push (getstring) list)))
>                    (print i))
>                  i))
>               (print (list :done j))
>               (sleep 1))
>             If you really need a 16-core machine to debug this I can give you access to mine. :-)
>
>                        My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a recent CCL
>                        1.9 (I did svn update + rebuild of everything yesterday).
>                        I also observed that the problem goes away when I replace the constant string in the
>                        library by a freshly allocated string:
>                        char *getstring() {
>                          int index;
>                          char *buffer = (char *)calloc(100 + 1, sizeof(char));
>                          for (index = 0; index < 100; index++) {
>                            buffer[index] = 'a';
>                          }
>                          buffer[100] = '\0';
>                          return buffer;
>                        }
>                        One should expect the strings in the Postgres library to be freshly allocated, but
>                        nevertheless they behave like the constant string example.
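
To make the contrast in the quoted experiment concrete, here is a minimal C sketch of the two variants side by side. The constant-string version is not shown in this thread, so the name `getstring_constant` and its contents are assumptions made for illustration; `getstring_fresh` follows the quoted code, with a NULL check added and the fill done by `memset`.

```c
#include <stdlib.h>
#include <string.h>

#define GETSTRING_LEN 100

/* Assumed shape of the constant-string variant: every call returns a
   pointer to the same static storage.  The caller must not free it. */
const char *getstring_constant(void) {
    static const char text[] = "aaaaaaaaaa";
    return text;
}

/* Freshly allocated variant, as in the quoted code: each call returns a
   new heap block of GETSTRING_LEN 'a' characters.  The caller owns the
   block and is responsible for calling free() on it. */
char *getstring_fresh(void) {
    /* calloc zero-fills, so the final byte already acts as the terminator. */
    char *buffer = calloc(GETSTRING_LEN + 1, sizeof(char));
    if (buffer == NULL) {
        return NULL;
    }
    memset(buffer, 'a', GETSTRING_LEN);
    return buffer;
}
```

The only difference visible to the caller is ownership and pointer identity: the first variant hands back the same address on every call, while the second goes through the allocator each time. That extra work per call fits the suggestion that the change may mostly shift timing.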
>
>                  It's unlikely that this change directly avoids the bug (whatever it is); it's more
>                  likely that it affects timing (exactly what happens when.)  I don't yet know what
>                  the bug is, but I think that it's likely that it's fair to characterize the bug
>                  as being "timing-sensitive".  (For example: from the GC's point of view, whether
>                  a thread is running Lisp or foreign code when that thread is suspended by the GC.
>                  The transition between Lisp and foreign code takes a few instructions, and if
>                  a thread is suspended in the middle of that instruction sequence and the GC
>                  misinterprets its state, very bad things like what you're seeing could occur.
>                  That's not supposed to be possible, but something broadly similar seems to be
>                  happening.)
>             --
>             Paul
> 
> 
> --
> Paul
> 
> 
>