[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql
Gary Byers
gb at clozure.com
Mon Nov 4 19:50:42 PST 2013
Thanks.
Coincidentally, clozure.com crashed today; the hosting service moved it to
a new machine (an 8-core Xeon). Hmmm ...
On Mon, 4 Nov 2013, Paul Meurer wrote:
> I did do the experiment you proposed.
>
> On the older Xeon 4-core machine, the crashes become somewhat less frequent still, but this might be insignificant
> because I didn't run enough iterations. A crash occurs no more often than every 50th iteration on average. Perhaps not
> often enough for convenient debugging.
>
> Here are the specs:
>
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 15
> model name      : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
> stepping        : 6
> cpu MHz         : 2327.528
> cache size      : 4096 KB
> physical id     : 0
> siblings        : 2
> core id         : 0
> cpu cores       : 2
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 10
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
> fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
> bogomips        : 4659.22
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
>
> On the 16-core machine (16 inc. hyperthreading), nothing seems to have changed. The latter has two CPUs with these
> specs:
>
> Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/
>
> or from /proc/cpuinfo:
>
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 45
> model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
> stepping        : 7
> cpu MHz         : 3301.000
> cache size      : 10240 KB
> physical id     : 0
> siblings        : 8
> core id         : 0
> cpu cores       : 4
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
> sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2
> x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
> bogomips        : 6583.99
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:
>
> I also ran the code on an AMD Opteron machine, but no crashes occur there, as far as I can see (after 50
> iterations).
>
> I tried running your example in an infinite loop on another Core-i7 machine.
> After about an hour, it crashed in the way that you describe.  I poked around
> a bit in GDB but wasn't sure what I was seeing; the C code in the CCL kernel
> (including the GC) is usually compiled with the -O2 option, which makes it
> run faster but makes debugging harder.
>
> I figured that things would go faster if debugging was easier, so I rebuilt
> the kernel without -O2 and tried again.  It's been running for over 24 hours
> at this point without incident.
>
> Aside from being yet another example of the famous Heisenbug Uncertainty
> Principle, this suggests that how the C code is compiled (by what version
> of what compiler and at what optimization settings) may have something to
> do with the problem (or at least the frequency at which it occurs.)
>
> I'm curious as to whether building the kernel without -O2 causes things to
> behave differently for you.  To test this:
>
> $ cd ccl/lisp-kernel/linuxx8664
> Edit the Makefile in that directory, changing the line:
>
> COPT = -O2
>
> to
>
> COPT = #-O2
>
> $ make clean
> $ make
>
> If the problem still occurs for you with the same frequency that it's been occurring
> on your Xeons, that'd tell us something (that the differences between the Xeon and
> other x8664 machines have more to do with the frequency with which the problem
> occurs than compiler issues do.)  If that change masks or avoids the problem, that'd
> tell us a bit less.  In either case, if you can try this experiment it'd be good to
> know the results.
>
> If the processor difference remains a likely candidate, it'd be helpful to know
> the exact model number of the (smaller, 4-core) Xeon machine where the problem
> occurs (frequently) for you.  Doing
>
> $ cat /proc/cpuinfo
>
> may list this info under "model name" for each core.
>
> I've been able to reproduce the problem twice on Core i7 machines in a few days
> of trying, and it'd likely be easiest for me to understand and fix if it was easier
> for me to reproduce.
>
> On Thu, 31 Oct 2013, Paul Meurer wrote:
>
> Am 31.10.2013 um 01:15 schrieb Gary Byers <gb at clozure.com>:
>
>      On Wed, 30 Oct 2013, Paul Meurer wrote:
>            I run it now with --no-init and in the shell, with no difference. Immediate
>            failure with :consing in *features*, bogus objects etc. after several rounds
>            without :consing.
>
>      So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
>      anyone reading this won't be subjected to me doing so.
>
>      Oh well.
>
>      I was able to reproduce the problem by running your test 100 times,
> I am not able to provoke it at all on the MacBook, and I tried a lot.
>
>      so apparently
>      I won't be able to blame this on some aspect of your machine.  (Also unfortunate,
>      since my ability to diagnose problems that only occur on 16-core machines depends
>      on my ability to borrow such machines for a few months.)
> I think you can do without a 16-core machine. I am able to reproduce the failure
> quite reliably on an older 4-core machine with Xeon CPUs and SuSE, with slightly
> different code (perhaps to get the timing right):
> (dotimes (j 100)
>   (print (ccl::all-processes))
>   (dotimes (i 8)
>     (process-run-function
>      (format nil "getstring-~a-~a" j i)
>      (lambda (i)
>        (let ((list ()))
>          (dotimes (i 500000)
>            (push (getstring) list)))
>        (print i))
>      i))
>   (print (list :done j))
>   (sleep 1))
> If you really need a 16-core machine to debug this I can give you access to mine. :-)
>
>            My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a
>            recent CCL 1.9 (I did svn update + rebuild of everything yesterday).
>            I also observed that the problem goes away when I replace the constant string
>            in the library by a freshly allocated string:
>            char *getstring() {
>              int index;
>              char *buffer = (char *)calloc(100 + 1, sizeof(char));
>              for (index = 0; index < 100; index++) {
>                buffer[index] = 'a';
>              }
>              buffer[100] = '\0';
>              return buffer;
>            }
>            One should expect the strings in the Postgres library to be freshly allocated,
>            but nevertheless they behave like the constant string example.
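
For readers without the earlier messages in this thread, the two variants being compared can be sketched in C roughly as follows. This is an illustration only. The constant-string version of getstring is not shown in this message, so the names getstring_constant and getstring_fresh, the A5/A10 helper macros, and the static-string body are all assumptions about what "constant string" means here. The fresh variant follows the calloc code quoted above, with a NULL check added.

```c
#include <stdlib.h>
#include <string.h>

/* Helper macros (hypothetical) that build a 100-character literal. */
#define A5  "aaaaa"
#define A10 A5 A5
static const char CONSTANT_STRING[] = A10 A10 A10 A10 A10 A10 A10 A10 A10 A10;

/* Assumed shape of the constant-string variant: every call returns the
   same pointer into static storage, so nothing is allocated per call. */
const char *getstring_constant(void) {
    return CONSTANT_STRING;
}

/* Fresh-allocation variant, following the quoted code: each call returns
   a new heap buffer that the caller must free(). */
char *getstring_fresh(void) {
    char *buffer = calloc(100 + 1, sizeof(char));
    if (buffer == NULL) {
        return NULL;            /* allocation failed */
    }
    memset(buffer, 'a', 100);   /* calloc already zeroed buffer[100] */
    return buffer;
}
```

Both functions return the same 100 characters. What differs for a caller is pointer identity, ownership of the buffer, and the time each call spends in the allocator.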
>
>      It's unlikely that this change directly avoids the bug (whatever it is); it's more
>      likely that it affects timing (exactly what happens when.)  I don't yet know what
>      the bug is, but I think it's fair to characterize it as "timing-sensitive".
>      (For example: from the GC's point of view, it matters whether a thread is running
>      Lisp or foreign code when that thread is suspended by the GC.
>      The transition between Lisp and foreign code takes a few instructions, and if
>      a thread is suspended in the middle of that instruction sequence and the GC
>      misinterprets its state, very bad things like what you're seeing could occur.
>      That's not supposed to be possible, but something broadly similar seems to be
>      happening.)
> --
> Paul