<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><span style="font-size: 13px; ">I ran the experiment you proposed.</span><br style="font-size: 13px; "><br style="font-size: 13px; "><span style="font-size: 13px; ">On the older Xeon 4-core machine, the crashes became somewhat less frequent still, but this </span><span style="font-size: 13px; ">might not be significant because I didn't run enough iterations. A crash occurs at most </span><span style="font-size: 13px; ">once every 50 iterations on average, which is perhaps not often enough for convenient debugging.</span><br style="font-size: 13px; "><br style="font-size: 13px; "><span style="font-size: 13px; ">Here are the specs:</span><br style="font-size: 13px; "><br style="font-size: 13px; "><span style="font-size: 13px; ">processor : 0</span><br style="font-size: 13px; "><span style="font-size: 13px; ">vendor_id : GenuineIntel</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cpu family : 6</span><br style="font-size: 13px; "><span style="font-size: 13px; ">model : 15</span><br style="font-size: 13px; "><span style="font-size: 13px; ">model name : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz</span><br style="font-size: 13px; "><span style="font-size: 13px; ">stepping : 6</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cpu MHz : 2327.528</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cache size : 4096 KB</span><br style="font-size: 13px; "><span style="font-size: 13px; ">physical id : 0</span><br style="font-size: 13px; "><span style="font-size: 13px; ">siblings : 2</span><br style="font-size: 13px; "><span style="font-size: 13px; ">core id : 0</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cpu cores : 2</span><br style="font-size: 13px; "><span style="font-size: 13px; ">fpu : yes</span><br style="font-size: 13px; "><span style="font-size: 13px; ">fpu_exception : yes</span><br style="font-size: 
13px; "><span style="font-size: 13px; ">cpuid level : 10</span><br style="font-size: 13px; "><span style="font-size: 13px; ">wp : yes</span><br style="font-size: 13px; "><span style="font-size: 13px; ">flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat </span><span style="font-size: 13px; ">pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm</span><br style="font-size: 13px; "><span style="font-size: 13px; ">bogomips : 4659.22</span><br style="font-size: 13px; "><span style="font-size: 13px; ">clflush size : 64</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cache_alignment : 64</span><br style="font-size: 13px; "><span style="font-size: 13px; ">address sizes : 36 bits physical, 48 bits virtual</span><br style="font-size: 13px; "><span style="font-size: 13px; ">power management:</span><br style="font-size: 13px; "><br style="font-size: 13px; "><br style="font-size: 13px; "><span style="font-size: 13px; ">On the 16-core machine (16 including hyperthreading), nothing seems to have changed. 
The </span><span style="font-size: 13px; ">latter has two CPUs with these specs:</span><br style="font-size: 13px; "><br style="font-size: 13px; "><span class="Apple-tab-span" style="white-space: pre; font-size: 13px; "> </span><span style="font-size: 13px; ">Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/</span><br style="font-size: 13px; "><br style="font-size: 13px; "><span style="font-size: 13px; ">or from /proc/cpuinfo:</span><br style="font-size: 13px; "><br style="font-size: 13px; "><span style="font-size: 13px; ">processor : 0</span><br style="font-size: 13px; "><span style="font-size: 13px; ">vendor_id : GenuineIntel</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cpu family : 6</span><br style="font-size: 13px; "><span style="font-size: 13px; ">model : 45</span><br style="font-size: 13px; "><span style="font-size: 13px; ">model name : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz</span><br style="font-size: 13px; "><span style="font-size: 13px; ">stepping : 7</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cpu MHz : 3301.000</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cache size : 10240 KB</span><br style="font-size: 13px; "><span style="font-size: 13px; ">physical id : 0</span><br style="font-size: 13px; "><span style="font-size: 13px; ">siblings : 8</span><br style="font-size: 13px; "><span style="font-size: 13px; ">core id : 0</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cpu cores : 4</span><br style="font-size: 13px; "><span style="font-size: 13px; ">apicid : 0</span><br style="font-size: 13px; "><span style="font-size: 13px; ">initial apicid : 0</span><br style="font-size: 13px; "><span style="font-size: 13px; ">fpu : yes</span><br style="font-size: 13px; "><span style="font-size: 13px; ">fpu_exception : yes</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cpuid level : 13</span><br style="font-size: 13px; "><span style="font-size: 13px; 
">wp : yes</span><br style="font-size: 13px; "><span style="font-size: 13px; ">flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov </span><span style="font-size: 13px; ">pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx </span><span style="font-size: 13px; ">pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good </span><span style="font-size: 13px; ">xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx </span><span style="font-size: 13px; ">smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes </span><span style="font-size: 13px; ">xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi </span><span style="font-size: 13px; ">flexpriority ept vpid</span><br style="font-size: 13px; "><span style="font-size: 13px; ">bogomips : 6583.99</span><br style="font-size: 13px; "><span style="font-size: 13px; ">clflush size : 64</span><br style="font-size: 13px; "><span style="font-size: 13px; ">cache_alignment : 64</span><br style="font-size: 13px; "><span style="font-size: 13px; ">address sizes : 46 bits physical, 48 bits virtual</span><br style="font-size: 13px; "><span style="font-size: 13px; ">power management:</span><br style="font-size: 13px; "><br style="font-size: 13px; "><span style="font-size: 13px; ">I also ran the code on an AMD Opteron machine, but no crashes occurred there, </span><span style="font-size: 13px; ">as far as I can see (after 50 iterations).</span><br style="font-size: 13px; "><br style="font-size: 13px; "><blockquote type="cite" style="font-family: Helvetica; font-size: 13px; ">I tried running your example in an infinite loop on another Core-i7 machine.<br>After about an hour, it crashed in the way that you describe. 
I poked around<br>a bit in GDB but wasn't sure what I was seeing; the C code in the CCL kernel<br>(including the GC) is usually compiled with the -O2 option, which makes it<br>run faster but makes debugging harder.<br><br>I figured that things would go faster if debugging was easier, so I rebuilt<br>the kernel without -O2 and tried again. It's been running for over 24 hours<br>at this point without incident.<br><br>Aside from being yet another example of the famous Heisenbug Uncertainty<br>Principle, this suggests that how the C code is compiled (by what version<br>of what compiler and at what optimization settings) may have something to<br>do with the problem (or at least the frequency at which it occurs.)<br><br>I'm curious as to whether building the kernel without -O2 causes things to<br>behave differently for you. To test this:<br><br>$ cd ccl/lisp-kernel/linuxx8664<br>Edit the Makefile in that directory, changing the line:<br><br>COPT = -O2<br><br>to<br><br>COPT = #-O2<br><br>$ make clean<br>$ make<br><br>If the problem still occurs for you with the same frequency that it's been occurring<br>on your Xeons, that'd tell us something (that the differences between the Xeon and<br>other x8664 machines have more to do with the frequency with which the problem<br>occurs than compiler issues do.) If that change masks or avoids the problem, that'd<br>tell us a bit less. In either case, if you can try this experiment it'd be good to<br>know the results.<br></blockquote><blockquote type="cite" style="font-size: 13px; ">If the processor difference remains a likely candidate, it'd be helpful to know<br>the exact model number of the (smaller, 4-core) Xeon machine where the problem<br>occurs (frequently) for you. 
Doing<br><br>$ cat /proc/cpuinfo<br><br>may list this info under "model name" for each core.<br><br>I've been able to reproduce the problem twice on Core i7 machines in a few days<br>of trying, and it'd likely be easiest for me to understand and fix if it was easier<br>for me to reproduce.<br><br>On Thu, 31 Oct 2013, Paul Meurer wrote:<br><br><blockquote type="cite">On 31 Oct 2013, at 01:15, Gary Byers &lt;<a href="mailto:gb@clozure.com">gb@clozure.com</a>&gt; wrote:<br><br> On Wed, 30 Oct 2013, Paul Meurer wrote:<br> I run it now with --no-init and in the shell, with no difference. Immediate failure with<br> :consing in *features*,<br> bogus objects etc. after several rounds without :consing.<br><br> So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and<br> anyone reading this won't be subjected to me doing so ?<br><br> Oh well.<br><br> I was able to reproduce the problem by running your test 100 times,<br>I am not able to provoke it at all on the MacBook, and I tried a lot.<br><br> so apparently<br> I won't be able to blame this on some aspect of your machine. (Also unfortunate,<br> since my ability to diagnose problems that only occur on 16-core machines depends<br> on my ability to borrow such machines for a few months.)<br>I think you can do without a 16-core machine. I am able to reproduce the failure quite reliably on an older 4-core<br>machine with Xeon CPUs and SuSE, with slightly different code (perhaps to get the timing right):<br>(dotimes (j 100)<br>&nbsp; (print (ccl::all-processes))<br>&nbsp; (dotimes (i 8)<br>&nbsp; &nbsp; (process-run-function<br>&nbsp; &nbsp; &nbsp;(format nil "getstring-~a-~a" j i)<br>&nbsp; &nbsp; &nbsp;(lambda (i)<br>&nbsp; &nbsp; &nbsp; &nbsp;(let ((list ()))<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;(dotimes (i 500000)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;(push (getstring) list)))<br>&nbsp; &nbsp; &nbsp; &nbsp;(print i))<br>&nbsp; &nbsp; &nbsp;i))<br>&nbsp; (print (list :done j))<br>&nbsp; (sleep 1))<br>If you really need a 16-core machine to debug this I can give you access to mine. 
:-)<br><br> My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a recent CCL<br> 1.9 (I did svn update +<br> rebuild of everything yesterday).<br> I also observed that the problem goes away when I replace the constant string in the<br> library by a freshly<br> allocated string:<br> char *getstring() {<br> &nbsp; int index;<br> &nbsp; char *buffer = (char *)calloc(100 + 1, sizeof(char));<br> &nbsp; for (index = 0; index &lt; 100; index++) {<br> &nbsp; &nbsp; &nbsp; buffer[index] = 'a';<br> &nbsp; &nbsp; }<br> &nbsp; buffer[100] = '\0';<br> &nbsp; return buffer;<br> }<br> One should expect the strings in the Postgres library to be freshly allocated, but<br> nevertheless they behave like<br> the constant string example.<br><br> It's unlikely that this change directly avoids the bug (whatever it is); it's more<br> likely that it affects timing (exactly what happens when.) I don't yet know what<br> the bug is, but I think that it's likely that it's fair to characterize the bug<br> as being "timing-sensitive". (For example: from the GC's point of view, whether<br> a thread is running Lisp or foreign code when that thread is suspended by the GC.<br> The transition between Lisp and foreign code takes a few instructions, and if<br> a thread is suspended in the middle of that instruction sequence and the GC<br> misinterprets its state, very bad things like what you're seeing could occur.<br> That's not supposed to be possible, but something broadly similar seems to be<br> happening.)<br>-- <br>Paul<br><br></blockquote></blockquote><br style="font-size: 13px; "><span style="font-size: 13px; ">-- </span><br style="font-size: 13px; "><span style="font-size: 13px; ">Paul</span><br style="font-size: 13px; "><br></body></html>