[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql
Paul Meurer
Paul.Meurer at uni.no
Tue Dec 10 11:07:17 PST 2013
On 09.12.2013, at 21:23, Gary Byers <gb at clozure.com> wrote:
> Rumors that I've become distracted and forgotten about this are at least
> slightly exaggerated.
>
> Could you (Paul) please try to reproduce the problem in the current trunk (r15975 or later) ? I haven't been able to do so, but I often found
> the problem harder to reproduce than you did.
I did test the newest trunk (after running (rebuild-ccl :full t), giving me Welcome to Clozure Common Lisp Version 1.10-dev-r15975M-trunk (LinuxX8664)!).
Unfortunately, the problem remains the same, I am getting the same two types of symptoms you are describing.
>
> On Thu, 7 Nov 2013, Gary Byers wrote:
>
>> [Sorry if I'm bombarding this list with a discussion that may not be of
>> wide interest. If you aren't interested in this and don't read this message,
>> rest assured that you aren't likely to be missing anything ...]
>>
>> The bug that Paul Meurer found seems to have two classes of symptoms:
>>
>> 1) a hard crash into the kernel debugger, usually due to a GC function
>> (set_n_bits()) being confused about the number of elements in a vector-like
>> object. That's typically a sign of memory corruption (unsafe lisp code
>> storing outside the bounds of an object.)
>>
>> 2) a (small) variety of memory faults and type errors signaled as lisp conditions.
>> (Sometimes these are followed by case (1) soon after.) These kinds of errors
>> can also be symptoms of the same kind of "memory corruption".
>>
>> When it was first reported, I suggested enabling some consistency checks in
>> the GC that often find problems like this near the time they occur. Paul
>> said that he tried enabling those checks but they didn't seem to detect
>> any problems but the crashes and faults continued to occur.
>>
>> I can reproduce the problems more reliably now than I'd been able to and can
>> look at things in GDB when they occur. One of my favorite class 2 symptoms
>> so far is:
>>
>>> Error: The value #<A Foreign Pointer #x7FFFD6C15690> is not of the expected type MACPTR.
>>> While executing: %GET-CSTRING, in process getstring1(71).
>>
>> which is sort of saying "that thing that should have been a foreign pointer
>> (aka MACPTR) wasn't of that type when I looked at it a few cycles ago. Never
>> mind the fact that it is of that type now." Of course that makes no sense,
>> and of course it "can't happen."
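>>
>> (As a reminder of what that printed representation means: in CCL a foreign
>> pointer is a MACPTR object, and #<A Foreign Pointer #x...> is just how such
>> an object prints. A trivial illustration at the REPL:
>>
>> (let ((p (ccl::%null-ptr)))         ; %null-ptr returns a MACPTR
>>   (list (typep p 'ccl::macptr) p))  ; => (T #<A Null Foreign Pointer>)
>>
>> So the error above is complaining that a value which prints as a foreign
>> pointer was, at the instant of the type check, not seen as one.)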
>>
>> This is vaguely reminiscent of a situation that arises on machines that have
>> separate I and D caches: we write some code to memory and execute it (without
>> properly synchronizing the caches first), and what actually gets executed
>> doesn't have anything to do with the code in memory. If that happens often
>> enough or has happened recently, it's possible to be suspicious of that issue;
>> otherwise, it's a very confusing situation.
>>
>> So: some number of cores are running Paul's test case; they're doing speculative
>> memory accesses and doing things out-of-order and doing all of the things that
>> a modern CPU does (all the while maintaining the illusion - mostly - that things
>> are happening in "program" or "architectural" order). Some thread needs to
>> run the GC, so that thread suspends all other threads. When a thread is suspended,
>> its "context" (architecturally visible register values) is stored someplace where
>> the GC can find it; the GC will use those register values to identify reachable
>> objects, and if it moves such objects it updates the register values in the saved
>> context info. When the GC finishes, it causes the suspended thread to resume
>> execution with the updated (architecturally visible) register values. The behavior
>> that we're seeing is as if the core that the thread was running on resumes the
>> speculative, out-of-order execution that it was doing (based on the old
>> register values), and we see things that can't happen happen.
>>
>> Some x86 instructions force things to synchronize with the "architectural" state.
>> When the GC suspends a thread, it atomically increments a counter associated with
>> the thread and only actually does the OS-level suspension if that counter changed
>> from 0 to 1. Only one thread at a time can increment that counter (that wasn't
>> always the case), but it still does an "atomic" incf of the counter because
>> the instructions involved are supposed to have this synchronizing effect. They
>> may do so, but the thread can resume its wacky out-of-order speculative execution
>> as soon as the counter has incremented and it won't actually be suspended until
>> some number of cycles later.
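>>
>> (To make the suspend-count protocol concrete, here is a deliberately simplified
>> Lisp sketch of the bookkeeping described above. It is only an illustration: the
>> real logic lives in the C kernel and uses an atomic increment rather than a lock,
>> and the names below - suspend-info, os-level-suspend, os-level-resume - are
>> invented for this example.)
>>
>> (defstruct suspend-info
>>   (count 0)
>>   (lock (ccl:make-lock "suspend-count")))
>>
>> (defun os-level-suspend (thread)
>>   ;; stand-in for the real OS-level suspension done by the kernel
>>   (format t "~&suspending ~s~%" thread))
>>
>> (defun os-level-resume (thread)
>>   (format t "~&resuming ~s~%" thread))
>>
>> (defun suspend-for-gc (thread info)
>>   ;; Only the 0 -> 1 transition triggers the actual suspension;
>>   ;; nested suspend requests just bump the count.
>>   (ccl:with-lock-grabbed ((suspend-info-lock info))
>>     (when (= (incf (suspend-info-count info)) 1)
>>       (os-level-suspend thread))))
>>
>> (defun resume-after-gc (thread info)
>>   ;; The matching 1 -> 0 transition resumes the thread.
>>   (ccl:with-lock-grabbed ((suspend-info-lock info))
>>     (when (= (decf (suspend-info-count info)) 0)
>>       (os-level-resume thread))))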
>>
>> I've tried to force things to synchronize after all other threads have suspended
>> and haven't seen the problem(s) occur since that change. That's not conclusive
>> (since none of this is entirely deterministic), but it seems encouraging.
>>
>> If this does explain the problems, it's a little odd that we haven't seen the
>> problem occur more often (assuming that we haven't.) I assume that a number
>> of things have to happen just right (or "just wrong") in order for the symptoms
>> above to occur.
>>
>> I'll keep looking at this as time permits and if the results hold I'll try to
>> check in the change in the next few days.
>>
>> On Thu, 7 Nov 2013, Gary Byers wrote:
>>
>>> I didn't have any luck reproducing the problem on that Xeon (or on another
>>> one that rme pointed out we had), but I think that I found a way to make
>>> the problem occur much more reliably than it has been.
>>> What I've been doing (based on your code) is something like:
>>> (dotimes (i n)    ; n is some large number
>>>   (dotimes (j m)  ; m is proportional to the number of cores
>>>     (process-run-function ...)))
>>> where each thread does 100000 iterations of a simple foreign function call.
>>> What generally happens here is that we create threads faster than they can
>>> run to completion; I think that the number of active threads gets up into
>>> the 100s in some cases, and the term "active" is a bit of a misnomer, since
>>> most of them sit idle while the few that get scheduled do enough consing to
>>> trigger the GC, where we spend most of our time.
>>> I changed the code to:
>>> (let* ((sem (make-semaphore)))
>>>   (dotimes (i n)
>>>     (dotimes (j m)
>>>       (process-run-function whatever
>>>         (lambda ()
>>>           (dotimes (k 100000)
>>>             ..)
>>>           (signal-semaphore sem))))
>>>     (dotimes (j m) (wait-on-semaphore sem))))
>>> e.g., to create M threads on each iteration of the loop and wait for them to
>>> run to completion before creating more. Those M threads should spend most
>>> of their time running (and entering or returning from foreign function calls
>>> when the GC runs), and that seems to trigger the problem more reliably.
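>>> Spelled out a bit more, a self-contained version of that pattern might look
>>> something like the sketch below. (This is only an illustration: run-test and
>>> the worker names are made up here, and make-string is just placeholder work
>>> standing in for the real foreign call, e.g. (getstring).)
>>> (defun run-test (n m &optional (iterations 100000))
>>>   ;; Create M worker threads, wait for all of them to finish, repeat N times.
>>>   (let* ((sem (ccl:make-semaphore)))
>>>     (dotimes (i n)
>>>       (dotimes (j m)
>>>         (ccl:process-run-function
>>>          (format nil "worker-~a-~a" i j)
>>>          (lambda ()
>>>            (dotimes (k iterations)
>>>              ;; placeholder work; the real test calls a foreign function here
>>>              (make-string 100 :initial-element #\a))
>>>            (ccl:signal-semaphore sem))))
>>>       ;; wait for all M workers from this round before starting the next one
>>>       (dotimes (j m)
>>>         (ccl:wait-on-semaphore sem)))))
>>> For example, (run-test 100 8) keeps roughly 8 workers busy at a time.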
>>> I'll try to look at this as time permits and I don't know how long it'll
>>> take me to see the problem when I do, but I think that I can at least reproduce
>>> the problem much more reliably than I've been able to.
>>> On Mon, 4 Nov 2013, Gary Byers wrote:
>>>> Thanks.
>>>> Coincidentally, clozure.com crashed today; the hosting service moved it to
>>>> a new machine (an 8-core Xeon). Hmmm ...
>>>> On Mon, 4 Nov 2013, Paul Meurer wrote:
>>>>> I did do the experiment you proposed.
>>>>> On the older Xeon 4-core machine, the crashes still become somewhat less frequent, but this might be insignificant
>>>>> because I didn't run enough iterations. A crash occurs not more than every 50th iteration on average. Perhaps not
>>>>> often enough for convenient debugging.
>>>>> Here are the specs:
>>>>> processor       : 0
>>>>> vendor_id       : GenuineIntel
>>>>> cpu family      : 6
>>>>> model           : 15
>>>>> model name      : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
>>>>> stepping        : 6
>>>>> cpu MHz         : 2327.528
>>>>> cache size      : 4096 KB
>>>>> physical id     : 0
>>>>> siblings        : 2
>>>>> core id         : 0
>>>>> cpu cores       : 2
>>>>> fpu             : yes
>>>>> fpu_exception   : yes
>>>>> cpuid level     : 10
>>>>> wp              : yes
>>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
>>>>> fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>>>>> bogomips        : 4659.22
>>>>> clflush size    : 64
>>>>> cache_alignment : 64
>>>>> address sizes   : 36 bits physical, 48 bits virtual
>>>>> power management:
>>>>> On the 16-core machine (16 cores including hyperthreading), nothing seems to have changed. The latter has two CPUs with these
>>>>> specs:
>>>>> Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/
>>>>> or from /proc/cpuinfo:
>>>>> processor       : 0
>>>>> vendor_id       : GenuineIntel
>>>>> cpu family      : 6
>>>>> model           : 45
>>>>> model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
>>>>> stepping        : 7
>>>>> cpu MHz         : 3301.000
>>>>> cache size      : 10240 KB
>>>>> physical id     : 0
>>>>> siblings        : 8
>>>>> core id         : 0
>>>>> cpu cores       : 4
>>>>> apicid          : 0
>>>>> initial apicid  : 0
>>>>> fpu             : yes
>>>>> fpu_exception   : yes
>>>>> cpuid level     : 13
>>>>> wp              : yes
>>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
>>>>> sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
>>>>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2
>>>>> x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>>>>> bogomips        : 6583.99
>>>>> clflush size    : 64
>>>>> cache_alignment : 64
>>>>> address sizes   : 46 bits physical, 48 bits virtual
>>>>> power management:
>>>>> I also ran the code on an AMD Opteron machine, but there no crashes occur, as far as I can see (after 50
>>>>> iterations).
>>>>>
>>>>> I tried running your example in an infinite loop on another Core-i7 machine.
>>>>> After about an hour, it crashed in the way that you describe. I poked around
>>>>> a bit in GDB but wasn't sure what I was seeing; the C code in the CCL kernel
>>>>> (including the GC) is usually compiled with the -O2 option, which makes it
>>>>> run faster but makes debugging harder.
>>>>>
>>>>> I figured that things would go faster if debugging was easier, so I rebuilt
>>>>> the kernel without -O2 and tried again. It's been running for over 24 hours
>>>>> at this point without incident.
>>>>>
>>>>> Aside from being yet another example of the famous Heisenbug Uncertainty
>>>>> Principle, this suggests that how the C code is compiled (by what version
>>>>> of what compiler and at what optimization settings) may have something to
>>>>> do with the problem (or at least the frequency at which it occurs.)
>>>>>
>>>>> I'm curious as to whether building the kernel without -O2 causes things to
>>>>> behave differently for you. To test this:
>>>>>
>>>>> $ cd ccl/lisp-kernel/linuxx8664
>>>>> Edit the Makefile in that directory, changing the line:
>>>>>
>>>>> COPT = -O2
>>>>>
>>>>> to
>>>>>
>>>>> COPT = #-O2
>>>>>
>>>>> $ make clean
>>>>> $ make
>>>>>
>>>>> If the problem still occurs for you with the same frequency that it's been occurring
>>>>> on your Xeons, that'd tell us something (namely, that the differences between the Xeon and
>>>>> other x8664 machines have more to do with the frequency with which the problem
>>>>> occurs than compiler issues do.) If that change masks or avoids the problem, that'd
>>>>> tell us a bit less. In either case, if you can try this experiment it'd be good to
>>>>> know the results.
>>>>>
>>>>> If the processor difference remains a likely candidate, it'd be helpful to know
>>>>> the exact model number of the (smaller, 4-core) Xeon machine where the problem
>>>>> occurs (frequently) for you. Doing
>>>>>
>>>>> $ cat /proc/cpuinfo
>>>>>
>>>>> may list this info under "model name" for each core.
>>>>>
>>>>> I've been able to reproduce the problem twice on Core i7 machines in a few days
>>>>> of trying, and it'd likely be easiest for me to understand and fix if it were easier
>>>>> for me to reproduce.
>>>>>
>>>>> On Thu, 31 Oct 2013, Paul Meurer wrote:
>>>>>
>>>>> On 31.10.2013, at 01:15, Gary Byers <gb at clozure.com> wrote:
>>>>>
>>>>>      On Wed, 30 Oct 2013, Paul Meurer wrote:
>>>>>           I run it now with --no-init and in the shell, with no difference. Immediate
>>>>>           failure with :consing in *features*,
>>>>>           bogus objects etc. after several rounds without :consing.
>>>>>
>>>>>      So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
>>>>>      anyone reading this won't be subjected to me doing so ...
>>>>>
>>>>>      Oh well.
>>>>>
>>>>>      I was able to reproduce the problem by running your test 100 times,
>>>>> I am not able to provoke it at all on the MacBook, and I tried a lot.
>>>>>
>>>>>      so apparently
>>>>>      I won't be able to blame this on some aspect of your machine. (Also unfortunate,
>>>>>      since my ability to diagnose problems that only occur on 16-core machines depends
>>>>>      on my ability to borrow such machines for a few months.)
>>>>> I think you can do without a 16-core machine. I am able to reproduce the failure quite reliably on an older 4-core
>>>>> machine with Xeon CPUs and SuSE, with slightly different code (perhaps to get the timing
>>>>> right):
>>>>> (dotimes (j 100)
>>>>>   (print (ccl::all-processes))
>>>>>   (dotimes (i 8)
>>>>>     (process-run-function
>>>>>      (format nil "getstring-~a-~a" j i)
>>>>>      (lambda (i)
>>>>>        (let ((list ()))
>>>>>          (dotimes (i 500000)
>>>>>            (push (getstring) list)))
>>>>>        (print i))
>>>>>      i))
>>>>>   (print (list :done j))
>>>>>   (sleep 1))
>>>>> If you really need a 16-core machine to debug this I can give you access to mine. :-)
>>>>>
>>>>>           My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a
>>>>>           recent CCL 1.9 (I did svn update + rebuild of everything yesterday).
>>>>>           I also observed that the problem goes away when I replace the constant string in the
>>>>>           library by a freshly allocated string:
>>>>>           #include <stdlib.h>   /* for calloc */
>>>>>
>>>>>           char *getstring() {
>>>>>             int index;
>>>>>             char *buffer = (char *)calloc(100 + 1, sizeof(char));
>>>>>             for (index = 0; index < 100; index++) {
>>>>>               buffer[index] = 'a';
>>>>>             }
>>>>>             buffer[100] = '\0';
>>>>>             return buffer;
>>>>>           }
>>>>>           One should expect the strings in the Postgres library to be freshly allocated,
>>>>>           but nevertheless they behave like the constant string example.
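>>>>>           (For reference, a minimal sketch of how such a function might be loaded and
>>>>>           called from CCL; the library name and the defun below are illustrative and
>>>>>           are not the actual binding used in the tests:)
>>>>>
>>>>>           ;; assumes the C file above was built into a shared library, e.g.
>>>>>           ;;   cc -shared -fPIC -o libgetstring.so getstring.c
>>>>>           (ccl:open-shared-library "libgetstring.so")
>>>>>
>>>>>           (defun getstring ()
>>>>>             ;; Call the foreign function and copy the returned C string into a
>>>>>             ;; fresh Lisp string with %get-cstring (the function in the type-error
>>>>>             ;; message above). The calloc'd buffer is not freed here.
>>>>>             (let ((p (ccl:external-call "getstring" :address)))
>>>>>               (ccl::%get-cstring p)))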
>>>>>
>>>>>      It's unlikely that this change directly avoids the bug (whatever it is); it's more
>>>>>      likely that it affects timing (exactly what happens when.) I don't yet know what
>>>>>      the bug is, but I think that it's fair to characterize the bug
>>>>>      as being "timing-sensitive". (For example: from the GC's point of view, it matters
>>>>>      whether a thread is running Lisp or foreign code when that thread is suspended by the GC.
>>>>>      The transition between Lisp and foreign code takes a few instructions, and if
>>>>>      a thread is suspended in the middle of that instruction sequence and the GC
>>>>>      misinterprets its state, very bad things like what you're seeing could occur.
>>>>>      That's not supposed to be possible, but something broadly similar seems to be
>>>>>      happening.)
>>>>> --
>>>>> Paul
--
Paul