[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql
Paul Meurer
Paul.Meurer at uni.no
Tue Dec 10 11:07:17 PST 2013
On 09.12.2013, at 21:23, Gary Byers <gb at clozure.com> wrote:
> Rumors that I've become distracted and forgotten about this are at least
> slightly exaggerated.
>
> Could you (Paul) please try to reproduce the problem in the current trunk (r15975 or later) ? I haven't been able to do so, but I often found
> the problem harder to reproduce than you did.
I did test the newest trunk (after running (rebuild-ccl :full t), giving me Welcome to Clozure Common Lisp Version 1.10-dev-r15975M-trunk (LinuxX8664)!).
Unfortunately, the problem remains the same, I am getting the same two types of symptoms you are describing.
>
> On Thu, 7 Nov 2013, Gary Byers wrote:
>
>> [Sorry if I'm bombarding this list with a discussion that may not be of
>> wide interest. If you aren't interested in this and don't read this message,
>> rest assured that you aren't likely to be missing anything ...]
>>
>> The bug that Paul Meurer found seems to have two classes of symptoms:
>>
>> 1) a hard crash into the kernel debugger, usually due to a GC function
>> (set_n_bits()) being confused about the number of elements in a vector-like
>> object. That's typically a sign of memory corruption (unsafe lisp code
>> storing outside the bounds of an object.)
>>
>> 2) a (small) variety of memory faults and type errors signaled as lisp conditions.
>> (Sometimes these are followed by case (1) soon after.) These kinds of errors
>> can also be symptoms of the same kind of "memory corruption".
>>
>> When it was first reported, I suggested enabling some consistency checks in
>> the GC that often find problems like this near the time they occur. Paul
>> said that he tried enabling those checks but they didn't seem to detect
>> any problems but the crashes and faults continued to occur.
>>
>> I can reproduce the problems more reliably now than I'd been able to and can
>> look at things in GDB when they occur. One of my favorite class 2 symptoms
>> so far is:
>>
>>> Error: The value #<A Foreign Pointer #x7FFFD6C15690> is not of the expected type MACPTR.
>>> While executing: %GET-CSTRING, in process getstring1(71).
>>
>> which is sort of saying "that thing that should have been a foreign pointer
>> (aka MACPTR) wasn't of that type when I looked at it a few cycles ago. Never
>> mind the fact that it is of that type now." Of course that makes no sense,
>> and of course it "can't happen."
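>>
>> (As a reminder of what that printed representation means: in CCL a foreign
>> pointer is a MACPTR object, and #<A Foreign Pointer #x...> is just how such
>> an object prints. A trivial illustration at the REPL:
>>
>> (let ((p (ccl::%null-ptr)))         ; %null-ptr returns a MACPTR
>>   (list (typep p 'ccl::macptr) p))  ; => (T #<A Null Foreign Pointer>)
>>
>> So the error above is complaining that a value which prints as a foreign
>> pointer was, at the instant of the type check, not seen as one.)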
>>
>> This is vaguely reminiscent of a situation that arises on machines that have
>> separate I and D caches: we write some code to memory and execute it (without
>> properly synchronizing the caches first), and what actually gets executed
>> doesn't have anything to do with the code in memory. If that happens often
>> enough or has happened recently, it's possible to be suspicious of that issue;
>> otherwise, it's a very confusing situation.
>>
>> So: some number of cores are running Paul's test case; they're doing speculative
>> memory accesses and doing things out-of-order and doing all of the things that
>> a modern CPU does (all the while maintaining the illusion - mostly - that things
>> are happening in "program" or "architectural" order). Some thread needs to
>> run the GC, so that thread suspends all other threads. When a thread is suspended,
>> its "context" (architecturally visible register values) is stored someplace where
>> the GC can find it; the GC will use those register values to identify reachable
>> objects, and if it moves such objects it updates the register values in the saved
>> context info. When the GC finishes, it causes the suspended thread to resume
>> execution with the updated (architecturally visible) register values. The behavior
>> that we're seeing is as if the core that the thread was running on resumes the
>> speculative, out-of-order execution that it was doing (based on the old
>> register values), and we see things that can't happen happen.
>>
>> Some x86 instructions force things to synchronize with the "architectural" state.
>> When the GC suspends a thread, it atomically increments a counter associated with
>> the thread and only actually does the OS-level suspension if that counter changed
>> from 0 to 1. Only one thread at a time can increment that counter (that wasn't
>> always the case), but it still does an "atomic" incf of the counter because
>> the instructions involved are supposed to have this synchronizing effect. They
>> may do so, but the thread can resume its wacky out-of-order speculative execution
>> as soon as the counter has incremented and it won't actually be suspended until
>> some number of cycles later.
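>>
>> (To make the suspend-count protocol concrete, here is a deliberately simplified
>> Lisp sketch of the bookkeeping described above. It is only an illustration: the
>> real logic lives in the C kernel and uses an atomic increment rather than a lock,
>> and the names below - suspend-info, os-level-suspend, os-level-resume - are
>> invented for this example.)
>>
>> (defstruct suspend-info
>>   (count 0)
>>   (lock (ccl:make-lock "suspend-count")))
>>
>> (defun os-level-suspend (thread)
>>   ;; stand-in for the real OS-level suspension done by the kernel
>>   (format t "~&suspending ~s~%" thread))
>>
>> (defun os-level-resume (thread)
>>   (format t "~&resuming ~s~%" thread))
>>
>> (defun suspend-for-gc (thread info)
>>   ;; Only the 0 -> 1 transition triggers the actual suspension;
>>   ;; nested suspend requests just bump the count.
>>   (ccl:with-lock-grabbed ((suspend-info-lock info))
>>     (when (= (incf (suspend-info-count info)) 1)
>>       (os-level-suspend thread))))
>>
>> (defun resume-after-gc (thread info)
>>   ;; The matching 1 -> 0 transition resumes the thread.
>>   (ccl:with-lock-grabbed ((suspend-info-lock info))
>>     (when (= (decf (suspend-info-count info)) 0)
>>       (os-level-resume thread))))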
>>
>> I've tried to force things to synchronize after all other threads have suspended
>> and haven't seen the problem(s) occur since that change. That's not conclusive
>> (since none of this is entirely deterministic), but it seems encouraging.
>>
>> If this does explain the problems, it's a little odd that we haven't seen the
>> problem occur more often (assuming that we haven't.) I assume that a number
>> of things have to happen just right (or "just wrong") in order for the symptoms
>> above to occur.
>>
>> I'll keep looking at this as time permits and if the results hold I'll try to
>> check in the change in the next few days.
>>
>> On Thu, 7 Nov 2013, Gary Byers wrote:
>>
>>> I didn't have any luck reproducing the problem on that Xeon (or on another
>>> one that rme pointed out we had), but I think that I found a way to make
>>> the problem occur much more reliably than it has been.
>>> What I've been doing (based on your code) is something like:
>>> (dotimes (i n)    ; n is some large number
>>>   (dotimes (j m)  ; m is proportional to the number of cores
>>>     (process-run-function ...)))
>>> where each thread does 100000 iterations of a simple foreign function call.
>>> What generally happens here is that we create threads faster than they can
>>> run to completion; I think that the number of active threads gets up into
>>> the 100s in some cases, and the term "active" is a bit of a misnomer, since
>>> most of them sit idle while the few that get scheduled do enough consing to
>>> trigger the GC, where we spend most of our time.
>>> I changed the code to:
>>> (let* ((sem (make-semaphore)))
>>>   (dotimes (i n)
>>>     (dotimes (j m)
>>>       (process-run-function whatever
>>>         (lambda ()
>>>           (dotimes (k 100000)
>>>             ..)
>>>           (signal-semaphore sem))))
>>>     (dotimes (j m) (wait-on-semaphore sem))))
>>> e.g., to create M threads on each iteration of the loop and wait for them to
>>> run to completion before creating more. Those M threads should spend most
>>> of their time running (and entering or returning from foreign function calls
>>> when the GC runs), and that seems to trigger the problem more reliably.
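>>> Spelled out a bit more, a self-contained version of that pattern might look
>>> something like the sketch below. (This is only an illustration: run-test and
>>> the worker names are made up here, and make-string is just placeholder work
>>> standing in for the real foreign call, e.g. (getstring).)
>>> (defun run-test (n m &optional (iterations 100000))
>>>   ;; Create M worker threads, wait for all of them to finish, repeat N times.
>>>   (let* ((sem (ccl:make-semaphore)))
>>>     (dotimes (i n)
>>>       (dotimes (j m)
>>>         (ccl:process-run-function
>>>          (format nil "worker-~a-~a" i j)
>>>          (lambda ()
>>>            (dotimes (k iterations)
>>>              ;; placeholder work; the real test calls a foreign function here
>>>              (make-string 100 :initial-element #\a))
>>>            (ccl:signal-semaphore sem))))
>>>       ;; wait for all M workers from this round before starting the next one
>>>       (dotimes (j m)
>>>         (ccl:wait-on-semaphore sem)))))
>>> For example, (run-test 100 8) keeps roughly 8 workers busy at a time.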
>>> I'll try to look at this as time permits and I don't know how long it'll
>>> take me to see the problem when I do, but I think that I can at least reproduce
>>> the problem much more reliably than I've been able to.
>>> On Mon, 4 Nov 2013, Gary Byers wrote:
>>>> Thanks.
>>>> Coincidentally, clozure.com crashed today; the hosting service moved it to
>>>> a new machine (an 8-core Xeon). Hmmm ...
>>>> On Mon, 4 Nov 2013, Paul Meurer wrote:
>>>>> I did do the experiment you proposed.
>>>>> On the older Xeon 4-core machine, the crashes still become somewhat less frequent, but this might be insignificant
>>>>> because I didn't run enough iterations. A crash occurs not more than every 50th iteration on average. Perhaps not
>>>>> often enough for convenient debugging.
>>>>> Here are the specs:
>>>>> processor       : 0
>>>>> vendor_id       : GenuineIntel
>>>>> cpu family      : 6
>>>>> model           : 15
>>>>> model name      : Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
>>>>> stepping        : 6
>>>>> cpu MHz         : 2327.528
>>>>> cache size      : 4096 KB
>>>>> physical id     : 0
>>>>> siblings        : 2
>>>>> core id         : 0
>>>>> cpu cores       : 2
>>>>> fpu             : yes
>>>>> fpu_exception   : yes
>>>>> cpuid level     : 10
>>>>> wp              : yes
>>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
>>>>> fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>>>>> bogomips        : 4659.22
>>>>> clflush size    : 64
>>>>> cache_alignment : 64
>>>>> address sizes   : 36 bits physical, 48 bits virtual
>>>>> power management:
>>>>> On the 16-core machine (16 cores including hyperthreading), nothing seems to have changed. The latter has two CPUs with these
>>>>> specs:
>>>>> Intel Xeon E5 4-Core - E5-2643 3.30GHz 10MB LGA2011 8.0GT/
>>>>> or from /proc/cpuinfo:
>>>>> processor       : 0
>>>>> vendor_id       : GenuineIntel
>>>>> cpu family      : 6
>>>>> model           : 45
>>>>> model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
>>>>> stepping        : 7
>>>>> cpu MHz         : 3301.000
>>>>> cache size      : 10240 KB
>>>>> physical id     : 0
>>>>> siblings        : 8
>>>>> core id         : 0
>>>>> cpu cores       : 4
>>>>> apicid          : 0
>>>>> initial apicid  : 0
>>>>> fpu             : yes
>>>>> fpu_exception   : yes
>>>>> cpuid level     : 13
>>>>> wp              : yes
>>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
>>>>> sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
>>>>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2
>>>>> x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>>>>> bogomips        : 6583.99
>>>>> clflush size    : 64
>>>>> cache_alignment : 64
>>>>> address sizes   : 46 bits physical, 48 bits virtual
>>>>> power management:
>>>>> I also ran the code on an AMD Opteron machine, but there no crashes occur, as far as I can see (after 50
>>>>> iterations).
>>>>>
>>>>> I tried running your example in an infinite loop on another Core-i7 machine.
>>>>> After about an hour, it crashed in the way that you describe. I poked around
>>>>> a bit in GDB but wasn't sure what I was seeing; the C code in the CCL kernel
>>>>> (including the GC) is usually compiled with the -O2 option, which makes it
>>>>> run faster but makes debugging harder.
>>>>>
>>>>> I figured that things would go faster if debugging was easier, so I rebuilt
>>>>> the kernel without -O2 and tried again. It's been running for over 24 hours
>>>>> at this point without incident.
>>>>>
>>>>> Aside from being yet another example of the famous Heisenbug Uncertainty
>>>>> Principle, this suggests that how the C code is compiled (by what version
>>>>> of what compiler and at what optimization settings) may have something to
>>>>> do with the problem (or at least the frequency at which it occurs.)
>>>>>
>>>>> I'm curious as to whether building the kernel without -O2 causes things to
>>>>> behave differently for you. To test this:
>>>>>
>>>>> $ cd ccl/lisp-kernel/linuxx8664
>>>>> Edit the Makefile in that directory, changing the line:
>>>>>
>>>>> COPT = -O2
>>>>>
>>>>> to
>>>>>
>>>>> COPT = #-O2
>>>>>
>>>>> $ make clean
>>>>> $ make
>>>>>
>>>>> If the problem still occurs for you with the same frequency that it's been occurring
>>>>> on your Xeons, that'd tell us something (namely, that the differences between the Xeon and
>>>>> other x8664 machines have more to do with the frequency with which the problem
>>>>> occurs than compiler issues do.) If that change masks or avoids the problem, that'd
>>>>> tell us a bit less. In either case, if you can try this experiment it'd be good to
>>>>> know the results.
>>>>>
>>>>> If the processor difference remains a likely candidate, it'd be helpful to know
>>>>> the exact model number of the (smaller, 4-core) Xeon machine where the problem
>>>>> occurs (frequently) for you. Doing
>>>>>
>>>>> $ cat /proc/cpuinfo
>>>>>
>>>>> may list this info under "model name" for each core.
>>>>>
>>>>> I've been able to reproduce the problem twice on Core i7 machines in a few days
>>>>> of trying, and it'd likely be easiest for me to understand and fix if it were easier
>>>>> for me to reproduce.
>>>>>
>>>>> On Thu, 31 Oct 2013, Paul Meurer wrote:
>>>>>
>>>>> On 31.10.2013, at 01:15, Gary Byers <gb at clozure.com> wrote:
>>>>>
>>>>>      On Wed, 30 Oct 2013, Paul Meurer wrote:
>>>>>           I run it now with --no-init and in the shell, with no difference. Immediate
>>>>>           failure with :consing in *features*,
>>>>>           bogus objects etc. after several rounds without :consing.
>>>>>
>>>>>      So, I can't rant and rave about the sorry state of 3rd-party CL libraries, and
>>>>>      anyone reading this won't be subjected to me doing so ...
>>>>>
>>>>>      Oh well.
>>>>>
>>>>>      I was able to reproduce the problem by running your test 100 times,
>>>>> I am not able to provoke it at all on the MacBook, and I tried a lot.
>>>>>
>>>>>      so apparently
>>>>>      I won't be able to blame this on some aspect of your machine. (Also unfortunate,
>>>>>      since my ability to diagnose problems that only occur on 16-core machines depends
>>>>>      on my ability to borrow such machines for a few months.)
>>>>> I think you can do without a 16-core machine. I am able to reproduce the failure quite reliably on an older 4-core
>>>>> machine with Xeon CPUs and SuSE, with slightly different code (perhaps to get the timing
>>>>> right):
>>>>> (dotimes (j 100)
>>>>>   (print (ccl::all-processes))
>>>>>   (dotimes (i 8)
>>>>>     (process-run-function
>>>>>      (format nil "getstring-~a-~a" j i)
>>>>>      (lambda (i)
>>>>>        (let ((list ()))
>>>>>          (dotimes (i 500000)
>>>>>            (push (getstring) list)))
>>>>>        (print i))
>>>>>      i))
>>>>>   (print (list :done j))
>>>>>   (sleep 1))
>>>>> If you really need a 16-core machine to debug this I can give you access to mine. :-)
>>>>>
>>>>>           My machine has 16 true cores and hyperthreading; I am running CentOS 6.0, and a
>>>>>           recent CCL 1.9 (I did svn update + rebuild of everything yesterday).
>>>>>           I also observed that the problem goes away when I replace the constant string in the
>>>>>           library by a freshly allocated string:
>>>>>           #include <stdlib.h>   /* for calloc */
>>>>>
>>>>>           char *getstring() {
>>>>>             int index;
>>>>>             char *buffer = (char *)calloc(100 + 1, sizeof(char));
>>>>>             for (index = 0; index < 100; index++) {
>>>>>               buffer[index] = 'a';
>>>>>             }
>>>>>             buffer[100] = '\0';
>>>>>             return buffer;
>>>>>           }
>>>>>           One should expect the strings in the Postgres library to be freshly allocated,
>>>>>           but nevertheless they behave like the constant string example.
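>>>>>           (For reference, a minimal sketch of how such a function might be loaded and
>>>>>           called from CCL; the library name and the defun below are illustrative and
>>>>>           are not the actual binding used in the tests:)
>>>>>
>>>>>           ;; assumes the C file above was built into a shared library, e.g.
>>>>>           ;;   cc -shared -fPIC -o libgetstring.so getstring.c
>>>>>           (ccl:open-shared-library "libgetstring.so")
>>>>>
>>>>>           (defun getstring ()
>>>>>             ;; Call the foreign function and copy the returned C string into a
>>>>>             ;; fresh Lisp string with %get-cstring (the function in the type-error
>>>>>             ;; message above). The calloc'd buffer is not freed here.
>>>>>             (let ((p (ccl:external-call "getstring" :address)))
>>>>>               (ccl::%get-cstring p)))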
>>>>>
>>>>>      It's unlikely that this change directly avoids the bug (whatever it is); it's more
>>>>>      likely that it affects timing (exactly what happens when.) I don't yet know what
>>>>>      the bug is, but I think that it's fair to characterize the bug
>>>>>      as being "timing-sensitive". (For example: from the GC's point of view, it matters
>>>>>      whether a thread is running Lisp or foreign code when that thread is suspended by the GC.
>>>>>      The transition between Lisp and foreign code takes a few instructions, and if
>>>>>      a thread is suspended in the middle of that instruction sequence and the GC
>>>>>      misinterprets its state, very bad things like what you're seeing could occur.
>>>>>      That's not supposed to be possible, but something broadly similar seems to be
>>>>>      happening.)
>>>>> --
>>>>> Paul
--
Paul