[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql

Gary Byers gb at clozure.com
Thu Oct 31 13:45:13 PDT 2013


On 10/31/13 3:48 AM, Paul Meurer wrote:
>
> On 31.10.2013 at 01:15, Gary Byers <gb at clozure.com> wrote:
>
>> On Wed, 30 Oct 2013, Paul Meurer wrote:
>>> I run it now with --no-init and in the shell, with no difference. 
>>> Immediate failure with :consing in *features*,
>>> bogus objects etc. after several rounds without :consing.
>>
>> So, I can't rant and rave about the sorry state of 3rd-party CL 
>> libraries, and
>> anyone reading this won't be subjected to me doing so ?
>>
>> Oh well.
>>
>> I was able to reproduce the problem by running your test 100 times, 
>
> I am not able to provoke it at all on the MacBook, and I tried a lot.
>
>> so apparently
>> I won't be able to blame this on some aspect of your machine.  (Also 
>> unfortunate,
>> since my ability to diagnose problems that only occur on 16-core 
>> machines depends
>> on my ability to borrow such machines for a few months.)
>
> I think you can do without a 16-core machine. I am able to reproduce 
> the failure quite reliably on an older 4-core machine with Xeon CPUs 
> and SuSE, with slightly different code (perhaps to get the timing right):

For the last several years (since the Pentium II ?), x86 processors have treated x86 
instructions as a kind of bytecode that's dynamically translated
into code for a (largely undocumented) RISC-y microengine. Different x86 
implementations do this translation a little differently
(and may implement somewhat different microengines); some sequences of 
x86 instructions (bytecodes) may be treated as
a single micro operation in some implementations and not others, and the 
factors that govern this can be quite complex.
(Agner Fog has done a lot of research into this - as far as I know, it's 
all based on reverse-engineering - and maintains his
findings at

<http://www.agner.org/optimize/>

.)

This is potentially relevant here: if the GC misinterprets a thread's 
state when that thread is stopped at a particular x86 instruction (e.g., 
when entering or returning from foreign code), it may be the case that 
some x86 implementations never (or very rarely) see that particular 
instruction as a separate instruction and other implementations 
always/often do.

I tried 100 iterations of your original test on a Core i7 laptop, and 
was just about to conclude that I couldn't reproduce the
problem when it failed; I believe you if you say that you haven't been 
able to get it to fail on anything but a Xeon.  I'd be a little
more confident in this theory than I am if I understood why it ever 
failed on my laptop (does the translation behave differently
in some cases than in others on the same machine ?), but I suspect that 
if I read Agner Fog's papers carefully I'd understand that
a bit better.

I think that the Intel ATOM (which was used in netbooks a few years ago 
and which they're still trying to refine so that it could
be used on mobile devices) is different from both the Xeon and the 
Core-2/Core-i machines at this level, and am curious
about whether it fails on an ATOM-based netbook.  (I don't have any 
working Xeons, but still have a netbook and can use
something else to prop that door open ...)

> If you really need a 16-core machine to debug this I can give you 
> access to mine. :-)
Thanks, but I'd need physical access to the machine, possibly for many 
months (years)
after the problem's solved.

>> It's unlikely that this change directly avoids the bug (whatever it 
>> is); it's more
>> likely that it affects timing (exactly what happens when.)  I don't 
>> yet know what
>> the bug is, but I think that it's likely that it's fair to 
>> characterize the bug
>> as being "timing-sensitive".  (For example: from the GC's point of 
>> view, whether
>> a thread is running Lisp or foreign code when that thread is 
>> suspended by the GC.

If anyone actually cares, this sentence should probably read "... is 
suspended by the
GC is significant; it affects whether the values in the thread's 
registers should be
interpreted as 'references to Lisp objects' or as 'random bit patterns 
of no interest
to the GC'."

>> The transition between Lisp and foreign code takes a few 
>> instructions, and if
>> a thread is suspended in the middle of that instruction sequence and 
>> the GC
>> misinterprets its state, very bad things like what you're seeing could 
>> occur.
>> That's not supposed to be possible, but something broadly similar 
>> seems to be
>> happening.)

I was able to attach GDB to the crashed CCL after I provoked the crash 
on my laptop,
and what I can tell about what's happened is consistent with this theory.
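
To make the kind of window I mean concrete, here is a toy model in Python.
It is only a sketch and is not how CCL's kernel actually does this:
ToyThread, the save/mark/clobber steps and the in_foreign flag are all
made-up names for illustration. The thread can be suspended after any
step of the transition into foreign code, and the model reports whether
the GC would end up treating raw bits as a Lisp reference.

```python
from dataclasses import dataclass, field

LISP_REF = "lisp-object"   # stands in for a tagged pointer to a Lisp object
RAW_BITS = 0xDEADBEEF      # stands in for whatever foreign code leaves in a register


@dataclass
class ToyThread:
    # Hypothetical per-thread record, not CCL's real thread context.
    in_foreign: bool = False
    registers: list = field(default_factory=lambda: [LISP_REF])
    saved_frame: list = field(default_factory=list)


def save(t):
    # Spill the Lisp register values somewhere the GC can always find them.
    t.saved_frame = list(t.registers)


def mark(t):
    # Tell the GC that the registers no longer hold Lisp references.
    t.in_foreign = True


def clobber(t):
    # Foreign code starts running and overwrites the registers.
    t.registers = [RAW_BITS]


def gc_roots(t):
    # The GC always scans the saved frame; it scans the registers only if
    # it believes the thread is still running Lisp code.
    roots = list(t.saved_frame)
    if not t.in_foreign:
        roots += t.registers
    return roots


def suspend_points(steps):
    # Suspend a fresh thread after 0, 1, ..., len(steps) steps and record
    # whether the GC would treat raw bits as a Lisp reference.
    results = []
    for n in range(len(steps) + 1):
        t = ToyThread()
        for step in steps[:n]:
            step(t)
        results.append(RAW_BITS in gc_roots(t))
    return results


GOOD = [save, mark, clobber]   # flag is set before the registers change
BAD = [save, clobber, mark]    # registers change while the flag still says "Lisp"

print(suspend_points(GOOD))    # [False, False, False, False]
print(suspend_points(BAD))     # [False, False, True, False]
```

In the BAD ordering there is exactly one suspension point where the GC
goes wrong, which is the sense in which a bug like this would be
timing-sensitive: the thread has to be stopped in that one narrow window.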
>
> -- 
> Paul
>
