<html>
<head>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 10/31/13 3:48 AM, Paul Meurer wrote:<br>
</div>
<blockquote cite="mid:159BB3EA-7976-4389-A64B-A5CC1E14C3A3@uni.no" type="cite"><br>
<div>
<div>On 31.10.2013 at 01:15, Gary Byers &lt;<a moz-do-not-send="true" href="mailto:gb@clozure.com">gb@clozure.com</a>&gt; wrote:</div>
<br>
<blockquote type="cite">On Wed, 30 Oct 2013, Paul Meurer wrote:<br>
<blockquote type="cite">I run it now with --no-init and in the
shell, with no difference. Immediate failure with :consing
in *features*,<br>
bogus objects etc. after several rounds without :consing.<br>
</blockquote>
<br>
So, I can't rant and rave about the sorry state of 3rd-party
CL libraries, and<br>
anyone reading this won't be subjected to me doing so ?<br>
<br>
Oh well.<br>
<br>
I was able to reproduce the problem by running your test 100
times, </blockquote>
<div><br>
</div>
<div>I am not able to provoke it at all on the MacBook, and I
tried a lot.</div>
<br>
<blockquote type="cite">so apparently<br>
I won't be able to blame this on some aspect of your machine.
(Also unfortunate,<br>
since my ability to diagnose problems that only occur on
16-core machines depends<br>
on my ability to borrow such machines for a few months.)<br>
</blockquote>
<div><br>
</div>
<div>I think you can do without a 16-core machine. I am able to
reproduce the failure quite reliably on an older 4-core
machine with Xeon CPUs and SuSE, with slightly different code
(perhaps to get the timing right):</div>
</div>
</blockquote>
<br>
For the last several years (since the Pentium II ?) x86 processors have treated x86
instructions as a kind of bytecode that's dynamically translated<br>
into code for a (largely undocumented) RISC-y microengine.
Different x86 implementations do this translation a little
differently<br>
(and may implement somewhat different microengines); some sequences
of x86 instructions (bytecodes) may be treated as<br>
a single micro operation in some implementations and not others, and
the factors that govern this can be quite complex.<br>
(Agner Fog has done a lot of research into this - as far as I know,
it's all based on reverse-engineering - and maintains his<br>
findings at<br>
<br>
<a class="moz-txt-link-rfc2396E" href="http://www.agner.org/optimize/">&lt;http://www.agner.org/optimize/&gt;</a><br>
<br>
.)<br>
<br>
This is potentially relevant here: if the GC misinterprets a
thread's state when that thread is stopped at<br>
a particular x86 instruction (e.g., when entering or returning from
foreign code), it may be the case that some x86 implementations<br>
never (or very rarely) see that particular instruction as a separate
instruction and other implementations always/often do.<br>
<br>
I tried 100 iterations of your original test on a Core i7 laptop,
and was just about to conclude that I couldn't reproduce the<br>
problem when it failed; I believe you if you say that you haven't
been able to get it to fail on anything but a Xeon. I'd be a little<br>
more confident in this theory than I am if I understood why it ever
failed on my laptop (does the translation behave differently<br>
in some cases than in others on the same machine ?), but I suspect
that if I read Agner Fog's papers carefully I'd understand that<br>
a bit better.<br>
<br>
I think that the Intel ATOM (which was used in netbooks a few years
ago and which they're still trying to refine so that it could<br>
be used on mobile devices) is different from both the Xeon and the
Core-2/Core-i machines at this level, and am curious<br>
about whether it fails on an ATOM-based netbook. (I don't have any
working Xeons, but still have a netbook and can use<br>
something else to prop that door open ...)<br>
<br>
<blockquote cite="mid:159BB3EA-7976-4389-A64B-A5CC1E14C3A3@uni.no" type="cite">
<div>If you really need a 16-core machine to debug this I can give
you access to mine. :-)<br>
</div>
</blockquote>
Thanks, but I'd need physical access to the machine, possibly for
many months (years)<br>
after the problem's solved.<br>
<br>
<blockquote cite="mid:159BB3EA-7976-4389-A64B-A5CC1E14C3A3@uni.no" type="cite">
<div>
<blockquote type="cite">It's unlikely that this change directly
avoids the bug (whatever it is); it's more<br>
likely that it affects timing (exactly what happens when.) I
don't yet know what<br>
the bug is, but I think that it's likely that it's fair to
characterize the bug<br>
as being "timing-sensitive". (For example: from the GC's
point of view, whether<br>
a thread is running Lisp or foreign code when that thread is
suspended by the GC.<br>
</blockquote>
</div>
</blockquote>
<br>
If anyone actually cares, this sentence should probably read "... is
suspended by the<br>
GC is significant; it affects whether the values in the thread's
registers should be<br>
interpreted as 'references to Lisp objects' or as 'random bit
patterns of no interest<br>
to the GC'."<br>
<br>
<blockquote cite="mid:159BB3EA-7976-4389-A64B-A5CC1E14C3A3@uni.no" type="cite">
<div>
<blockquote type="cite">The transition between Lisp and foreign
code takes a few instructions, and if<br>
a thread is suspended in the middle of that instruction
sequence and the GC<br>
misinterprets its state, very bad things like what you're
seeing could occur.<br>
That's not supposed to be possible, but something broadly
similar seems to be<br>
happening.)<br>
</blockquote>
</div>
</blockquote>
<br>
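To make that timing window concrete, here is a minimal sketch in Python. It is a toy model under stated assumptions, not CCL's kernel code: <code>ThreadState</code>, <code>gc_roots</code>, <code>true_roots</code>, <code>enter_foreign_steps</code> and <code>misread_points</code> are hypothetical names, the single register <code>r0</code> and the two-step transition are invented for illustration, and the "GC" simply trusts a mode flag.<br>
<br>
<pre>
```python
from dataclasses import dataclass, field

# Hypothetical model (an assumption, not CCL's real data structures):
# a thread has a mode flag that the GC consults, plus registers whose
# contents may or may not be references to Lisp objects.
LISP, FOREIGN = "lisp", "foreign"


@dataclass
class ThreadState:
    mode: str = LISP
    # register name mapped to (value, is_lisp_reference)
    registers: dict = field(default_factory=lambda: {"r0": ("obj-A", True)})


def gc_roots(thread):
    """Roots as seen by a GC that trusts the mode flag alone."""
    if thread.mode == FOREIGN:
        return set()  # registers treated as random bit patterns
    return {value for value, _ in thread.registers.values()}


def true_roots(thread):
    """Roots the GC should have seen, given what the registers really hold."""
    return {value for value, is_ref in thread.registers.values() if is_ref}


def enter_foreign_steps():
    """Two invented 'instructions' making up a Lisp-to-foreign transition."""
    def set_flag(t): t.mode = FOREIGN
    def clobber(t): t.registers["r0"] = (0xDEADBEEF, False)
    return [set_flag, clobber]


def misread_points(steps):
    """Suspend the thread after each prefix of the steps and report the
    suspension points at which the GC's view differs from the truth."""
    bad = []
    for stop in range(len(steps) + 1):
        t = ThreadState()
        for step in steps[:stop]:
            step(t)
        if gc_roots(t) != true_roots(t):
            bad.append(stop)
    return bad


if __name__ == "__main__":
    steps = enter_foreign_steps()
    # Flag first: a live Lisp reference is ignored mid-transition.
    print(misread_points(steps))
    # Registers first: raw bits are taken for a Lisp reference mid-transition.
    print(misread_points(list(reversed(steps))))
```
</pre>
In this toy model either ordering of the two steps leaves one suspension point where the flag and the registers disagree: in one order a live object is missed, in the other a bogus object is seen. Whether a real thread can actually be observed at such a point is exactly the timing question discussed above.<br>
<br>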
I was able to attach GDB to the crashed CCL after I provoked the
crash on my laptop, <br>
and what I can tell about what's happened is consistent with this
theory.<br>
<blockquote cite="mid:159BB3EA-7976-4389-A64B-A5CC1E14C3A3@uni.no" type="cite"><br>
<div>
<span class="Apple-style-span" style="border-collapse: separate;
font-family: 'Lucida Grande'; border-spacing: 0px; ">-- <br>
Paul</span>
</div>
<br>
</blockquote>
<br>
</body>
</html>