[Openmcl-devel] ccl crashes

Wed Jan 20 15:25:30 PST 2010

Same OS kernel on both machines?

Of the two symptoms that you reported - an unhandled segfault when running
simple code in a couple of threads and SQRT getting called on a negative
argument - with some other things apparently wrong - the second is a bit
harder to attribute to anything other than "random flakiness", since the
execution path there is pretty simple.  (Well, comparatively simple.)

In:

(defun f ()
   (let ((x 2.0d0))
     (dotimes (i most-positive-fixnum)
       (setf x (sqrt x)))
     x))

X should take on decreasing values from 2.0d0 to ~1.0d0 (and the real
value of X will eventually become so close to 1.0d0 as to be
indistinguishable from it, so we eventually keep setting X to (SQRT 1.0d0)).
For SQRT to be called with a negative argument (for the value of X to
have looked like a negative number) as it apparently was in your case,
something pretty bad/flaky must have happened.  Those bad things include:

   - a CCL compiler bug
   - a CCL GC bug (we're consing DOUBLE-FLOATs at a pretty high rate)
   - a subtle bug in the OS kernel.

The first two of these things are certainly possible, but it's very
hard for me to understand how either of them would be difficult to
reproduce and not affect lots of other things.  The third class of
things is also possible, but at first glance it may seem unlikely that
such a bug would affect CCL and apparently not affect anything else
running in the same environment.
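
One way to narrow this down would be a bounded variant of F that checks
X before each call to SQRT.  The following is only a sketch: F-CHECKED
and its LIMIT argument are names made up for illustration, not part of
the original report or of CCL.

```lisp
;; Hypothetical diagnostic variant of F; the name and LIMIT are assumptions.
(defun f-checked (&optional (limit 1000000))
  "Like F, but bounded by LIMIT and checking X before each SQRT call.
Returns X and NIL if every value stayed within [1.0d0, 2.0d0]; otherwise
returns the offending value and the iteration at which it was seen."
  (let ((x 2.0d0))
    (dotimes (i limit (values x nil))
      ;; In a correct run X only ever moves from 2.0d0 down to 1.0d0.
      (unless (<= 1.0d0 x 2.0d0)
        (return (values x i)))
      (setf x (sqrt x)))))
```

If the second value is ever non-NIL, the first value shows what X looked
like at the moment things went wrong, which is more to go on than an
error from SQRT alone.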

On second (... nth ...) thought, there are a few things that can go
wrong at the OS level here.  Most of what your F function does is:

  1.  Call the foreign #_sqrt function with the value of X; this
      returns (SQRT X) in a floating-point register; let's call that
      register "fp0".
  2.  Allocate a lisp DOUBLE-FLOAT object
  3.  Store the value in fp0 in that object.
  4.  Set X to the newly-allocated DOUBLE-FLOAT and go back to step 1
      (unless (= i most-positive-fixnum), which will happen some day ...)

The description of step 2 makes that sound a lot simpler than it is;
allocating a lisp object often just involves adjusting a pointer in
thread-local storage, but it occasionally involves doing much more
complicated things when there isn't enough room for the object in
previously-allocated memory; the exceptional case (allocate more
memory, GCing if necessary) is entered by forcing a machine exception.
We're basically trusting the OS's exception-handling mechanism to
preserve and restore the state of the thread so that (among other
things) when we return from the exception and get to step 3, the
value of the fp0 register is the same as it was before the exception.
The behavior that you describe could be caused by a failure of the
OS-level exception handling mechanism to restore fp0 correctly on
exception return.
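
The shape of that dependency can be sketched as a toy model in portable
Common Lisp.  This is not how CCL's allocator is implemented: every name
below is hypothetical, and a Lisp condition stands in for the machine
exception.  The point is only that the value computed in step 1 has to
survive whatever the slow path of step 2 does.

```lisp
;; Toy model only: none of these names are CCL internals.
(define-condition heap-exhausted (condition) ())

(defvar *free* 0)     ; next free slot in the current segment
(defvar *limit* 4)    ; size of the current segment
(defvar *refills* 0)  ; number of times the slow path has run

(defun refill-segment ()
  "Slow path: stands in for 'allocate more memory, GCing if necessary'."
  (incf *refills*)
  (setf *free* 0))

(defun allocate-slot ()
  "Fast path: bump a pointer.  When the segment is full, signal a
condition first; its handler stands in for the exception handler."
  (when (>= *free* *limit*)
    (signal 'heap-exhausted))
  (prog1 *free* (incf *free*)))

(defun box-sqrt (x)
  (let ((fp0 (sqrt x)))                 ; step 1: result lands in "fp0"
    (handler-bind ((heap-exhausted
                     (lambda (c)
                       (declare (ignore c))
                       (refill-segment))))
      (allocate-slot))                  ; step 2: may take the slow path
    fp0))                               ; steps 3-4: fp0 must be unchanged
```

In this model FP0 is an ordinary lexical variable, so nothing can
clobber it.  In the real thing it's a machine register, and whether it
survives the slow path is up to the OS.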

Some fairly recent versions of the Linux kernel have had a bug which
might also be involved (see

<https://bugs.launchpad.net/ubuntu/+source/linux/+bug/199072>).

That page seems to indicate that that bug was fixed in Ubuntu
releases quite a while ago (2008), but I suppose that it's possible
that it or something like it crept back into the kernel(s) that you're
running. I don't think that CCL has ever done anything to work around
that bug (besides snickering derisively).

As handwavy as this all is, attributing the error (SQRT getting called
on a negative argument when that shouldn't be possible) to "the OS
failing to restore one or more FP regs correctly on exception return
in some cases for some reason" strikes me as being a better
explanation than attributing it to a compiler/GC bug (or anything else
that CCL has much control over): compiler and GC bugs that could cause
the value of X to be negative as it was in your case tend to be
reproducible; an OS-level bug in code that saves/restores exception
context would only affect applications that depend on that stuff
working right.  (CCL expects that stuff to work; some other Lisp
implementations may as well, but I'm not sure of many other examples.)

If the bug referenced above hadn't been fixed, it'd pretty clearly
explain both the SQRT problem and the abrupt termination problem.

I've been running your 2-threaded example for about 10 hours on
a Fedora 12 box and for about 2 hours on a machine running Ubuntu
9.10 (both machines running 2.6.31* kernels).  On the machine
that's running Fedora, both threads have done about 370 billion
iterations of the loop (still a long, long way to go to get to
MOST-POSITIVE-FIXNUM), the GC's run about 5.7 million times, and
nothing unexpected or interesting has happened.  That doesn't
conclusively prove that nothing bad will happen due to some
CCL problem, but it does strongly suggest that the problems
that you're seeing may be attributable to "something else", and
the OS kernel seems like the most likely suspect.

On Wed, 20 Jan 2010, Mario S. Mommer wrote:

>
> Hi Gary,
>
> I think that random flakiness can be ruled out, as well as a lack of
> resources. I tried this on two machines that have two different
> motherboards, and chips of different vendors. These machines are fairly
> new and have 8G ram each. They have been working very reliably under
> load for months.
>
> Regards, and thanks,
>
>         Mario.