[Openmcl-devel] openmcl quits without backtrace

Gary Byers gb at clozure.com
Thu Jul 28 22:40:52 UTC 2005



On Thu, 28 Jul 2005, Joshua Moody wrote:

> I work mostly in MCL, but often I switch to OpenMCL to track down bugs and/or 
> to note efficiency issues.   I've seen a strange error in MCL with respect to 
> this form/closure:
>
> (let ((copy-htable (make-hash-table #+Ignore :rehash-size #+Ignore 1.5)))
>  (defmethod copy-top-level (ORIGINAL-THING)
>    (clrhash COPY-HTABLE)
>    (copy-one ORIGINAL-THING COPY-HTABLE)))
>
> This is part of the library that my group uses.  The trouble is that the 
> copy-htable is getting smashed by something - probably GC.  This error occurs 
> after ~18 hrs of computation.  So I fired up OpenMCL to see if I could 
> reproduce the error there.  It takes only 10 minutes for OpenMCL to crash and 
> it does so without errors or a backtrace - the process just dies.  I have no 
> idea whether the two problems are related, but I usually have little (to no) 
> trouble running the offending code (and other code) in OpenMCL.
>
> Any suggestions on why OpenMCL would crash in such a way?


My guess: a bug.

There are actually a fairly small number of possible culprits, the most
likely of which is this C code in "ccl:lisp-kernel;thread-manager.c":

  ...
Boolean
suspend_tcr(TCR *tcr)
{
  int suspend_count = atomic_incf(&(tcr->suspend_count));
  if (suspend_count == 1) {
#ifdef DARWIN
    if (mach_suspend_tcr(tcr)) {
      tcr->flags |= TCR_FLAG_BIT_ALT_SUSPEND;
      return true;
    }
#endif
    if (pthread_kill((pthread_t)ptr_from_lispobj(tcr->osid), thread_suspend_signal) == 0) {
  ...

That's used by the GC (mostly) to suspend other threads; it's a little
hard to read, but it says "on Darwin, try to use the function 
mach_suspend_tcr(); if that fails (or if this isn't Darwin), use
pthread_kill() instead."
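
To make that control flow easier to see in isolation, here is a minimal
sketch of the same "try the preferred mechanism, else fall back" shape.
Everything in it is hypothetical: toy_tcr, toy_suspend(), the stub
functions and the result enum are illustrative stand-ins, not kernel
code, and the handling of a nested suspend is an assumption, since the
excerpt above does not show that branch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's TCR; only the fields used here. */
typedef struct {
    atomic_int suspend_count;
    unsigned flags;
} toy_tcr;

#define TOY_FLAG_ALT_SUSPEND 1u   /* illustrative flag value */

typedef enum { TOY_NESTED, TOY_PRIMARY, TOY_FALLBACK, TOY_FAILED } toy_result;
typedef bool (*suspend_fn)(toy_tcr *);

/* Stubs standing in for mach_suspend_tcr() and the pthread_kill() path. */
static bool primary_ok(toy_tcr *t)    { (void)t; return true; }
static bool primary_fails(toy_tcr *t) { (void)t; return false; }
static bool fallback_ok(toy_tcr *t)   { (void)t; return true; }

/* Only the first suspend request (count 0 -> 1) does any work.
   A NULL primary models a build where the preferred path is
   compiled out, so the fallback is always used. */
static toy_result toy_suspend(toy_tcr *t, suspend_fn primary, suspend_fn fallback)
{
    int count = atomic_fetch_add(&t->suspend_count, 1) + 1;
    if (count != 1)
        return TOY_NESTED;            /* assumption: nested requests do nothing */
    if (primary != NULL && primary(t)) {
        t->flags |= TOY_FLAG_ALT_SUSPEND;
        return TOY_PRIMARY;
    }
    return fallback(t) ? TOY_FALLBACK : TOY_FAILED;
}
```

Calling toy_suspend(&t, primary_fails, fallback_ok) on a fresh toy_tcr
takes the fallback path and leaves the flag clear; calling it a second
time on the same toy_tcr only bumps the counter.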

It seems to be the case that if mach_suspend_tcr() suspends a thread
that is in a certain state (has a pending exception), the Mach kernel
sends the exception message twice; this confuses the lisp kernel thread
that handles exception messages and it does some sort of illegal memory
access.  (Nothing handles exceptions that occur in that thread, and
the segfault kills the process.)

It'd be nice if this (mach_suspend_tcr) worked in all cases, but in
the short term it's nicer not to provoke the problem.  Changing the

#ifdef DARWIN

to

#if 0


in that function seems to avoid the problem.
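
As a rough illustration of what that edit does: with the guard false,
the preprocessor drops the Mach branch entirely, so only the signal
path is ever compiled.  The sketch below uses a made-up switch,
TOY_USE_ALT_SUSPEND, and made-up stub functions; none of these names
exist in the lisp kernel.

```c
#include <stdbool.h>

/* Hypothetical build switch (not a real OpenMCL macro).
   0 plays the role of "#if 0"; 1 plays the role of "#ifdef DARWIN"
   on a Darwin build. */
#ifndef TOY_USE_ALT_SUSPEND
#define TOY_USE_ALT_SUSPEND 0
#endif

#if TOY_USE_ALT_SUSPEND
/* Stand-in for mach_suspend_tcr(); only compiled when the switch is on. */
static bool toy_alt_suspend(void) { return true; }
#endif

/* Stand-in for the pthread_kill() path. */
static bool toy_signal_suspend(void) { return true; }

/* Reports which suspension path this build would take. */
static const char *toy_suspend_path(void)
{
#if TOY_USE_ALT_SUSPEND
    if (toy_alt_suspend())
        return "alt";
#endif
    return toy_signal_suspend() ? "signal" : "failed";
}
```

Built as written, toy_suspend_path() reports the signal path; building
with -DTOY_USE_ALT_SUSPEND=1 brings the alternate branch back.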

(And yes, several people have run into this, and there really should
be a version soon that either does the right thing inside
mach_suspend_tcr() or wimps out.)




>
> jjm
>
> MCL Version 5.1b4
> OpenMCL Version (Beta: Darwin) 0.14.3!
> Mac OS 10.4.2  Dual G5 2.5 GHz.
>
>
>


