[Openmcl-devel] ccl crashes

Gary Byers gb at clozure.com
Wed Jan 20 02:54:12 PST 2010


CCL's kernel handles SIGSEGV; having the process terminate with a
message saying "Segmentation violation" generally means that the OS
isn't signaling SIGSEGV in a way that allows a handler to run (e.g.,
it's just terminating the process with extreme prejudice.)  One thing
that can cause this is recursion in a signal handler (a thread gets
an exception, the OS kernel pushes a bunch of context info on the
thread's stack and calls a handler, that handler gets an exception,
the OS kernel pushes a bunch of context info on the thread's stack
and calls a handler, the handler gets an exception ... at some point,
that stack overflows or is about to and the OS notices that there's
no room for the context and abruptly terminates the process with
an unhandled SIGSEGV.)

(It -may- also be the case that this kind of abrupt termination can
occur in extreme system-wide low-memory situations; it may happen on
Linux if Linux decides that more memory could be freed if there just
weren't so many processes using memory and starts killing processes,
and it may happen if a process tries to use memory that Linux has
overcommitted.  I'm not sure if I'm remembering either of these
low-memory cases correctly.)

Let's just say that having the process terminated abruptly like that
is a bit unusual and usually indicative of a very severe problem.  I
can't imagine anything in your code that might cause that response and
I can't reproduce it either.  (There are certainly things to go wrong
in your example: SQRT isn't inlined, so calls to SQRT with a
DOUBLE-FLOAT argument will cons.  That case of SQRT is implemented as
a foreign function call, so the GC is often running when another
thread is in the middle of consing or transitioning between foreign
and lisp code.)

Like I said, there are things to go wrong there, but none of those
things "should" be stressed by code like this, and it's hard to see
how a failure there could lead to the abrupt termination that you're
seeing.

In the case where you got an error while running F in a single thread:
X should get increasingly close to 1.0d0 until it becomes 1.0d0, and
the SQRT of 1.0d0 is 1.0d0 (so we basically converge fairly quickly
and then repeatedly do (SETQ X (SQRT 1.0d0))).  X should never be
negative, so the path that SQRT is taking (the one where the error is
occurring) should never be reached.
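
To make the convergence argument concrete, here's a minimal
single-threaded sketch.  SQRT-FIXPOINT is a hypothetical helper (it
isn't in your test file); it just repeats the (SETF X (SQRT X)) step
from F until the value stops changing and reports how many steps that
took.  It assumes IEEE double-float arithmetic.

```lisp
;; Hypothetical helper, not part of the original coretest.lisp:
;; iterate X <- (SQRT X) from START until the value stops changing.
(defun sqrt-fixpoint (&optional (start 2.0d0))
  (let ((x start)
        (steps 0))
    (loop
      (let ((next (sqrt x)))
        ;; Once (SQRT X) returns X itself, further iterations of the
        ;; loop in F can't change anything.
        (when (= next x)
          (return (values x steps)))
        (setf x next)
        (incf steps)))))

;; Print the fixed point and the number of steps needed to reach it.
(multiple-value-bind (x steps) (sqrt-fixpoint)
  (format t "~&fixed point ~S reached after ~D steps~%" x steps))
```

On a sane system I'd expect the fixed point to be 1.0d0, reached after
a few dozen steps; nothing along the way is negative, so nothing
should ever produce a complex result.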

That error makes no sense either, and I can't reproduce it.  At this point,
it just seems like you're experiencing "random flakiness" (to use the
technical term ...), and CCL may or may not be a contributing factor
to that.




On Tue, 19 Jan 2010, Mario S. Mommer wrote:

>
> Hi,
>
> the attached file contains some simple code that can be used to crash
> ccl 1.4. I downloaded ccl 1.4 using subversion. I'm running it on Ubuntu
> Jaunty (9.04), 64 bits, and on a stock kernel.
>
> uname -a gives
>
> Linux padme 2.6.28-17-generic #58-Ubuntu SMP Tue Dec 1 21:27:25 UTC 2009 x86_64 GNU/Linux
>
> What I do is start two threads doing a simple computation intended to
> use cycles. The observed behavior is erratic, as it takes a variable
> time before ccl exits with a segfault:
>
> mommer at padme:~/local/src/ccl$ ls
> cocoa-ide         doc       library         lx86cl.image  x86-headers64
> compiler          examples  lisp-kernel     objc-bridge   xdump
> contrib           level-0   lx86cl          scripts
> coretest.lisp     level-1   lx86cl64        tools
> coretest.lx64fsl  lib       lx86cl64.image  x86-headers
> mommer at padme:~/local/src/ccl$ lx86cl64
> Welcome to Clozure Common Lisp Version 1.4-r13119  (LinuxX8664)!
> ?  (compile-file "coretest.lisp")
> #P"/home/mommer/local/src/ccl/coretest.lx64fsl"
> NIL
> NIL
> ? (load "coretest")
> #P"/home/mommer/local/src/ccl/coretest.lx64fsl"
> ? (coretest::start-threads 2 #'coretest::consing-f)
> NIL
> ? Segmentation fault
> mommer at padme:~/local/src/ccl$
>
> It makes a difference if I compile the file or not (in the latter case,
> there is some chance that it won't crash). Also, if I do as follows,
>
> mommer at padme:~/local/src/ccl$ lx86cl64
> Welcome to Clozure Common Lisp Version 1.4-r13119  (LinuxX8664)!
> ? (defun f ()
>  (let ((x 2.0d0))
>    (dotimes (i most-positive-fixnum)
>      (setf x (sqrt x)))
>    x))
> F
> ? (loop repeat 2 collecting
>     (ccl:process-run-function (format nil "~A" (random 1.0d0)) #'f))
> (#<PROCESS 0.4779367736939041D0(2) [Active] #x30004116568D> #<PROCESS 0.9475384938042208D0(3) [Reset] #x3000411641AD>)
> ?
>
> then it doesn't crash, although sometimes (!) one thread decides it
> doesn't like it.
>
>> Error: value #C(0.0D0 1.0D0) is not of the expected type REAL.
>> While executing: COMPLEX, in process 0.9475384938042208D0(3).
>
>
> ;;;
> ;;; #<PROCESS 0.9475384938042208D0(3) [Active] #x30004113FB0D> requires access to Shared Terminal Input
> ;;; Type (:y 3) to yield control to this thread.
> ;;;
> (:y 3)
>
>
> ;;;
> ;;; Shared Terminal Input is now owned by #<PROCESS 0.9475384938042208D0(3) [Active] #x30004113FB0D>
> ;;;
>
>> Type :POP to abort, :R for a list of available restarts.
>> Type :? for other options.
> 1 > :B
> (7F0105D4DE38) : 0 (COMPLEX 0 #C(0.0D0 1.0D0)) 1669
> (7F0105D4DE88) : 1 (F) 61
> (7F0105D4DEB0) : 2 (RUN-PROCESS-INITIAL-FORM #<PROCESS 0.9475384938042208D0(3) [Active] #x30004113FB0D> (#<COMPILED-LEXICAL-CLOSURE # #x30004113F8AF>)) 725
> (7F0105D4DF48) : 3 (FUNCALL #'#<(:INTERNAL (CCL::%PROCESS-PRESET-INTERNAL (PROCESS)))> #<PROCESS 0.9475384938042208D0(3) [Active] #x30004113FB0D> (#<COMPILED-LEXICAL-CLOSURE # #x30004113F8AF>)) 389
> (7F0105D4DF98) : 4 (FUNCALL #'#<(:INTERNAL CCL::THREAD-MAKE-STARTUP-FUNCTION)>) 301
> 1 >
>
> Any ideas?
>
> Regards,
>
>        Mario.
>
> P.S: here is the code:
>
>


