[Openmcl-devel] ccl64 freebsd64 hunchentoot segfault

Thu Feb 9 03:48:28 PST 2012

I checked in (to the trunk) some changes to try to address the problem
that I encountered here, and after doing so I tried pounding on Hunchentoot
with the "ab" program and couldn't get it to fail in any way.  As I said,
I'm skeptical that the problem I (hopefully) fixed could have led to anything
but the deadlock that I saw.

The two most obvious differences between the system that I was using and yours
are:

  - you were running under VirtualBox and I was running on real hardware (that
    happens to be a 6-core "AMD Phenom(tm) II X6 1065T Processor")

  - you were running FreeBSD 9.0; I was running 8.2

I looked at the 9.0 release notes and didn't see mention of any change that
looked like it could cause this sort of problem.  I Googled a bit and saw
that there have been recent FreeBSD kernel changes (having to do with support
for the relatively new 256-bit vector registers ("AVX") available on some
processors), and depending on details, those changes could cause problems
like those that you reported.

I don't know whether those changes were made before or after the 9.0 release.
(It seems that the changes have been under development for several months.)

If this is the source of the problem, I wouldn't have a problem in saying
that CCL doesn't (yet) run on 9.0; it's certainly true that I haven't run
it under 9.0 yet and I don't think that anyone else here has, either.

My experience in trying to provoke misbehavior on 8.2 isn't conclusive, but
it seems to suggest that the problem you're seeing isn't present on 8.2/real
hardware.  If this is the result of an incompatible change in 9.0, we'll
obviously have to try to figure out how to work around it, but it's not
clear how long that might take.

If the problem has to do with virtualization, there's generally less that
we can do about it besides reporting it as a bug; I don't even know enough
at this point to be able to write a coherent bug report.

At this point, I'm more inclined to think that this is caused by changes
in 9.0 (simply because the virtualization problems that we've seen in the
past weren't this subtle.)

If I could provoke the crash that you see on my 8.2/real-hardware system,
then all of this handwaving goes out the window.  I haven't yet been able
to.

On Wed, 8 Feb 2012, Gary Byers wrote:

> Thanks.  I wasn't able to reproduce this (running a trunk CCL on
> FreeBSD 8.2 on real hardware), but did see another severe problem.
> It's not clear how what I saw could cause what you saw, but until the
> cause of what I saw is eliminated it's probably not worth looking for
> anything else.  What I saw affects CCL on FreeBSD, might affect it on
> Solaris, doesn't currently affect it on Linux (but might in some
> future Linux version), and is pretty critical.  I think that it'll
> take anywhere from a few hours to a few days to fix; the fix will be
> made in the trunk, smoke-tested a bit, and then propagated to 1.7 in
> svn if it all seems to work and doesn't obviously break anything.
>
> Gory details follow.
>
> Most (perhaps all) implementations of malloc/free use a global lock to
> ensure that at most one thread in a process can modify heap data
> structures (and malloc/free and friends generally need to modify data
> structures without worrying about other threads trying to modify those
> data structures at the same time.)  It's possible that some
> implementations could use atomic memory operations to keep things
> thread-safe, but I don't know if any implementations do so.
>
> CCL's GC runs on an arbitrary thread (usually whatever thread tries to do
> a memory allocation that would otherwise cause the heap to grow past a
> specified threshold.)  The GC isn't concurrent; on entry, it suspends all
> other lisp threads and on exit it resumes them.
>
> CCL's GC supports "gcable pointers"; these are used to support language
> constructs like MAKE-GCABLE-RECORD and are also used in the implementation
> of things like locks and semaphores.  Conceptually, when the GC discovers
> that certain foreign pointers are about to become garbage, it arranges to
> do a kind of ad hoc finalization (also known as termination) on them.  The
> GC can't actually free the foreign memory associated with the pointer while
> other threads are suspended, because some suspended thread might hold the
> malloc heap's lock (and the GC thread would deadlock, waiting forever for
> a lock held by a thread that's suspended and obviously can't release it.)
>
> To work around this, when certain kinds of gcable pointers (locks and
> semaphores) are discovered to be garbage, the GC does the work of freeing
> the object in two stages: the first stage runs immediately and does some
> sort of "deinitialization" of the pointer (telling the OS kernel that the
> semaphore isn't a semaphore anymore) and adds the pointer itself to a
> list; the second stage runs after the GC has allowed other threads to
> resume and calls free() on all of the "deinitialized" pointers on that
> list.
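
The two-stage scheme described above could be sketched in C roughly as
follows.  This is only an illustration under stated assumptions, not CCL's
actual kernel code: every name in it (gcable_record, gc_stage_one,
gc_stage_two, pending_dead_count, clear_flag) is hypothetical, and the
"deinit" callback stands in for something like a wrapper around
sem_destroy().  The link field lives inside the record itself so that
queuing a dead record during GC needs no allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names here are hypothetical; this is not CCL's actual kernel code. */

/* A foreign object that the GC may have to reclaim. */
typedef struct gcable_record {
    struct gcable_record *next_dead;  /* used only once the record is garbage */
    void (*deinit)(void *payload);    /* e.g. a wrapper around sem_destroy() */
    void *payload;                    /* malloc'd foreign memory */
} gcable_record;

/* Records that stage one has deinitialized but nobody has freed yet. */
static gcable_record *dead_list = NULL;

/* Normal allocation path, run while no GC is in progress. */
gcable_record *make_gcable_record(size_t size, void (*deinit)(void *))
{
    gcable_record *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->payload = calloc(1, size);
    if (r->payload == NULL) {
        free(r);
        return NULL;
    }
    r->deinit = deinit;
    r->next_dead = NULL;
    return r;
}

/* Stage one: runs inside the GC while other threads are suspended.
   It must not call malloc() or free(), and it assumes that deinit does
   not do so either. */
void gc_stage_one(gcable_record *r)
{
    assert(r != NULL);
    if (r->deinit != NULL)
        r->deinit(r->payload);
    r->next_dead = dead_list;
    dead_list = r;
}

/* Stage two: runs after the GC has resumed the other threads, so waiting
   for the malloc lock can no longer deadlock.  Returns the number of
   records freed. */
size_t gc_stage_two(void)
{
    size_t freed = 0;
    while (dead_list != NULL) {
        gcable_record *r = dead_list;
        dead_list = r->next_dead;
        free(r->payload);
        free(r);
        freed++;
    }
    return freed;
}

/* Number of records waiting for stage two. */
size_t pending_dead_count(void)
{
    size_t n = 0;
    for (gcable_record *r = dead_list; r != NULL; r = r->next_dead)
        n++;
    return n;
}

/* Stand-in deinitializer for experiments: clears an int-sized payload. */
void clear_flag(void *payload)
{
    *(int *)payload = 0;
}
```

For a semaphore, deinit would be a thin wrapper around sem_destroy().  The
sketch is only safe if that wrapper never touches the malloc heap, which is
the assumption under discussion.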
>
> That seems to work well on most platforms, but it assumes that (for
> instance) initializing a POSIX semaphore (via sem_init()) doesn't
> itself call malloc() and that deinitializing one (via sem_destroy())
> doesn't call free().  That isn't a safe assumption in general (though
> it happens to be true in Linux) and isn't a correct assumption on
> FreeBSD.  (I haven't checked Solaris, Windows uses its own semaphore
> objects, and Apple hasn't invented POSIX semaphores yet AFAIK.)
>
> As I said, it's not clear to me how this could lead to termination
> via SIGSEGV (though it's clear that it can lead to the kind of deadlock
> that I saw), so that may be another bug (or something unique to some
> combination of FreeBSD 9.0 and virtualization.)  I'll try your test
> again after this is cleaned up.
>
> On Wed, 8 Feb 2012, Antony wrote:
>
>> On 2/7/2012 12:06 PM, Gary Byers wrote:
>>> I could speculate more, but I don't know how useful that'd be.  I don't
>>> know of any FreeBSD-specific CCL problems that might cause this but that
>>> doesn't mean too much either way.
>>> 
>> I am able to reproduce this without any of my code (thanks to some
>> prodding).  Following is what I did.
>> Run CCL as
>> 
>> CCL_DEFAULT_DIRECTORY=/home/antony/ccl.freebsd/ccl /home/antony/ccl.freebsd/ccl/scripts/ccl64
>> 
>> in the repl do the following
>> 
>> (load #P"/home/antony/git/thirdparty/asdf")
>> (asdf:initialize-source-registry
>>  (list :source-registry
>>        ;; where hunchentoot and its dependencies live
>>        (list :tree #P"/home/antony/git/thirdparty")
>>        :inherit-configuration))
>> (asdf:oos 'asdf:load-op :hunchentoot)
>> (defvar *https* (hunchentoot:start
>>           (make-instance 'hunchentoot:easy-ssl-acceptor :port 8083
>>                          :ssl-privatekey-password "xxxxxxx"
>>                          :ssl-certificate-file "~/git/config/https-cert/server.crt"
>>                          :ssl-privatekey-file "~/git/config/https-cert/server.key")))
>> 
>> Run apache bench as
>> ab -n 2000 -c 4 'https://xxxxx:8083/'
>> I get a segfault after some requests.
>> 
>> ab does not ignore SSL cert errors (mine is self-signed),
>> so you essentially get a series of aborted requests.
>> 
>> To make the test more complete, I got hold of the following script;
>> it ignores the cert signing error and does full requests (from
>> http://stackoverflow.com/questions/189993/how-do-i-fix-ssl-handshake-failed-with-apachebench )
>> #--------------------------------------------------
>> #!/bin/bash
>> K=200;
>> HTTPSA='https://192.168.0.105:8083/'
>> date +%M-%S-%N
>> for (( c=1; c<=$K; c++ ))
>> do
>>    wget --no-check-certificate --secure-protocol=SSLv3 --spider $HTTPSA &
>> done
>> date +%M-%S-%N
>> #------------------------------------------------
>> 
>> and ran it concurrently as
>> sh qqqqqq.sh &  sh qqqqqq.sh &
>> this also caused segfault
>> 
>> But neither caused segfault on CCL+linux
>> 
>> The core file is still too big to email
>> 
>> -Antony
>> 
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>