[Openmcl-devel] CCL crash on windows (with multithreading and ffi callbacks)

Gary Byers gb at clozure.com
Mon May 2 16:05:54 PDT 2011

On Tue, 3 May 2011, Anton Vodonosov wrote:

> 02.05.2011, 06:59, "Gary Byers" <gb at clozure.com>:
>> I'd recommend debugging.
>> Is the callback (locking-callback) called ?  Do its arguments look plausible ?
>> Do BT:ACQUIRE-LOCK and BT:RELEASE-LOCK do the right thing(s) ?  If things
>> seem to work when this callback isn't installed and don't when it is, then one
>> could suspect either the mechanics of the callback or the code it calls; if
>> (hypothetically) BT:ACQUIRE-LOCK and BT:RELEASE-LOCK didn't work, then it
>> wouldn't be too surprising if a multithreaded application that relied on those
>> things working didn't work.
> As for BT:ACQUIRE-LOCK and other callback impl. details, they are ruled out
> by the fact that the crash reproduces the same way if we leave the callback
> body empty.
> And the crash does not happen because of the absence of proper synchronization -
> if we do not register the callback at all, the crash doesn't happen.
>> If the "mechanics of the callback" - receiving arguments from and returning
>> results from foreign code - were at fault, then that is something that CCL
>> (and to some extent CFFI) is responsible for and a problem there would
>> almost certainly be a bug in CCL.  The callbacks that seem to be involved
>> don't -look- too unusual, but one never knows.
> Arguments are passed OK to the callback - the strings and numbers we would
> expect. So it doesn't seem to be a stack corruption or something in that fashion.
> Also interesting is that the callback is called thousands of times before the crash
> happens.
>> If someone isolates the problem as a CCL bug, then I'd certainly be interested
>> in trying to fix it.  It's potentially a lot of work just to isolate the problem;
>> I wish that I could say (well, sort of wish ...) that I had time and interest
>> in doing that, but I quite frankly have neither.  (I don't even know where
>> things like QuickLisp put the sources to the systems that it downloads, and I
>> don't have the attention span or patience or whatever to learn that.)
>> It sounds like you've already done quite a bit to narrow this down;
>> adding a few calls to BREAK or PRINT or FORMAT in appropriate places
>> might do a lot to help isolate the problem to the point where someone
>> could actually do something about it.
> I already tried PRINT and FORMAT. I hoped you might have some debugging
> technique that lets you find the cause of a crash more or less immediately.
> The symptoms are strange. I don't know, maybe it's not FFI directly, maybe
> Windows CCL is not thread safe with some basic data structures (e.g. CONSes)
> which happen to be used in FFI implementations. As I said, I also observed
> crashes without FFI.

I'm not sure that I understand this.  There's essentially and intentionally
no way to pass a lisp object like a CONS to foreign code; that indeed wouldn't
be thread-safe.

So, the problem's either the callback, or it isn't ...

Is the symptom consistently a crash into the kernel debugger with a
complaint that an exception occurred in foreign code (it may say
"on foreign stack") ?  Or does it die in a variety of ways ?
If it does die in foreign code, please send me the output of the
kernel debugger's 'r' command.  I probably won't be able to tell
what foreign code it's in, but might be able to tell if it's in
CCL's GC or elsewhere.

What C SSL library is used (if any) ?  Is it known for certain that the
interfaces that that library offers match those that CL+SSL defines ?
(Note that one difference between Win64 and Unixy 64-bit platforms is
that the C "long" type (and variants like "unsigned long") are only
32 bits wide on Win64; on other platforms, they're 64 bits wide.  This
generally doesn't matter much when passing scalar arguments/returning
scalar values - those values will be correct in their low 32 bits but
may have undefined upper halves, so it "doesn't matter much" from the
point of view of stack discipline - but it can matter a lot more
visibly when structures have "long" fields.)
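
As a rough illustration of the point about "long" fields, here is a minimal
C sketch.  The struct and function names (rec_llp64, rec_lp64,
read_flags_as_..., low32, demo) are hypothetical and don't come from any SSL
library or from CL+SSL; fixed-width types stand in for the two meanings of
"long" so that the sketch behaves the same on any platform.

```c
#include <assert.h>   /* assert, used in demo() */
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* fixed-width integer types */
#include <string.h>   /* memcpy */

/* Hypothetical struct as a C library built for Win64 would lay it out:
   "long" is 32 bits there, modelled here with int32_t. */
struct rec_llp64 {
    int32_t id;       /* "long id" in the imagined C header */
    int32_t flags;    /* "long flags" */
};

/* The same declaration as seen by a binding that assumes a 64-bit "long". */
struct rec_lp64 {
    int64_t id;
    int64_t flags;
};

/* Read "flags" using the layout the library actually used (offset 4). */
int64_t read_flags_as_llp64(const void *raw)
{
    int32_t v;
    memcpy(&v, (const unsigned char *)raw + offsetof(struct rec_llp64, flags),
           sizeof v);
    return v;
}

/* Read "flags" using the mistaken 64-bit layout (offset 8).
   The caller must supply at least sizeof(struct rec_lp64) readable bytes. */
int64_t read_flags_as_lp64(const void *raw)
{
    int64_t v;
    memcpy(&v, (const unsigned char *)raw + offsetof(struct rec_lp64, flags),
           sizeof v);
    return v;
}

/* A scalar "long" result on Win64 is only meaningful in its low 32 bits. */
uint32_t low32(uint64_t reg)
{
    return (uint32_t)reg;
}

/* Small self-check: the library stores flags = 2, the mismatched reader
   looks past the real struct and finds the zero padding of the buffer. */
void demo(void)
{
    unsigned char buf[16] = {0};
    struct rec_llp64 r = { 1, 2 };

    memcpy(buf, &r, sizeof r);
    assert(read_flags_as_llp64(buf) == 2);
    assert(read_flags_as_lp64(buf) == 0);
    assert(low32(0xDEADBEEF00000007ULL) == 7);
}
```

In the scalar case the junk in the upper half can simply be masked off; in
the structure case the field offsets themselves differ (4 vs. 8 here), so the
two sides read and write different memory.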

> Well. If I find anything more I'll report here.
> Best regards,
> - Anton
