[Openmcl-devel] CCL crash on windows (with multithreading and ffi callbacks)

Tue May 3 08:11:15 PDT 2011

03.05.2011, 03:05, "Gary Byers" <gb at clozure.com>:
> On Tue, 3 May 2011, Anton Vodonosov wrote:
>
>>  02.05.2011, 06:59, "Gary Byers" <gb at clozure.com>:
>>>  I'd recommend debugging.
>>>
>>>  Is the callback (locking-callback) called ?  Do its arguments look plausible ?
>>>  Do BT:ACQUIRE-LOCK and BT:RELEASE-LOCK do the right thing(s) ?  If things
>>>  seem to work when this callback isn't installed and don't when it is, then one
>>>  could suspect either the mechanics of the callback or the code it calls; if
>>>  (hypothetically) BT:ACQUIRE-LOCK and BT:RELEASE-LOCK didn't work, then it
>>>  wouldn't be too surprising if a multithreaded application that relied on those
>>>  things working didn't work.
>>  As for BT:ACQUIRE-LOCK and other callback impl. details, they are ruled out
>>  by the fact that the crash reproduces the same way if we leave the callback
>>  body empty.
>>
>>  And the crash is not caused by a lack of proper synchronization -
>>  if we do not register the callback at all, the crash doesn't happen.
>>>  If the "mechanics of the callback" - receiving arguments from and returning
>>>  results from foreign code - were at fault, then that is something that CCL
>>>  (and to some extent CFFI) is responsible for and a problem there would
>>>  almost certainly be a bug in CCL.  The callbacks that seem to be involved
>>>  don't -look- too unusual, but one never knows.
>>  Arguments are passed OK to the callback - the strings and numbers we would
>>  expect. So it doesn't seem to be stack corruption or anything of that kind.
>>
>>  Also interesting is that the callback is called thousands of times before the crash
>>  happens.
>>>  If someone isolates the problem as a CCL bug, then I'd certainly be interested
>>>  in trying to fix it.  It's potentially a lot of work just to isolate the problem;
>>>  I wish that I could say (well, sort of wish ...) that I had time and interest
>>>  in doing that, but I quite frankly have neither.  (I don't even know where
>>>  things like QuickLisp put the sources to the systems that it downloads, and I
>>>  don't have the attention span or patience or whatever to learn that.)
>>>
>>>  It sounds like you've already done quite a bit to narrow this down;
>>>  adding a few calls to BREAK or PRINT or FORMAT in appropriate places
>>>  might do a lot to help isolate the problem to the point where someone
>>>  could actually do something about it.
>>  I already tried PRINT and FORMAT. I had hoped you might have some debugging
>>  technique which lets you find the reason for a crash more or less immediately.
>>
>>  The symptoms are strange. I don't know, maybe it's not FFI directly, maybe
>>  Windows CCL is not thread safe with some basic data structures (e.g. CONSes)
>>  which happen to be used in FFI implementations. As I said, I also observed
>>  crashes without FFI.
>
> I'm not sure that I understand this.  There's essentially and intentionally
> no way to pass a lisp object like a CONS to foreign code; that indeed wouldn't
> be thread-safe.

I meant that CCL:DEFINE-CALLBACK most likely creates/manipulates some lisp data
(for internal purposes, to register the lisp function somehow as a callback).
And when the callback is called, this lisp data is probably also accessed somehow.

CFFI probably does similar things. So I don't mean passing a lisp object to foreign
code, but what happens in the lisp world when the callback is called.

>
> So, the problem's either the callback, or it isn't ...

Yes, precisely )

>
> Is the symptom consistently a crash into the kernel debugger with a
> complaint that an exception occurred in foreign code (it may say
> "on foreign stack") ?  Or does it die in a variety of ways ?
> If it does die in foreign code, please send me the output of the
> kernel debugger's 'r' command.  I probably won't be able to tell
> what foreign code it's in, but might be able to tell if it's in
> CCL's GC or elsewhere.

It doesn't fall into the kernel debugger. Windows just shows a message
box like "The program wx86cl64.exe has crashed" (it's an approximate translation,
the actual message is in Russian) and then simply kills the process.
Nothing appears in the CCL console.

It always fails the same way.

> What C SSL library is used (if any) ?  
OpenSSL is from here:
http://www.slproweb.com/products/Win32OpenSSL.html
It's built with Visual C++ 2008.

> Is it known for certain that the
> interfaces that that library offers match those that CL+SSL defines ?
> (Note that one difference between Win64 and Unixy 64-bit platforms is
> that the C "long" type (and variants like "unsigned long") are only
> 32 bits wide on Win64; on other platforms, they're 64 bits wide.  This
> generally doesn't matter much when passing scalar arguments/returning
> scalar values - those values will be correct in their low 32 bits but
> may have undefined upper halves, so it "doesn't matter much" from the
> point of view of stack discipline - but that can matter a lot more
> visibly when structures have "long" fields.)

No, I can only say "I am not aware of any incompatibilities". As for
the callback, its arguments are ints.
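
To make the point about "long" concrete, here is a minimal C sketch. It is
not taken from OpenSSL or CL+SSL: the struct and function names are
hypothetical, and fixed-width types stand in for the two data models (long
is 32 bits on Win64, 64 bits on other 64-bit platforms). It only shows how
a struct with a "long" field can be laid out differently depending on that
width, and what "correct in the low 32 bits" means for a scalar.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical struct, written the way a C header might declare it.
   Its real layout depends on how wide "long" is on the platform. */
struct record_native {
    int  tag;
    long count;
    int  flags;
};

/* The same fields with "long" pinned to 32 bits, as on Win64. */
struct record_llp64 {
    int32_t tag;
    int32_t count;
    int32_t flags;
};

/* The same fields with "long" pinned to 64 bits, as on 64-bit Unix. */
struct record_lp64 {
    int32_t tag;
    int64_t count;
    int32_t flags;
};

/* Print where each variant places "flags" and how big the struct is.
   A binding that assumes one layout while the library was compiled
   with the other would read and write the wrong bytes. */
void print_layouts(void)
{
    /* Three 32-bit fields need no padding. */
    assert(sizeof(struct record_llp64) == 12);

    printf("native: sizeof(long)=%zu, flags at %zu, size %zu\n",
           sizeof(long),
           offsetof(struct record_native, flags),
           sizeof(struct record_native));
    printf("32-bit long: flags at %zu, size %zu\n",
           offsetof(struct record_llp64, flags),
           sizeof(struct record_llp64));
    printf("64-bit long: flags at %zu, size %zu\n",
           offsetof(struct record_lp64, flags),
           sizeof(struct record_lp64));
}

/* Scalar case: the callee only looks at the low 32 bits of the
   register, so garbage in the upper half is ignored. */
uint32_t low_32_bits(uint64_t reg)
{
    return (uint32_t)(reg & 0xFFFFFFFFu);
}
```

Calling print_layouts() from a small main() on each platform shows which
of the two fixed-width variants the native struct matches there.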

>
>>  Well. If I find anything more I'll report here.
>>
>>  Best regards,
>>  - Anton