[Openmcl-devel] Asynchronous callback made from real-time thread

Fri May 30 15:19:35 PDT 2003

On Fri, 30 May 2003, Letz Stephane wrote:

> >On Fri, 30 May 2003, Letz Stephane wrote:
> >
> >>  Hi,
> >>
> >>  We are trying to interface a Midi package (MidiShare:
> >>  www.grame.fr/MidiShare) with OpenMCL. Using MidiShare gives a Lisp
> >>  program access to a full Midi API: receiving and sending Midi
> >>  events, task management...
> >>
> >>  For real-time handling of incoming Midi events, there is a safe
> >>  solution that consists of polling the incoming event fifo in a
> >>  Lisp thread.
> >>
> >>  We tried to implement a more real-time approach by calling back
> >>  into the Lisp code directly from the real-time Midi thread, using
> >>  a lisp callback defined with defcallback.
> >>
> >>  This almost seems to work on version 0.13.5 but crashes after some
> >>  time. It crashes immediately with the 0.14 alpha version.
> >>
> >
> >I have a hunch that the "seems to work on 0.13.5" part means "is
> >able to run for a while scribbling over some other threads' stacks
> >before anything actually notices."
>
> Yes, exactly....
>
> >
> >Do you remember how it crashes in 0.14 ?  If it dies with a kernel
> >debugger message saying something like "No TCR for thread ...", that's
> >actually fairly mundane.
>
> It gives this kind of message:
>
> Thread 6 Crashed:
>   #0   0x00122de4 in tsd_get (thread_manager.c:361)
>   #1   0x00123850 in get_tcr (thread_manager.c:681)
>   #2   0x00123ac4 in suspend_other_threads (thread_manager.c:745)
>   #3   0x001215e0 in Bug (lisp-exceptions.c:2442)
>   #4   0x00123888 in get_tcr (thread_manager.c:688)
>   #5   0x00123ac4 in suspend_other_threads (thread_manager.c:745)
>   #
>
>

Yuck.

Each lisp thread in OpenMCL uses 3 stacks.  One of them (called the
"control stack") is shared with C; the others (the "value stack" and
"temp stack") are lisp-specific.

Every native thread that can run lisp code needs to have a
thread-specific data structure - called a "thread context record", or
TCR - that says where its stacks are, whether it's currently running
lisp code or foreign code, what the underlying POSIX and Mach/Linux
thread identifiers are, whether it's in some sort of suspended state
(and, if so, what values were in the machine registers when it was
suspended), where its special variable bindings and catch frames are,
etc.  The GC needs access to this information, and the thread
obviously needs a lot of it as well.

All active threads' TCRs are linked together in a doubly-linked list,
so given the current thread's TCR it's possible to find all TCRs.

A lisp process is a couple of layers of abstraction around a TCR.
(There's an intervening layer called a LISP-THREAD, but I think that
it's more trouble than it's worth.)

When a lisp thread is created (via MAKE-PROCESS/PROCESS-RUN-FUNCTION),
one of the first things that happens is that a TCR is allocated and
initialized.  (Part of the allocation involves creating the extra
stacks for the lisp thread.)  The TCR for the thread is made into a
"thread-local variable", using the POSIX #_pthread_getspecific/
#_pthread_setspecific functionality.  When lisp code is running, the
current thread's TCR is kept in a register (r2); foreign code either
doesn't preserve r2 (Darwin) or uses it for some other purpose (newer
versions of Linux glibc), so foreign function calls save and restore
the current thread's TCR around the call.

When foreign code creates a thread, that thread doesn't have a TCR
and so can't (yet) run any lisp code.

When a callback occurs, the current thread's TCR needs to be found;
this requires a lookup of the "key" used to identify TCRs in the
thread's thread-local variables.  If that lookup occurs on a foreign
thread (that hasn't created a TCR), it'll fail.

What's supposed to eventually happen in that case is that a TCR
gets allocated (more-or-less as happens when a lisp thread is
created) and the callback continues as normal.  Subsequent callbacks
made from that thread should then find the TCR and use it.  That code
isn't written yet, and what happens instead is a call into the
kernel debugger.
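
To make the intended behavior a little more concrete, here is a rough
sketch in C of "look the TCR up in thread-local storage, and create one
on the first callback from a foreign thread".  It is an illustration
under assumptions, not the kernel's code: the tcr_t layout and the names
lookup_tcr, get_or_create_tcr and count_tcrs are invented for the
example, and the extra stacks, the lisp-side bookkeeping and cleanup at
thread exit are all left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified TCR; a real one also records stack
   bounds, saved registers, binding and catch-frame pointers, etc. */
typedef struct tcr {
    struct tcr *prev, *next;  /* circular doubly-linked list of all TCRs */
    pthread_t owner;          /* POSIX identity of the owning thread */
    int foreign;              /* nonzero: lisp did not create the thread */
} tcr_t;

static pthread_key_t tcr_key;
static pthread_once_t tcr_key_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t tcr_list_lock = PTHREAD_MUTEX_INITIALIZER;
static tcr_t *all_tcrs = NULL;  /* some node on the list, or NULL */

static void make_tcr_key(void)
{
    int err = pthread_key_create(&tcr_key, NULL);
    assert(err == 0);
    (void)err;
}

/* Return the calling thread's TCR, or NULL if it has none yet. */
tcr_t *lookup_tcr(void)
{
    pthread_once(&tcr_key_once, make_tcr_key);
    return (tcr_t *)pthread_getspecific(tcr_key);
}

/* Return the calling thread's TCR, allocating and linking one the
   first time a thread without a TCR gets here. */
tcr_t *get_or_create_tcr(void)
{
    tcr_t *t = lookup_tcr();
    if (t != NULL)
        return t;

    t = calloc(1, sizeof *t);
    if (t == NULL)
        return NULL;
    t->owner = pthread_self();
    t->foreign = 1;
    if (pthread_setspecific(tcr_key, t) != 0) {
        free(t);
        return NULL;
    }

    pthread_mutex_lock(&tcr_list_lock);
    if (all_tcrs == NULL) {
        t->prev = t->next = t;
        all_tcrs = t;
    } else {
        /* splice t in just before all_tcrs */
        t->next = all_tcrs;
        t->prev = all_tcrs->prev;
        all_tcrs->prev->next = t;
        all_tcrs->prev = t;
    }
    pthread_mutex_unlock(&tcr_list_lock);
    return t;
}

/* Walk the list once, as the GC or debugger would to find all threads. */
int count_tcrs(void)
{
    int n = 0;
    pthread_mutex_lock(&tcr_list_lock);
    if (all_tcrs != NULL) {
        tcr_t *t = all_tcrs;
        do {
            n++;
            t = t->next;
        } while (t != all_tcrs);
    }
    pthread_mutex_unlock(&tcr_list_lock);
    return n;
}
```

A callback entry point would call something like get_or_create_tcr()
before touching any lisp state; later callbacks on the same thread get
the same pointer back from the thread-local lookup.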

The kernel debugger wants to suspend all other threads (to make
debugging a little less insane ...).  To do this, it tries to find the
current thread's TCR, so that it can find all other threads by walking
the doubly-linked list.  Trying to find the current thread's TCR fails
and causes the kernel debugger to be entered.  The kernel debugger
tries to find the current thread's TCR so that it can suspend other
threads.  After several thousand iterations of this, the stack
overflows ...

The backtrace you sent showed a few levels of this infinite recursion.
You didn't see the message that I thought you would, but it was trying
-real- hard to show it to you ...
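
The cycle in that backtrace can be modeled in a few lines of C.  This is
only a toy: the function names echo the ones in the backtrace, but the
bodies are stand-ins, and the in_debugger flag is a hypothetical
reentrancy guard (one conventional way to break this kind of cycle), not
something the kernel necessarily has.

```c
#include <assert.h>
#include <stddef.h>

int lookups = 0;      /* how many TCR lookups were attempted */
int in_debugger = 0;  /* hypothetical guard against re-entering */

void enter_debugger(const char *msg);

/* Stand-in for a lookup on a foreign thread: it never finds a TCR,
   so it always reports the failure to the debugger. */
void *get_tcr(void)
{
    lookups++;
    enter_debugger("No TCR for thread");
    return NULL;
}

/* The debugger needs the current TCR to find the other threads. */
void suspend_other_threads(void)
{
    (void)get_tcr();
}

void enter_debugger(const char *msg)
{
    (void)msg;
    if (in_debugger)
        return;  /* already inside the debugger: do not recurse again */
    in_debugger = 1;
    suspend_other_threads();
    in_debugger = 0;
}
```

Without the guard, get_tcr, enter_debugger and suspend_other_threads
call each other until the stack runs out; with it, the second failed
lookup simply returns.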

> >Calling lisp code from any native thread (other than the initial one)
> >can't possibly work in 0.13.5.
> >
> >Calling lisp code from native threads is ... what happens all the time
> >in 0.14.  (If lisp didn't create the thread, there's some lisp-specific
> >setup that's supposed to happen on the first callback on such a thread;
> >once that happens, the thread should look like any other.)
>
> Could you explain this a little more?

The "lisp-specific setup" I was referring to is the creation of a TCR
for the thread.

There's some additional complexity that has to happen on the lisp side
of things when that foreign thread starts running lisp code.  What
should the value of *CURRENT-PROCESS* be in that thread?  If non-null,
should that process object be accessible via (ALL-PROCESSES)?  What
happens to the process object when the foreign thread exits?  (I think
that these questions are the real reasons that get_tcr() isn't fully
written yet ...)

> >The GC in OpenMCL is fairly quick, but it's not real-time or very
> >close to it.  It's also not concurrent: it suspends all other threads
> >while it's running.  I don't know what kind of real-time constraints
> >might be involved in MIDI programming, but (if and when things are
> >running reliably) you probably want to minimize consing.
>
> Is there a way to disable GC during the realtime callback?

Yes.  (Whether or not it's a good idea is another question.  If a GC
is "missed" because the GC is disabled, it should probably be forced
to happen as soon as the GC is enabled again; this doesn't happen
yet.  When this works reasonably, (CCL::WITHOUT-GCING &body body) should
be exported and documented; until then, "use it at your own risk, like
the hash table code does".)
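
The "force the missed GC when it's enabled again" behavior might look
something like the following C sketch.  All of the names here
(disable_gc, enable_gc, request_gc, gc_runs) are made up for
illustration, run_gc is a stand-in that just counts invocations, and the
sketch is single-threaded; a real version would have to synchronize with
every other thread.

```c
#include <assert.h>
#include <stdbool.h>

static int gc_inhibit_depth = 0;  /* nesting depth of "GC disabled" */
static bool gc_pending = false;   /* a GC was wanted while disabled */
static int gc_runs = 0;           /* how many collections have run */

/* Stand-in for the collector: just count the invocation. */
static void run_gc(void)
{
    gc_runs++;
}

/* Called when allocation decides a GC is needed. */
void request_gc(void)
{
    if (gc_inhibit_depth > 0)
        gc_pending = true;  /* remember the "missed" GC */
    else
        run_gc();
}

void disable_gc(void)
{
    gc_inhibit_depth++;
}

/* Leaving the outermost disabled region runs any missed GC at once. */
void enable_gc(void)
{
    assert(gc_inhibit_depth > 0);
    gc_inhibit_depth--;
    if (gc_inhibit_depth == 0 && gc_pending) {
        gc_pending = false;
        run_gc();
    }
}
```

A real-time callback would bracket its body with disable_gc() and
enable_gc(); the cost of any deferred collection is then paid as soon as
the callback leaves the protected region rather than inside it.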

>
> Having a blocking wait on the event fifo is probably possible, but
> I'm also interested in knowing more about the "complex" solution (:

I hope that you now know much more than you wanted to about some
parts of the problem; in particular, I hope that it's clear that
the problem isn't running code "asynchronously" in 0.14 (all code
runs "asynchronously", in that lisp isn't scheduling it) so much
as it is the fact that lisp didn't create the thread that's making
the callback and needs to do some things to make it look like it
would if it -had- created it.

Getting foreign threads to be treated like processes (making the
value of *CURRENT-PROCESS* meaningful, having lisp PROCESS objects
for them, having them show up in (ALL-PROCESSES) and :PROC, etc)
is harder than just allocating a TCR would be.  If your event handler
callback can refrain from thinking that it's running in a lisp process,
there isn't much new code involved in creating a TCR; getting the other
stuff to work would probably require some ABI changes.

>
> Thanks
>
> Stephane Letz
>
>

_______________________________________________
Openmcl-devel mailing list
Openmcl-devel at clozure.com
http://clozure.com/cgi-bin/mailman/listinfo/openmcl-devel