[Openmcl-devel] OpenMCL, Intel, Rosetta

Fri Jan 13 14:22:55 PST 2006

On Fri, 13 Jan 2006, Gary King wrote:

> Gary,
>
> Thanks again for this post. If you have time, I would like to know
> how exactly OpenMCL makes use of precise exceptions? Also, will the
> port to AMD64 help to work around Rosetta? 
>
> Thanks

[This may be way too much information and way too much detail; I
don't know how to be concise about this.]

First of all, I hope that the meme "precise exceptions" (in Apple's use
of the term) doesn't make it into civilized discourse.  The term has
at least one other meaning; it often refers to whether floating-point
exceptions are detected and reported as they occur (e.g., on overflow
or divide-by-zero) or whether they're reported later (perhaps on the
next FPU operation or after some number of bus cycles or something.)
If I recall correctly, FP exceptions involving the traditional X87
FPU used in Intel-based hardware are imprecise, FP exceptions involving
SSE/SSE2 FP vector units in newer Intel-based hardware are precise, and
the PPC offers both precise and imprecise FP exceptions (selectable
by some bits in the "Machine State Register" (MSR)); OpenMCL usually
runs with MSR bits that select precise exception reporting.

Since the term "precise exceptions" has another meaning, I'd have
preferred that Apple had used another term.  Unfortunately, I can't
think of anything concise and catchy, so I'd have to recommend that
we instead discuss "exceptions whose handlers receive non-garbage
values in their signal contexts."  (Yes, I know that that doesn't
exactly roll off the tongue; perhaps it will with practice.)  The
statement in the Rosetta documentation that I quoted yesterday
might therefore be paraphrased as:

"Rosetta does not support exceptions whose handlers receive non-garbage
values in their signal contexts ..."

I phrased that slightly differently when I reported it as a bug.

Concisely phrased or otherwise, I think that the important issue
is whether values in signal contexts are garbage.
(To back up for a second, a "signal context" is a data structure
that describes the state of a thread - machine registers, mostly -
at the time that an exception (or other "signal", in Unix terms)
occurred.  A handler function for an exception - a "signal handler" -
can opt to receive a pointer to such a data structure as an argument.)
To be able to handle an exception - to do something beyond printing
a message saying "something unexpected happened" and terminating the
program - a handler generally needs to be able to access and in some
cases modify the state of the suspended thread as it's presented in
the signal context.  If the signal context contains garbage, the
handler won't be able to proceed (and may itself generate exceptions.)
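
For readers who haven't seen one, here is a minimal sketch in C of a
handler that opts to receive the signal context, using the POSIX
sigaction() interface with SA_SIGINFO.  The names on_signal,
got_context and deliver_and_check are made up for illustration, and
this is not code from the lisp kernel.  The layout of the register
state inside the context differs between platforms, so the sketch only
records whether a context was supplied at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Set by the handler: 1 if it was handed a non-NULL context pointer. */
static volatile sig_atomic_t got_context = 0;

/* Three-argument handler, as selected by SA_SIGINFO.  The third
   argument points to a ucontext_t describing the interrupted thread.
   A real handler would read (and perhaps modify) saved registers
   through it; that part is platform-specific and omitted here. */
static void on_signal(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)info;
    got_context = (context != NULL);
}

/* Install the handler for SIGUSR1, raise the signal in the calling
   thread, and report whether the handler saw a context.
   Returns -1 if the handler could not be installed or raised. */
int deliver_and_check(void)
{
    struct sigaction sa;

    sa.sa_sigaction = on_signal;   /* use the 3-argument handler slot */
    sa.sa_flags = SA_SIGINFO;      /* ask for siginfo and context */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    got_context = 0;
    if (raise(SIGUSR1) != 0)
        return -1;
    return got_context;
}
```

Everything that matters to a handler like the ones described in this
message lives behind that third argument; if what it points to is
garbage, the handler has nothing to work with.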

To try to get back on-topic and answer your first question: OpenMCL
uses exceptions in several ways and for several reasons, and expects
handlers for those exceptions to have accurate context information.
In this context, it's important to remember that an exception is
something that's expected to be atypical (but not necessarily fatal).
Some examples include:

- type-checking.  If you disassemble the function CDR, you'll see
(along with some overkill prologue and epilogue code):

(CLRLWI IMM0 ARG_Z 30)  ; "clear the left 30 bits" = "extract the right 2 bits"
                         ;  of the value in the arg_z register; store the
 			;  2 rightmost bits of arg_z in the imm0 register
(TWNEI IMM0 1)          ; generate a trap exception if those bits aren't 1
 			;  Note that ppc32::tag-cons has the value 1.
(LWZ ARG_Z -1 ARG_Z)	; we know that the value in arg_z was a cons, so
 			;  it's safe to access its cdr (which happens to
 			;  be -1 - aka ppc32::cons.cdr - bytes from where
 			;  the tagged pointer was pointing.)

That's roughly equivalent to (and a lot simpler than)

(CLRLWI IMM0 ARG_Z 30)  ; same as above
(CMPWI IMM0 1)		; compare the extracted tag bits to ppc32::tag-cons
(BEQ+ @ok)		; branch around the error, predicting that the
 			;  branch will be taken
(call type-error)	; this is likely to really be at least a
 			; few instructions, and possibly several
@ok			; now we can take the CDR safely
(LWZ ARG_Z -1 ARG_Z)	; same as above

If you call (CDR 17) (or try to do that in safe compiled code), the
conditional trap will be taken; the hardware will invoke a handler
in the OS kernel, which will in turn call a handler in the lisp
kernel.  All the lisp kernel's handler knows at this point is that
a SIGTRAP "signal" was generated; it needs to be able to look
at the instruction(s) which preceded the trap in order to handle
it (by calling out to lisp code and reporting a type error.)  It
also seems better to hide all of this overhead and complexity in
the exceptional case and to streamline the typical case as much as
possible.
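
The tag test itself can be sketched in portable C, written in the
explicit compare-and-branch style of the second listing, since C has
no conditional trap instruction.  The names lispobj, make_cons_pointer,
is_cons and checked_cdr are hypothetical, and the representation of 17
in the usage note (shifted left two bits, with a tag of 0) is an
assumption made for the example.

```c
#include <stdint.h>

#define TAG_MASK 3u   /* the low two bits of a tagged value hold its tag */
#define TAG_CONS 1u   /* same value as ppc32::tag-cons in the listing */

typedef uintptr_t lispobj;   /* a tagged lisp value (hypothetical name) */

/* The cdr is stored first, so it sits -1 bytes from the tagged
   pointer, matching the LWZ offset in the listing. */
struct cons {
    lispobj cdr;
    lispobj car;
};

/* Tag an (at least 4-byte aligned) cons cell address. */
lispobj make_cons_pointer(struct cons *cell)
{
    return (lispobj)(uintptr_t)cell + TAG_CONS;
}

/* The CLRLWI step: extract the low two bits and compare with the tag. */
int is_cons(lispobj x)
{
    return (x & TAG_MASK) == TAG_CONS;
}

/* Explicit-check version of CDR.  On a tag mismatch it sets
   *type_error to 1 and returns 0 instead of trapping. */
lispobj checked_cdr(lispobj x, int *type_error)
{
    if (!is_cons(x)) {
        *type_error = 1;
        return 0;
    }
    *type_error = 0;
    return ((const struct cons *)(x - TAG_CONS))->cdr;
}
```

Under that assumption, checked_cdr((lispobj)17 << 2, &err) sets err to
1, because the low two bits are 0 rather than 1.  The trapping version
gets the same effect without the branch or the error-reporting code in
the instruction stream.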

- to help maintain the illusion of infinite memory.  Memory sadly isn't
an infinite resource, but allocating it is a lot simpler if you treat
the case when it's not available as exceptional.  You can disassemble
#'CONS if you want to see the details, but the general idea is:

(move-the-current-thread's-free-memory-pointer-closer-to-the-end-of-its-free-memory-pool)
(trap-if-it-went-past-the-end)
(use-the-newly-allocated-CONS-cell)

That's sort of the same idea; whenever a thread runs out of memory,
a trap is taken.  The handler for that trap may simply allocate
another chunk (typically about 64K bytes, IIRC) for the thread (which
is fairly simple) or GC (which is very complicated).  "Typically", the
trap isn't taken (64K is enough for about 8K CONS cells, for instance,
so the trap would only be taken about 1/8Kth of the time.)
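
A rough C model of that allocation pattern follows.  It assumes 8-byte
CONS cells and a fixed 64K chunk, uses an ordinary function call
(refill) where the real code takes a trap, and never reclaims
anything, so it is only a sketch.  The names thread_heap, refill and
alloc_cons are invented for the example.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_BYTES (64 * 1024)  /* per-thread chunk size from the text */
#define CONS_BYTES  8            /* two 32-bit words (assumption) */

/* Per-thread allocation state (hypothetical layout). */
struct thread_heap {
    char *free_ptr;          /* next free byte in the current chunk */
    char *limit;             /* one past the end of the current chunk */
    unsigned long refills;   /* how often the slow path has run */
};

/* The slow path: stands in for the trap handler.  Here it just grabs
   a fresh chunk; the real handler might run the GC instead. */
static int refill(struct thread_heap *h)
{
    char *chunk = malloc(CHUNK_BYTES);
    if (chunk == NULL)
        return 0;
    h->free_ptr = chunk;
    h->limit = chunk + CHUNK_BYTES;
    h->refills++;
    return 1;
}

/* The fast path: check the remaining room, then bump the pointer.
   (The real code bumps first and traps if it went past the end; C
   forbids forming such a pointer, so the check comes first here.) */
void *alloc_cons(struct thread_heap *h)
{
    void *cell;

    if (h->free_ptr == NULL ||
        (size_t)(h->limit - h->free_ptr) < CONS_BYTES) {
        if (!refill(h))
            return NULL;
    }
    cell = h->free_ptr;
    h->free_ptr += CONS_BYTES;
    return cell;
}
```

Starting from a zeroed thread_heap, the first 8192 calls to alloc_cons
take the slow path once, and the 8193rd takes it a second time.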

- there are other examples (and the details of this one are more
complicated), but the GC needs to be able to stop all other threads
(basically so that the thread running the GC has exclusive access to
memory), and it also needs to know what heap-allocated lisp objects
are referenced from those other threads (from their stacks and
registers.)  One way of stopping the other threads and finding out
where their stacks are and what their registers contain is to send
signals to those threads and examine the signal contexts set up by the
handlers for those signals.  If those signal contexts contain random
garbage, the GC will either get confused and crash or (at best)
neglect to retain objects that are referenced only from the stacks and
registers of other threads.

There are other examples, but there's a deep, pervasive assumption that
exception handling mechanisms work and that it's worthwhile to exploit
those mechanisms, and the general idea is that hiding complexity and 
overhead in the exceptional case (e.g., the GC) makes the typical case
faster/cleaner/simpler/all-of-the-above.  It might be -possible- to
distribute the complexity/overhead and make things slower/muddier/more
complex/all-of-the-above instead, but doing this because Rosetta's
broken (or, to use the new term, "only supports exceptions whose handlers
have garbage in their signal contexts") doesn't make a lot of sense.

[A last aside; the way that people usually do instruction-set emulation
these days is via some form of dynamic translation, e.g., like the
Just In Time translation of JVM bytecodes to native code sequences.
If Rosetta is translating sequences of PPC instructions into sequences
of x86 instructions (and trying to spend much more time executing
those native instructions than it does doing the translation), then,
since there isn't necessarily a 1:1 correspondence between the emulated
instructions and the native ones, it may be hard to maintain information
about the emulated program counter.  Information about the emulated (PPC)
program counter seems to be the most significant information that's
missing, and if I were to guess I'd guess that this isn't coincidental.
The fact that it's hard to maintain emulated PC information without 
slowing down JIT translation or execution doesn't mean that it's
impossible; if this is the explanation for why the PC is garbage in
exception contexts, then the question becomes whether it's acceptable
to offer "exceptions whose handlers receive garbage in their signal
contexts".  I don't believe that it is, Apple apparently does, and
they're the ones who ship the hardware and software.]
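
One way a translator could keep emulated PC information without
touching the translated code itself is a side table, built at
translation time, that maps offsets in the native code back to the
emulated instruction they came from.  The sketch below is purely
hypothetical (nothing here is known about how Rosetta works
internally); pc_map_entry and guest_pc_for are invented names.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per emulated instruction: where its translation starts in
   the native code, and the emulated PC it came from. */
struct pc_map_entry {
    uint32_t native_offset;
    uint32_t guest_pc;
};

/* Find the emulated PC for a native code offset.  The table must be
   sorted by native_offset in ascending order.  Returns 1 and stores
   the result in *guest_pc, or returns 0 if the offset lies before the
   first entry (or the table is empty). */
int guest_pc_for(const struct pc_map_entry *map, size_t n,
                 uint32_t native_offset, uint32_t *guest_pc)
{
    size_t lo = 0, hi = n;

    /* Binary search for the first entry that starts after the offset;
       the entry just before it covers the offset. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].native_offset <= native_offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;
    *guest_pc = map[lo - 1].guest_pc;
    return 1;
}
```

The lookup only has to happen when an exception is actually delivered,
so the cost lands in the exceptional case; the price is the memory for
the table and the work of building it during translation.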

To try to answer your second question (and to try to do so more
succinctly): Clozure does have a contract to port OpenMCL to x86-64
(AMD64) Linux, work is underway on this, and it's expected to be
available in 2Q 2006. (If you look very, very carefully at the Clozure
or OpenMCL web pages, you'll see mention of this, and we've been
planning to announce it on this list and other places soon.)  There
are some obvious similarities between the x86-64 architecture that
we're targeting and the x86-32 architecture that Apple's moving to,
but there are some significant differences as well (16 registers vs 8,
mostly).  Having done the x86-64 port, some parts of an x86-32 port
would be easier (e.g., it won't be necessary to write a new assembler,
endianness issues will have been identified, etc.) but some other
difficult parts would remain (mostly having to do with getting by with
so few registers and a generally brain-dead architecture, especially
if we want to maintain things like precise, compacting GC.  We do, I
think.)

If Apple were to announce that they're going to start offering
x86-64-based hardware (and a 64-bit-capable version of OSX), it'd be
relatively straightforward to port from x86-64 Linux to (hypothetical)
x86-64 Darwin.  Apple's made no such announcement; were they to do
so, the shiny new machines announced in the last week might start
looking less shiny to many people.

I'm quasi-serious about the whole "precise exceptions meme" issue;
if we phrase the issue as whether or not it's acceptable to leave
randomness in exception contexts, rather than pretend that this is
some sort of unsupported option, we (for some value of "we") might
be able to persuade Apple that this is not in fact acceptable.
Running under Rosetta isn't as attractive as running natively on
x86-32, but it looks better than not running at all.