[Openmcl-devel] Porting the OpenMCL Compiler

Wed Jul 6 03:58:42 PDT 2005

On Wed, 6 Jul 2005, James Bielman wrote:

> Hi,
>
> I've been spending a fair bit of time studying the OpenMCL internals
> and I now think I'm dangerous enough to consider a port to ARM
> (probably Linux first, then hopefully Windows CE).
>

I actually have a Zaurus (under a pile of papers on my desk, I think ...)
and occasionally think of the same thing.

Any processor with a "SoftWare Interrupt if Not Equal" (SWINE)
instruction can't be all bad ...

> Obviously this is a huge task and I certainly don't expect to get very
> far anytime soon, but hopefully I can learn a lot from the process
> either way.
>
> I'm planning to start out implementing kernel subprimitives and the
> LAP assembler, but I'd like to get some advice on register usage,
> since the ARM has far fewer registers than PowerPC.
>
> Basically, there are 16 GPRs, r0-r15, except r15 is the PC, r14 is the
> link register, and r13 is typically the control stack pointer.  So,
> apart from any tricks to be done with reusing lr for other purposes
> (although I'd think that, being a GPR, it could fulfill the same
> purpose as the LOC-PC register on PPC?), we are left with 13 GPRs.

Yes; I don't think that you'd need a separate LOC-PC register on the
ARM (or if you did, you wouldn't need it very often.)

>
> I don't know if there are any guidelines about how many registers are
> necessary for it to be worth partitioning into boxed and unboxed.  If
> this isn't enough then obviously life gets more complicated...
>

On the 68K (which only had 16 registers), I remember people who wrote
LAP code saying that there never seemed to be enough immediate
registers.  (I don't remember how many there were, and they came in
two flavors, so the problem was often that there weren't enough 
unboxed data or address registers but more than enough of the other
kind.)

I think that when the PPC compiler does (SETF (SBIT bv idx) val) -
and neither "idx" nor "val" is a constant - there are about 4 live values
in immediate registers.  You can do things a little differently 
(re-calculate some of these values), and it may be more convenient
to spill some of these values to a stack than it's been on the PPC,
but it's also desirable to make primitives fast (and, on the ARM,
compact.)

> I wrote a little Lisp program to loop over all the fbound symbols in
> the OpenMCL image and disassemble them to a file, then grepped the
> output (hopefully correctly) to count register usage.  Here are the
> results:
>
> ARG_Z           132643
> IMM0            50129
> VSP             48012
> ARG_Y           34665
> FN              33836
> SAVE0           32224
> TEMP3           19587
> SAVE1           18077
> ALLOCPTR        11647
> SAVE2           11001
> TEMP4           9398
> LOC-PC          9239
> ARG_X           8769
> NARGS           8744
> SAVE3           7772
> TEMP0           7249
> TSP             5528
> SAVE4           5155
> TEMP2           4285
> SAVE5           3717
> SAVE7           3431
> SP              2963
> SAVE6           2747
> IMM1            2436
> ALLOCBASE       2045
> IMM2            839
> IMM3            517
> TEMP1           446
> RCONTEXT        293
> IMM4            237
> IMM5            13
>
> Based on this (and some possibly incorrect common sense), here's what
> I've got so far:

The static breakdown is interesting.  I'm not sure how to obtain a
dynamic breakdown; I'd -guess- that it'd show similar results, but any
differences would also be interesting.
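
One way to reproduce a static count like this without relying on
grep is to tokenize the disassembly text and tally whole register
names (a plain substring match for "sp" would also hit "vsp" and
"tsp").  What follows is only a sketch in portable Common Lisp:
SPLIT-TOKENS, COUNT-REGISTER-USES and the sample instruction lines
are made up for illustration and are not part of OpenMCL.  Real input
would be whatever DISASSEMBLE prints, and that format is
implementation-dependent.

```lisp
;; Illustrative sample only; not real OpenMCL disassembler output.
(defparameter *sample-lines*
  '("(lwz arg_y 4 vsp)"
    "(lwz arg_z 0 vsp)"
    "(add arg_z arg_y arg_z)"
    "(mtlr loc-pc)"))

(defun split-tokens (line)
  "Split LINE into tokens of alphanumerics, underscores and hyphens."
  (let ((tokens '())
        (start nil))
    (flet ((token-char-p (c)
             (or (alphanumericp c) (char= c #\_) (char= c #\-))))
      (dotimes (i (length line))
        (if (token-char-p (char line i))
            (unless start (setf start i))
            (when start
              (push (subseq line start i) tokens)
              (setf start nil))))
      ;; A token may run to the end of the line.
      (when start
        (push (subseq line start) tokens)))
    (nreverse tokens)))

(defun count-register-uses (lines register-names)
  "Return an alist of (NAME . COUNT), highest count first.
Names are matched case-insensitively and reported in upper case."
  (let ((counts (make-hash-table :test #'equalp)))
    (dolist (line lines)
      (dolist (token (split-tokens line))
        (when (member token register-names :test #'string-equal)
          (incf (gethash token counts 0)))))
    (let ((result '()))
      (maphash (lambda (name count)
                 (push (cons (string-upcase name) count) result))
               counts)
      (sort result #'> :key #'cdr))))
```

Calling (count-register-uses *sample-lines* '("arg_z" "arg_y" "vsp"
"sp" "loc-pc")) on the sample gives ARG_Z a count of 3, ARG_Y and VSP
a count of 2 each and LOC-PC a count of 1; SP does not appear at all,
because "vsp" is a single token.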

>
> r0      imm0            unboxed temp reg
> r1      imm1            unboxed temp reg
> r2      temp0           boxed temp reg
> r3      temp1           boxed temp reg
> r4      save0           boxed caller-save reg
> r5      save1           boxed caller-save reg
> r6      arg_y           second to last argument
> r7      arg_z           last argument
> r8      nargs           number of function arguments
> r9      allocptr        heap free pointer
> r10     fn              current function object
> r11     rcontext        thread context register
> r12     vsp             value stack pointer
> r13     sp              control stack pointer
> r14     lr              link register
> r15     pc              program counter
>

Back in the 80s, people did some performance studies (there was one or
more from the University of Utah and there were some by Benjamin Zorn
at the University of Colorado that I remember) of lisp programs;
someone determined that the mean number of arguments to a function was
a little under 2 (counted dynamically) and a little over 2 (counted
statically), so 2 argument registers sounds about right.

Some PPC Linuces (I don't know about ARM Linux or WinCE) want to keep
thread-specific information in a register (DarwinPPC64 wants to keep
a pointer to the current pthread in R13), and the C runtime often
gets confused and upset if this convention is violated.  (OpenMCL's
been trying to get away with violating it while lisp code is running,
but weird things happen during exception handling and the next
release will avoid Angering The TLS Gods.)  If the OS supports the
concept of thread-local storage (TLS), it may be possible to make
the lisp TCR be a thread-local variable (with a known offset within
the block of thread-local variables that the ABI's thread-pointer points
to), which would keep both the OS and Lisp happy without burning a
register.

It's nice to be able to cons inline, but I'm not sure if it's
that important.

On the PPC, "nargs" is only used in limited contexts (#args/#values),
but (as far as the GC is concerned) it's just an immediate register.
The sole reason why nargs isn't used (e.g., as another imm register
in (SETF SBIT)) more generally is to simplify the interpretation
of certain PPC trap instructions:

   (twnei nargs 0)   ; means "trap if the current function got other
                     ; than 0 arguments"

whereas

   (twnei imm0 0)   ; means "the object whose tag was extracted to imm0
                    ; isn't a fixnum, but it really should be."

If you worked harder at interpreting such traps, you could remove
this restriction and make "nargs" a general-purpose immediate register.
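
As a toy illustration of the convention just described, here is how
a trap handler could tell the two cases apart purely from the
register operand.  This is a sketch in Common Lisp, not OpenMCL's
handler: INTERPRET-TRAP and the keywords it returns are hypothetical,
and a real handler would decode the instruction word at the faulting
PC rather than a LAP-style list.

```lisp
(defun interpret-trap (form)
  "Classify a LAP-style trap FORM such as (TWNEI NARGS 0).
Only the two cases discussed in the text are recognized."
  (destructuring-bind (opcode register operand) form
    (cond ((not (eq opcode 'twnei))
           :unrecognized)
          ;; NARGS is reserved for argument/value counts, so a trap
          ;; on it can only be an argument-count check.
          ((eq register 'nargs)
           :wrong-number-of-arguments)
          ;; IMM0 holding an extracted tag, compared against 0.
          ((and (eq register 'imm0) (eql operand 0))
           :not-a-fixnum)
          (t
           :unrecognized))))
```

If NARGS were allowed to hold arbitrary unboxed values, the second
clause would no longer be safe, and the handler would need more
context (for instance the surrounding instructions) to decide what
the trap meant.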

There might also be ways of getting some flexibility (in some sense
of the word) and still keeping a preemptively scheduled GC happy.
Suppose that you were about to enter a loop where you really needed
a bunch of IMM regs and had no use for some node regs (in that loop).
You -might- be able to do something like:

   (vpush save0)
   (vpush save1)
   (li save0 0)   ; I'm lapsing into PPC assembler here ...
   (li save1 0)
   ;; now set some bits in the TCR somewhere that says that
   ;; save0/save1 are IMM regs, temporarily.
   (load save0 unboxed-stuff)
   (add save0 save1 save0) ; etc
   (li save0 0)
   (li save1 0)
   ;; clear those hypothetical TCR bits.
   (vpop save1)
   (vpop save0)

The GC'd have to cooperate somehow, and you'd have to be very
disciplined about using this, but it looks like it could be
made to work safely (and might be very useful.)
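
The GC side of that idea might be modelled roughly as follows.  This
is a sketch in Common Lisp under invented names: TOY-TCR, its
UNBOXED-OVERRIDES slot, GC-ROOT-REGISTERS and
CALL-WITH-REGISTERS-UNBOXED are all hypothetical, and the register
list is just an example partition.  The point is only that the GC
consults per-thread state before treating a node register as a root.

```lisp
;; Example partition: registers normally assumed to hold boxed values.
(defparameter *node-registers*
  '(temp0 temp1 save0 save1 arg_y arg_z fn))

(defstruct toy-tcr
  ;; Hypothetical per-thread state: node registers that temporarily
  ;; hold unboxed data and must not be scanned.
  (unboxed-overrides '()))

(defun gc-root-registers (tcr)
  "Registers the GC should scan for the thread described by TCR."
  (remove-if (lambda (reg)
               (member reg (toy-tcr-unboxed-overrides tcr)))
             *node-registers*))

(defun call-with-registers-unboxed (tcr registers thunk)
  "Call THUNK with REGISTERS flagged as unboxed in TCR.
The flags are restored on exit, even on a non-local exit."
  (let ((saved (toy-tcr-unboxed-overrides tcr)))
    (setf (toy-tcr-unboxed-overrides tcr) (union registers saved))
    (unwind-protect (funcall thunk)
      (setf (toy-tcr-unboxed-overrides tcr) saved))))
```

In real code the flag update and the zeroing of the registers would
have to be ordered so that a GC arriving between any two instructions
never sees unboxed bits in a register it still believes is boxed;
the model above ignores that.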

(I'm thinking about doing a port to a totally bizarre register-starved
architecture, and would think seriously about this approach.)

> This is assuming the temp stack pointer could be put in memory
> somewhere, perhaps in the tcr?  Also, this doesn't seem like very many
> immediate registers, but according to the register counts for PowerPC,
> maybe this isn't so bad?

It's certainly true that the compiler rarely uses more than one or
two imm regs at a time (I think that (SETF (SBIT ...)) is either the
worst case or very close to it.)  I guess that the question becomes
"if you -need- more than 2 imm regs, how bad is it not to have them?",
and this probably comes up in a few LAP functions and subprimitives.
My intuition is to want at least 3 and possibly 4, but there may be
ways of avoiding the hard cases.

>
> I'm not sure what to use for NFN or FNAME either, hmm.

NFN's basically an extra argument on the PPC (we sort of call the
code vector and pass NFN as an argument.)  Splitting things up that
way has some nice properties (code-vectors are position-independent
and can be kept in readonly memory, FUNCTIONs are very orthogonal
and easy for the GC to deal with.)  On the ARM, it -may- be better
to keep the code and constants in the same object, so the whole
NFN/FN thing may disappear.  (The "current function" is "where the
PC is", sort of.)

FNAME is part of the canonical calling sequence only so that
if you call the thing that goes in the function cell of non-fbound
symbols it can say what symbol had no function definition.

>
> Also, I'm curious why OpenMCL uses registers for the last arguments
> instead of the first, is there a sneaky reason why this is so?

It was a little more dramatic on the 68K than it is on the PPC, and
it only really makes a difference when calling functions that take
arguments that have to be passed on the stack.  CL requires that
arguments be evaluated from left to right (or at least that this is
true with respect to side-effects.)  Suppose that we have a call to
the function FOO which takes 5 arguments, in this case the results
of calling FN0, FN1, FN2, FN3, and FN4 (each of which takes 0 args
but which are assumed to have side-effects, so we must call them
in that order.)

If we passed the first 3 arguments in registers (arg_a, arg_b, arg_c)
and remaining arguments on the stack, we'd get code like this:

    (call fn0)
    (vpush canonical-result-reg) ; it doesn't matter here what
                                 ; register is "canonical-result-reg"
    (call fn1)
    (vpush canonical-result-reg)
    (call fn2)
    (vpush canonical-result-reg)
    (call fn3)
    (vpush canonical-result-reg)
    (call fn4)
    (vpush canonical-result-reg)
    (load-word arg_a 4 vsp)
    (load-word arg_b 3 vsp)
    (load-word arg_c 2 vsp)

At this point, we can make the call to FOO; there are two outgoing
arguments on the top of the stack and 3 words (used to evaluate the
first 3 outgoing args) underneath them.  If we did things this way,
the caller would have to discard those 3 words either before the
call (moving the outgoing arguments down) or after.  (You can certainly
avoid this worst-case scenario by using other temporary stack locations
to hold the first 3 args or by using non-volatile registers instead
of stack temporaries, but neither of these strategies is absolutely
free.)

In the "last N args in registers" case, the naive approach is:

   (call fn0)
   (vpush arg_z)              ; it's handy for the return value to go
                              ; in arg_z
   (call fn1)
   (vpush arg_z)
   (call fn2)
   (vpush arg_z)
   (call fn3)
   (vpush arg_z)
   (call fn4)
   (vpop arg_y)
   (vpop arg_x)

That's probably only slightly better (VPOP is 2 instructions on the
PPC, so we're only saving a VPUSH and a load to get to this point),
but we also don't have anything under the outgoing arguments that'd
have to be cleaned up eventually.  That seems to add up to a slight
win; on the 68K, VPOP was a smaller/faster instruction than the LOADS
would have been, so the difference was more pronounced.
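
The difference in what is left on the value stack can be checked
with a small model.  This is a sketch in Common Lisp, assuming three
argument registers (arg_a/arg_b/arg_c in the first scheme,
arg_x/arg_y/arg_z in the second) and at least that many arguments.
The function names are invented for the example, and stacks are
lists with the top of the stack first.

```lisp
(defun first-n-in-registers (results n)
  "Model passing the first N of RESULTS in registers.
Every result is pushed, then the first N are loaded from the stack.
Returns the register contents and the final stack."
  (let ((stack '()))
    (dolist (r results)
      (push r stack))                       ; (vpush canonical-result-reg)
    (values (loop for i below n             ; (load-word arg_a 4 vsp) ...
                  collect (nth (- (length stack) 1 i) stack))
            stack)))

(defun last-n-in-registers (results n)
  "Model passing the last N of RESULTS in registers.
All results but the last are pushed; the last stays in arg_z and
the N-1 before it are popped back into registers."
  (let ((stack (reverse (butlast results)))       ; (vpush arg_z) x4
        (registers (list (car (last results)))))  ; fn4's result in arg_z
    (dotimes (i (1- n))
      (push (pop stack) registers))               ; (vpop arg_y), (vpop arg_x)
    (values registers stack)))

(defun leftover-words (stack total-args n)
  "Words on STACK beyond the TOTAL-ARGS - N outgoing stack arguments."
  (- (length stack) (- total-args n)))
```

With five results and n = 3, the first scheme ends with all five
words still on the stack (three of them dead weight under the two
outgoing arguments), while the second ends with exactly the two
outgoing stack arguments and nothing to clean up.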

>
> James