[Openmcl-devel] Porting the OpenMCL Compiler

Thu Jul 7 15:56:06 PDT 2005

Gary Byers <gb at clozure.com> writes:

> Some PPC Linuces (I don't know about ARM Linux or WinCE) want to
> keep thread-specific information in a register (DarwinPPC64 wants to
> keep a pointer to the current pthread in R13), and the C runtime
> often gets confused and upset if this convention is violated.
> (OpenMCL's been trying to get away with violating it while lisp code
> is running, but weird things happen during exception handling and
> the next release will avoid Angering The TLS Gods.)  If the OS
> supports the concept of thread-local storage (TLS), it may be
> possible to make the lisp TCR be a thread-local variable (with a
> known offset within the block of thread-local variables that the
> ABI's thread-pointer points to), which would keep both the OS and
> Lisp happy without burning a register.

So far as I can tell, neither Windows CE nor the Linux ARM ABI requires
a dedicated thread register (though I haven't found anything that looks
like official Linux/ARM ABI documentation yet), so this sounds like a
good plan.

> It's nice to be able to cons inline, but I'm not sure if it's
> that important.
>
> On the PPC, "nargs" is only used in limited contexts (#args/#values),
> but (as far as the GC is concerned) it's just an immediate register.

Okay, here's where I am after shuffling things around:

r0      imm0            unboxed temp reg
r1      imm1            unboxed temp reg
r2      imm2/nargs      unboxed temp reg, number of arguments
r3      temp0           boxed temp reg
r4      temp1           boxed temp reg
r5      temp2           boxed temp reg
r6      save0           boxed callee-save reg
r7      save1           boxed callee-save reg
r8      save2           boxed callee-save reg
r9      arg_y           second to last argument
r10     arg_z           last argument
r11     fn              current function object
r12     vsp             value stack pointer
r13     sp              control stack pointer
r14     lr              link register
r15     pc              program counter

With the TCR (containing ALLOCPTR, ALLOCBASE, and TSP) in thread-local
storage.
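
As a rough illustration of how the table above could be turned into
something a GC can consult, here is a minimal Python sketch that derives
per-class register masks from it.  The register numbers and names come
from the table; the "boxed"/"unboxed"/"special" labels, the decision to
treat arg_y, arg_z and fn as boxed, and the names REGISTERS,
register_mask, NODE_REGS and IMM_REGS are assumptions made for the
example, not anything taken from OpenMCL.

```python
# Register assignments copied from the table above.  The third column is
# partly an assumption: arg_y, arg_z and fn are treated as boxed here.
REGISTERS = [
    (0, "imm0", "unboxed"),
    (1, "imm1", "unboxed"),
    (2, "imm2", "unboxed"),    # doubles as nargs
    (3, "temp0", "boxed"),
    (4, "temp1", "boxed"),
    (5, "temp2", "boxed"),
    (6, "save0", "boxed"),
    (7, "save1", "boxed"),
    (8, "save2", "boxed"),
    (9, "arg_y", "boxed"),
    (10, "arg_z", "boxed"),
    (11, "fn", "boxed"),
    (12, "vsp", "special"),
    (13, "sp", "special"),
    (14, "lr", "special"),
    (15, "pc", "special"),
]


def register_mask(kind):
    """Return a 16-bit mask with one bit set per register of the given kind."""
    mask = 0
    for number, _name, reg_kind in REGISTERS:
        if reg_kind == kind:
            mask |= 1 << number
    return mask


NODE_REGS = register_mask("boxed")    # registers a GC would scan as roots
IMM_REGS = register_mask("unboxed")   # raw bits, never scanned

if __name__ == "__main__":
    print(f"node regs: {NODE_REGS:#06x}, immediate regs: {IMM_REGS:#06x}")
```

Keeping the classification in one table means a change such as freeing
up nargs as a general immediate register only touches one line.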

> The sole reason why nargs isn't used (e.g., as another imm register
> in (SETF SBIT)) more generally is to simplify the interpretation of
> certain PPC trap instructions:
>
>    (twnei nargs 0)   ; means "trap if the current function got other
>                      ; than 0 arguments"
>
> whereas
>
>    (twnei imm0 0)   ; means "the object whose tag was extracted to imm0
>                     ; isn't a fixnum, but it really should be."
>
> If you worked harder at interpreting such traps, you could remove
> this restriction and make "nargs" a general-purpose immediate
> register.

This leads very nicely into my next architectural question. :-)

The ARM doesn't have the compare-and-trap instructions that the PPC
does.  There is an undefined instruction space that looks similar to
the UUOs on the PPC, and I *think* these instructions are conditionally
executed.  Testing on a StrongARM PDA confirms this, but I'm not sure
whether the architecture requires it; if it isn't guaranteed, I'll have
to do something completely different.

The undefined instruction space available for user extension has 16
bits of space to play with, so I should be able to fit some sort of
trap code (wrong # of arguments, wrong tag, etc), plus maybe a source
register and an immediate or other register (so I can spill the
expected argument count for 8191-arg functions to a register :-) into
the UUO.

So, the traps above might look sort of like (handwavy still):

    CMP nargs, #0
    MAKE_UUO(ne, check_nargs, nargs, 0)

or

    ANDS imm0, arg_z, #fixnum_mask    ; extract tag and set flags
    MAKE_UUO(ne, check_lisptag, arg_z, tag_fixnum)

Where MAKE_UUO is a macro taking an ARM condition code, a trap code
for the kernel, and a register/immediate to use in error reporting.
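
To make the bit-packing concrete, here is a small Python sketch of what
a MAKE_UUO encoder and the matching decoder in a trap handler might look
like.  Everything specific in it is an assumption for illustration: the
fixed bit pattern (bits 27..20 = 0111 1111, bits 7..4 = 1111, leaving
12 + 4 free bits), the split of the 16-bit payload into a 4-bit trap
code, a 4-bit register and an 8-bit operand, the trap numbering, and the
names make_uuo and decode_uuo.  It also takes for granted that the
condition field is honoured, which is exactly the open question above.

```python
# Hypothetical field layout; none of these names or numbers come from OpenMCL.
CONDITIONS = {"eq": 0x0, "ne": 0x1, "al": 0xE}        # standard ARM condition codes
TRAP_CODES = {"check_nargs": 0, "check_lisptag": 1}   # made-up trap numbering

# Fixed bits of the undefined-instruction pattern assumed here:
# bits 27..20 = 0111 1111 and bits 7..4 = 1111.
UNDEFINED_BITS = 0x07F000F0
UNDEFINED_MASK = 0x0FF000F0


def make_uuo(cond, trap, reg, operand):
    """Pack a condition and a 16-bit payload (trap:4, reg:4, operand:8)."""
    trap_code = TRAP_CODES[trap]
    if not 0 <= reg <= 15:
        raise ValueError("register number must fit in 4 bits")
    if not 0 <= operand <= 0xFF:
        raise ValueError("operand must fit in 8 bits")
    payload = (trap_code << 12) | (reg << 8) | operand
    # The free bits are split: upper 12 go in bits 19..8, lower 4 in bits 3..0.
    return ((CONDITIONS[cond] << 28) | UNDEFINED_BITS
            | ((payload >> 4) << 8) | (payload & 0xF))


def decode_uuo(word):
    """Recover the fields from a faulting instruction word."""
    if (word & UNDEFINED_MASK) != UNDEFINED_BITS:
        raise ValueError("not a UUO")
    payload = (((word >> 8) & 0xFFF) << 4) | (word & 0xF)
    return {
        "cond": word >> 28,
        "trap": payload >> 12,
        "reg": (payload >> 8) & 0xF,
        "operand": payload & 0xFF,
    }


if __name__ == "__main__":
    # MAKE_UUO(ne, check_nargs, nargs, 0), with nargs in r2 as in the table.
    word = make_uuo("ne", "check_nargs", 2, 0)
    print(hex(word), decode_uuo(word))
```

An 8-bit operand obviously cannot hold a large expected argument count
directly, which is where spilling the count to a register and naming
that register in the UUO would come in.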

Alternatively, if this is getting too weird/complicated/whatever,
maybe there could just be subprimitives for the failed traps?
Something like:

    ANDS  imm0, arg_z, #fixnum_mask
    MOVNE pc, .SPargz_not_fixnum      ; or whatever...

(I guess this assumes the subprimitive can call into the kernel proper
without invoking a trap, which probably raises other issues I haven't
looked into...)

I imagine an x86 port would need to do something different here too.

(So far, trap handling on Windows CE seems to be a bit of a mess.  If
you use "structured exception handling" you have to build stack frames
in very specific ways so the "virtual unwinder" can emulate function
prolog instructions backwards to walk up the stack.

The unwinder also needs information about each function, stored in a
section of the executable; I'm not sure how one would build these data
structures when compiling at run-time.

Win32 has SetUnhandledExceptionFilter, which can be used to get around
this, but it is (of course) missing on CE...)

> There might also be ways of getting some flexibility (in some sense
> of the word) and still keeping a preemptively scheduled GC happy.
> Suppose that you were about to enter a loop where you really needed
> a bunch of IMM regs and had no use for some node regs (in that
> loop).  You -might- be able to do something like:
>
>   [set bits in tcr to mark node regs as immediate]

Ah cool, I'll keep that option open for when I get that far. :-)
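
For the record, the idea of temporarily marking node registers as
immediate could be modelled along these lines.  This is only a toy
Python sketch: the TCR class, its node_regs field, the default mask
(r3..r11 boxed) and the helper names are all hypothetical, and the
"restore on exit" step is a guess at what would follow, not something
stated in the quoted message.

```python
from contextlib import contextmanager

# Hypothetical mask: bit n stands for register rn; r3..r11 assumed boxed.
DEFAULT_NODE_REGS = 0x0FF8


class TCR:
    """Toy stand-in for the thread context record; only one field is modelled."""

    def __init__(self):
        self.node_regs = DEFAULT_NODE_REGS


@contextmanager
def node_regs_as_immediate(tcr, regs):
    """Temporarily tell the (imaginary) GC not to scan the given registers."""
    saved = tcr.node_regs
    tcr.node_regs = saved & ~regs
    try:
        yield tcr
    finally:
        # Assumed: the original classification is restored after the loop.
        tcr.node_regs = saved


def gc_roots(tcr, register_file):
    """Values the toy GC would scan, given the current TCR mask."""
    return [register_file[n] for n in range(16) if (tcr.node_regs >> n) & 1]


if __name__ == "__main__":
    tcr = TCR()
    save0_save1 = (1 << 6) | (1 << 7)
    with node_regs_as_immediate(tcr, save0_save1):
        print(f"inside loop: {tcr.node_regs:#06x}")
    print(f"afterwards:  {tcr.node_regs:#06x}")
```

In a real implementation the borrowed registers would presumably have
to hold valid lisp objects again before the bits are flipped back,
since the GC could run at any point after that.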

> In the "last N args in registers" case, the naive approach is:
>
>    (call fn0)
>    (vpush arg_z)            ; it's handy for the return value to go
>                             ; in arg_z
>    (call fn1)
>    (vpush arg_z)
>    (call fn2)
>    (vpush arg_z)
>    (call fn3)
>    (vpush arg_z)
>    (call fn4)
>    (vpop arg_y)
>    (vpop arg_z)
>
> That's probably only slightly better (VPOP is 2 instructions on the
> PPC, so we're only saving a VPUSH and a load to get to this point),
> but we also don't have anything under the outgoing arguments that'd
> have to be cleaned up eventually.  That seems to add up to a slight
> win; on the 68K, VPOP was a smaller/faster instruction than the
> LOADS would have been, so the difference was more pronounced.

Neat, that makes sense.  I think this should be nice for ARM too since
it should be possible to VPOP the last arguments in a single
instruction:

    (VPOP-YZ) ==> LDMIA vsp!, {arg_y, arg_z}
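
One detail worth checking is which value ends up in which register,
since LDM always loads the lowest-numbered register from the lowest
address regardless of the order the registers are written in the list.
The Python sketch below simulates that rule and compares the single
LDMIA against two separate VPOPs.  It assumes a full-descending value
stack (VPUSH decrements vsp, then stores), which the message doesn't
actually specify; vpush, vpop and ldmia_wb are names made up for the
simulation.

```python
WORD = 4
REG_NUM = {"arg_y": 9, "arg_z": 10}   # from the register table in this message


def vpush(mem, regs, value):
    """Assumed VPUSH: pre-decrement vsp, then store."""
    regs["vsp"] -= WORD
    mem[regs["vsp"]] = value


def vpop(mem, regs, dest):
    """Assumed VPOP: load from vsp, then post-increment."""
    regs[dest] = mem[regs["vsp"]]
    regs["vsp"] += WORD


def ldmia_wb(mem, regs, base, reglist):
    """Model of LDMIA base!, {reglist}: ascending addresses, ascending
    register numbers, with the final address written back to the base."""
    addr = regs[base]
    for name in sorted(reglist, key=lambda r: REG_NUM[r]):
        regs[name] = mem[addr]
        addr += WORD
    regs[base] = addr


def fresh_state():
    """Value stack with "first" pushed before "second"."""
    mem, regs = {}, {"vsp": 0x1000, "arg_y": None, "arg_z": None}
    vpush(mem, regs, "first")
    vpush(mem, regs, "second")
    return mem, regs


if __name__ == "__main__":
    mem_a, regs_a = fresh_state()
    vpop(mem_a, regs_a, "arg_y")
    vpop(mem_a, regs_a, "arg_z")

    mem_b, regs_b = fresh_state()
    ldmia_wb(mem_b, regs_b, "vsp", ["arg_y", "arg_z"])

    print(regs_a == regs_b, regs_b)
```

Under these assumptions the single instruction behaves like
(vpop arg_y) followed by (vpop arg_z): arg_y receives the most recently
pushed value.  If the intended order were the other way round, the
register numbering of arg_y and arg_z would have to be swapped rather
than the order in the register list.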

James