[Openmcl-devel] ARM immediates, subprims and NIL

Mon Jul 11 01:54:07 PDT 2005

On Sun, 10 Jul 2005, James Bielman wrote:

> Hi,
>
> I've made a fair bit of progress on the LAP assembler for my OpenMCL
> ARM port.  Now that I've started messing around with it a bit, I've
> got some more architectural problems to solve.
>
> The format of immediates in ARM data processing instructions is rather
> weird.  Instead of an N-bit integer, an immediate is encoded as an
> 8-bit integer rotated right by an even number of bits.

I think that you can actually do several flavors of shifts as well as
rotates, and an immediate shift/rotate count can be anything between
0 and 31, inclusive.

This doesn't really affect your basic point (that 1-instruction
constant loads only work if the number of significant bits in the
constant is small).

>
> So, the following immediates are legal (using ARM LAP syntax):
>
>    (mov imm0 #xff)          ; #xff rotated right 0 bits
>    (mov imm0 #x2000)        ; #x80 rotated right 26 bits
>    (mov imm0 #xf000000f)    ; #xff rotated right 4 bits
>
> but these aren't:
>
>    (mov imm0 #x101)
>    (mov arg_z #x2015)
>    (mov pc #x50bc)
>
> So, assuming I want to be able to load NIL into a register in a single
> instruction (which seems like a good goal), I'll need to shuffle
> things around a bit.
>
> Furthermore, Windows CE helpfully declares the first 64k of memory as
> off-limits, so this wouldn't work there anyway.
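
The rotated-immediate rule quoted above is easy to check mechanically.
What follows is only a sketch in Python (not part of any assembler; the
names ror32 and encode_arm_immediate are made up for illustration),
assuming the usual encoding of an 8-bit value rotated right by an even
count between 0 and 30:

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encode_arm_immediate(constant):
    """Return (imm8, rotate_right) if `constant` can be written as an
    8-bit value rotated right by an even count, else None."""
    constant &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):
        # Undo a right rotation by rotating right by (32 - rotation),
        # which is the same as rotating left by `rotation`.
        imm8 = ror32(constant, 32 - rotation)
        if imm8 <= 0xFF:
            return imm8, rotation
    return None


# The constants from the quoted examples.
for constant in (0xFF, 0x2000, 0xF000000F, 0x101, 0x2015):
    print(hex(constant), encode_arm_immediate(constant))
```

A constant may have more than one valid encoding; the sketch just
returns the first one it finds.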

NIL's basically a really popular constant (e.g., more functions
reference NIL more often than they reference PI or the symbol FOO).
I'm not sure that it's -so- frequently-used that it warrants being
kept in a dedicated register, especially if registers are in short
supply.  (You'd like it to be in a register if it was referenced in a
loop or otherwise referenced frequently, but the same is true of PI
and 'FOO.)

If you have control of the address space above #x10000 (and if you
want to keep the PPC32 tagging scheme), you can probably get NIL
into a register with 2 ALU instructions (assuming that NIL is
at #x20015):

     (mov temp0 ($ (lsl 2 16)))   ; whatever the assembler syntax is
     (or temp0 temp0 ($ #x15))

If NIL's going to be referenced often in a given function (and/or
in one or more loops), you'd probably have wanted to target SAVE0
instead of TEMP0.  If the compiler doesn't do a good job of this,
the 2 instructions above are more than we're used to seeing, but
not the end of the world.  (At least, I don't think so).
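
To make the 2-instruction claim concrete, here is a rough Python sketch
(again, just an illustration; split_into_immediates is a made-up name)
that greedily cuts a 32-bit constant into pieces that each fit the
rotated 8-bit immediate form.  The first piece would go in a MOV and
the rest in OR instructions.  It is not optimal: it ignores
wrap-around rotations and tricks like MVN, so it can use more pieces
than strictly necessary, but it never needs more than 4.

```python
MASK32 = 0xFFFFFFFF


def rol32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value << amount) | (value >> (32 - amount))) & MASK32


def is_arm_immediate(constant):
    """True if `constant` is an 8-bit value rotated by an even count."""
    return any(rol32(constant, r) <= 0xFF for r in range(0, 32, 2))


def split_into_immediates(constant):
    """Greedily split `constant` into chunks that are each encodable
    and that OR together to give the original value."""
    constant &= MASK32
    chunks = []
    while constant:
        # Position of the lowest set bit, rounded down to an even bit.
        low = (constant & -constant).bit_length() - 1
        low -= low % 2
        # Take whatever falls in the 8-bit window starting there.
        chunk = constant & ((0xFF << low) & MASK32)
        chunks.append(chunk)
        constant &= ~chunk & MASK32
    return chunks


# NIL at #x20015, as assumed in the text above.
print([hex(c) for c in split_into_immediates(0x20015)])
```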

My argument that it'd be acceptable to treat NIL not-so-specially
might depend on the compiler doing a better job of both local
and global register allocation.  If you disassemble:

(defun both-foo (x y)
   (if (and (eq x 'foo) (eq y 'foo))
     t
     nil))

you'll see:

   (LWZ ARG_Y 4 VSP)
   (LWZ ARG_Z 'FOO FN)
   (CMPW ARG_Y ARG_Z)
   (BNE L60)
   (LWZ ARG_Y 0 VSP)
   (LWZ ARG_Z 'FOO FN)
   (CMPW ARG_Y ARG_Z)
   (BNE L60)

the second load of 'FOO is redundant (and it's not too hard to prove
that.)
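
As a sketch of the kind of proof involved (this is not how the
compiler works, and the tuple representation and function name are
made up for illustration), a pass over straight-line code only has to
remember which constant each register currently holds.  The sketch
assumes that loads relative to FN are loads of immutable constants,
that CMPW and BNE write no general-purpose register, and that any
other instruction invalidates everything it has learned:

```python
NON_WRITING = {"CMPW", "BNE"}  # assumed not to write any register


def drop_redundant_constant_loads(instructions):
    """Drop an LWZ from FN when the destination register is already
    known to hold that same constant."""
    known = {}  # register name -> constant it currently holds
    kept = []
    for insn in instructions:
        op, args = insn[0], insn[1:]
        if op == "LWZ":
            dest, source, base = args
            if base == "FN" and known.get(dest) == source:
                continue  # register already holds this constant
            known.pop(dest, None)  # dest is overwritten
            if base == "FN":
                known[dest] = source
        elif op not in NON_WRITING:
            known.clear()  # be conservative about unknown instructions
        kept.append(insn)
    return kept


code = [
    ("LWZ", "ARG_Y", 4, "VSP"),
    ("LWZ", "ARG_Z", "'FOO", "FN"),
    ("CMPW", "ARG_Y", "ARG_Z"),
    ("BNE", "L60"),
    ("LWZ", "ARG_Y", 0, "VSP"),
    ("LWZ", "ARG_Z", "'FOO", "FN"),
    ("CMPW", "ARG_Y", "ARG_Z"),
    ("BNE", "L60"),
]
for insn in drop_redundant_constant_loads(code):
    print(insn)
```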

>
> I'm thinking that nil will need to go in a register, since even if I
> move nil to a location representable as a rotated immediate,
> nil-relative objects probably won't be.  I'm not sure which register
> to give up for this.
>
> So that leaves me with:
>
>   (mov arg_z rnil)               ; return nil
>   (add arg_z rnil t_offset)      ; return t
>
> (This is also a problem for loading immediate fixnums.  I've read it
> can take up to three instructions to load an arbitrary 32-bit constant
> using arithmetic instructions.  Maybe fixnums that can't be
> represented as immediates can go in the function's constant vector
> like other constant objects?  ARM assemblers do something similar with
> "literal pools".  Perhaps in the future I could do something fancier
> when optimize speed > optimize size or something...)

On PPC64, it can take as many as 5 instructions to load a 64-bit
immediate.  The literature that I've seen suggests that it's almost
always faster to load large constants from memory: if there's a data 
cache hit, the single LD instruction is likely to be faster than
the 5 ALU instructions.
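
For reference, the 5-instruction case is the familiar lis / ori /
sldi / oris / ori sequence, which builds the constant 16 bits at a
time.  A small Python model of it (a sketch only; the function names
are made up, and the model covers just what these four opcodes do to a
single register):

```python
MASK64 = (1 << 64) - 1


def load_immediate_64(value):
    """Return the 5-step sequence that builds `value` 16 bits at a time."""
    value &= MASK64
    p3, p2, p1, p0 = [(value >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
    return [("lis", p3), ("ori", p2), ("sldi", 32), ("oris", p1), ("ori", p0)]


def simulate(steps):
    """Run the sequence on a single 64-bit register and return it."""
    reg = 0
    for op, operand in steps:
        if op == "lis":
            # lis shifts its 16-bit immediate left 16 and sign-extends.
            signed = operand - 0x10000 if operand & 0x8000 else operand
            reg = (signed << 16) & MASK64
        elif op == "ori":
            reg |= operand
        elif op == "oris":
            reg |= operand << 16
        elif op == "sldi":
            # Shifting left also discards any sign-extension bits.
            reg = (reg << operand) & MASK64
    return reg


value = 0xFEDCBA9876543210
assert simulate(load_immediate_64(value)) == value
```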

>
> There is a similar problem with subprimitives.  ARM doesn't have a
> branch absolute instruction, but you can use PC as a destination
> register in data processing instructions.
>
> Because of the immediate representation, I can't do:
>
>   (mov lr pc)
>   (mov pc #x50bc)   ; this would have to be >#xffff on wince

Can you load the PC from the constant pool?
>
> I'm not sure what the best way to solve this is yet.  Could I move the
> subprimitive jump table relative to nil as well?  If this worked, a
> subprim call could look something like:
>
>   (mov lr pc)
>   (add pc rnil .SPwhatever-offset)
>   ;; or maybe even (ldr pc (rnil .SPwhatever-offset))
>
> There's probably some sneakier, better way to do this...

The subprimitive jump table on the PPC supposedly costs nearly nothing:
unconditional branches get folded in the pipeline.

One register that I don't think that we can easily get rid of is the
TCR (if you lose track of it, you have to do something like
(#_pthread_getspecific) to recover it.)  We're used to thinking of the
TCR as containing strictly thread-specific info, but (if we're willing
to live with slightly larger TCRs) we can certainly copy some global
info into it.  Suppose that a given thread's TCR pointed into the
middle of a block of memory; the few-dozen words at non-negative
offsets from the TCR could be the current volatile thread-specific
stuff and the few dozen (hundred?) words at negative offsets would
be copies of static information (like subprimitive addresses).

That'd make a subprim call something like:

     (mov lr pc)
     (ldr pc (rcontext .SPfuncall))  ; .SPfuncall is negative and a multiple
                                     ;  of 4
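
A toy Python model of that layout, just to pin down the offset
arithmetic.  Everything here is hypothetical: the addresses are
invented, SPexample1 and SPexample2 are placeholder names, and a
Python list of words stands in for the block of memory.

```python
WORD = 4  # bytes per word


def build_tcr_block(subprim_addresses, thread_words):
    """Lay out static subprim addresses below the TCR and thread-specific
    words at and above it.  Returns (memory, tcr_index, offsets)."""
    names = list(subprim_addresses)
    # The first name sits immediately below the TCR, at offset -4.
    static = [subprim_addresses[name] for name in reversed(names)]
    memory = static + [0] * thread_words
    tcr_index = len(static)  # the TCR points into the middle
    offsets = {name: -WORD * (i + 1) for i, name in enumerate(names)}
    return memory, tcr_index, offsets


def load_word(memory, tcr_index, byte_offset):
    """Model of (ldr dest (rcontext byte_offset))."""
    return memory[tcr_index + byte_offset // WORD]


subprims = {"SPfuncall": 0x50BC, "SPexample1": 0x50C0, "SPexample2": 0x50C4}
memory, tcr, offsets = build_tcr_block(subprims, thread_words=24)
assert load_word(memory, tcr, offsets["SPfuncall"]) == 0x50BC
```

ARM's word LDR takes a 12-bit byte offset in either direction, so a
table of this kind has room for roughly a thousand entries below the
TCR before a single instruction stops being enough.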

>
> James