[Openmcl-devel] Why does this "cheat"/"lie" not work?...

Sat Feb 6 07:20:17 PST 2010

Gary Byers wrote:
> 
> A short version is that we have to assume that the GC can run on any
> instruction boundary (because of activity in other threads.)  In order
> for the GC to be tell whether a register contains pointer contains
> a tagged lisp object (a "node" in the connectivity graph) or just
> some random immediate value, some conventions have to be adopted
> and very rigorously followed.
...
> CCL's support for operations on unboxed integers that fit in a machine
> word  in general is poor, but what support exists is oriented towards
> unsigned integers (there's no good reason for excluding signed integers;
> that support just isn't there.)
> 
> In x86-64 CCL, the inner part of a loop that copies between two
> vectors of type (SIMPLE-ARRAY (UNSIGNED-BYTE 64) (*)) A and B looks
> like:
> 
> ;;; (aref b i)
> L29
>    [29]    (movq (@ -5 (% save2) (% save0)) (% imm0))
> 
> ;;; (setf (aref a i) (aref b i))
>    [34]    (movq (% imm0) (@ -5 (% save1) (% save0)))
> 
> which is fairly reasonable; the analogous case with vectors
> of element type (SIGNED-BYTE 64) is considerably less so: there's
> some completely unnecesary boxing and unboxing between those two
> instructions.

AFAICS on 64 bit machine (UNSIGNED-BYTE 32) time (UNSIGNED-BYTE 32)
multiplication is done via .SPBUILTIN-TIMES and code using
(UNSIGNED-BYTE 32) is doing a lot of work converting between fixnum
format and (UNSIGNED-BYTE 32).  For example body of inner loop of
32 bit copy generates

;;; (ELT32 |op| \j)
  [78]    (movq (% save0) (% imm0))
  [81]    (shrq (% imm0))
  [84]    (movl (@ -5 (% save2) (% imm0)) (% imm0.l))
  [89]    (imulq ($ 8) (% imm0) (% temp0))
  [93]    (movq (% temp0) (% imm0))
  [96]    (shrq ($ 3) (% imm0))

;;; (SEQ (EXIT (SETELT32 |np| \j (ELT32 |op| \j))))
  [100]   (movq (% save0) (% imm2))
  [103]   (shrq (% imm2))
  [106]   (movl (% imm0.l) (@ -5 (% save1) (% imm2)))

So, array element is converted to fixnum and then immediately back to
immediate form.  Also, the is a lot of unnecessay copies, for
the loop above:

(movl (@ -5 (% save2) (% imm2)) (% imm0.l))
(movl (% imm0.l) (@ -5 (% save1) (% imm2)))

does the real work and there is still imm1 left to keep upper bound
of the loop...

> I remember things like "polling for interrupts on function entry".
> (Not fondly, but I remember them.)  Putting arbitrary values in arbitrary
> registers at arbitrary times because of a belief that the GC can only
> occur on certain instruction boundaries is something that I hope to have
> completely forgotten.
> 
> There are two issues here:
> 
> 1) CCL's support for dealing with immediate word-sized integers (that aren't
>     known to be FIXNUMs) is poor in general.  It can be improved, but:
> 2) There are and always will be constraints on register usage in CCL (imposed
>     by the combination of native threads and precise GC) that will likely
>     limit some of those improvements.  (If there were a choice where all other
>     things were equal and the options were between generating better code for
>     unboxed arithmetic and reliable support for precise GC/native threads -
>     and there were no way to have both - then I can't see that there's a real
>     choice here.)
> 
> In your example, we're a ways from the kind of choice described above: it's
> certainly -possible- to avoid the boxing here without violating GC constraints.
> It may not be possible to avoid some kind of runtime manipulation of the
> node-regs-mask on x8632 (or GC or other costs associated with other schemes:
> any sort of dynamic register partitioning scheme is going to cost -something-
> somewhere.)  In this case, if the CCL code avoided the boxing but needed
> to do a lot of dynamic register partitioning on x8632, I wouldn't feel too bad
> about that.
>

Maybe I am stating obvious, but have you considered keeping info about
changes in register use out of band?  Then garbage collector would
interpret the info and compute register mask.  Seems much cheaper
(both in space and in execution time) than current code to change mask.
Also, with sufficiently clever garbage collector it should be possible
to get effect of change of mask exactly at point when value transitions
from immediate to node (that is eliminate need to zero register before
change).

BTW: Is it possible to use dynamic partitioning in 64-bit code?  Only
3 immediate registers are very severe limitation.

-- 
                              Waldek Hebisch
hebisch at math.uni.wroc.pl