[Openmcl-devel] Why does this "cheat"/"lie" not work?...

Sat Feb 6 09:59:45 PST 2010

On Fri, 2010-02-05 at 18:48 -0700, Gary Byers wrote:

> Matt gave a lightning talk at last year's ILC explaining what the bit-setting
> and clearing are all about.
> 
> <http://www.thoughtstuff.com/rme/weblog/?p=17>

Good stuff!

> A short version is that we have to assume that the GC can run on any
> instruction boundary (because of activity in other threads.)
> ... Good short description of issue ...

Thanks.  So this is indeed thread/gc based, but due more to the paucity
of registers on x8632.

> > Also, what is the (calll (@ .SPMAKES32)) for?  Again, given the context
> > of all the lying.
> 
> Um, "lack of support for immediate operations on signed integers of
> the native word size" ?

Hmmmm :-)

> CCL's support for operations on unboxed integers that fit in a machine
> word  in general is poor, but what support exists is oriented towards
> unsigned integers (there's no good reason for excluding signed integers;
> that support just isn't there.)

That makes the unsigned version seem even odder.  Here is the generated
code when type spec is UNsigned 32:

(disassemble (symbol-function 'replace-int-vec))
  [0]     (recover-fn)
  [5]     (movl (% ebp) (@ 16 (% esp)))
  [9]     (leal (@ 16 (% esp)) (% ebp))
  [13]    (popl (@ 4 (% ebp)))
  [16]    (pushl (% arg_y))
  [17]    (pushl (% arg_z))
  [18]    (jmpl L168)
L23
  [23]    (movl (@ -8 (% ebp)) (% arg_y))
  [26]    (movl (@ -20 (% ebp)) (% arg_z))
  [29]    (movl (% arg_z) (% imm0))
  [31]    (movl (@ -2 (% arg_y) (% imm0)) (% imm0))
  [35]    (nop)
  [36]    (leal (@ 0 (% arg_y)) (% arg_y))
  [40]    (calll (@ .SPMAKEU32))
  [47]    (recover-fn)
  [52]    (movl (% arg_z) (% temp1))
  [54]    (movl ($ 3) (% imm0))
  [59]    (testl (% imm0) (% temp1))
  [61]    (movl (% temp1) (% imm0))
  [63]    (jne L70)
  [65]    (sarl ($ 2) (% imm0))
  [68]    (jmp L116)
L70
  [70]    (andl ($ 3) (% imm0))
  [73]    (cmpl ($ 2) (% imm0))
  [76]    (jne L194)
  [78]    (movl (@ -6 (% temp1)) (% imm0))
  [81]    (cmpl ($ 519) (% imm0))
  [87]    (je L106)
  [89]    (cmpl ($ 263) (% imm0))
  [95]    (jne L194)
  [97]    (movl (@ -2 (% temp1)) (% imm0))
  [100]   (testl (% imm0) (% imm0))
  [102]   (js L194)
  [104]   (jmp L116)
L106
  [106]   (movl (@ 2 (% temp1)) (% imm0))
  [109]   (testl (% imm0) (% imm0))
  [111]   (jne L194)
  [113]   (movl (@ -2 (% temp1)) (% imm0))
L116
  [116]   (movl (@ -12 (% ebp)) (% arg_y))
  [119]   (movl (@ -4 (% ebp)) (% temp0))
  [122]   (btrl ($ 2) (@ (% fs) 8))
  [132]   (movl (% arg_y) (% temp1))
  [134]   (movl (% imm0) (@ -2 (% temp0) (% temp1)))
  [138]   (xorl (% temp1) (% temp1))
  [140]   (btsl ($ 2) (@ (% fs) 8))
  [150]   (movl (@ -12 (% ebp)) (% arg_z))
  [153]   (addl ($ 4) (% arg_z))
  [156]   (movl (% arg_z) (@ -12 (% ebp)))
  [159]   (movl (@ -20 (% ebp)) (% arg_z))
  [162]   (addl ($ 4) (% arg_z))
  [165]   (movl (% arg_z) (@ -20 (% ebp)))
L168
  [168]   (movl (@ -12 (% ebp)) (% arg_y))
  [171]   (movl (@ -16 (% ebp)) (% arg_z))
  [174]   (cmpl (% arg_z) (% arg_y))
  [176]   (jl L23)
  [182]   (movl (@ -4 (% ebp)) (% arg_z))
  [185]   (leavel)
  [186]   (retl)
L194
  [194]   (uuo-error-reg-not-type (% temp1) ($ 156))

Hmmm.  I'm not exceptionally adroit at reading x86 code (pretty fugly
stuff), but comparing this to the signed case seems to indicate that
while there is more overall generated code, the actual instructions
executed in a "non exceptional" run is maybe the same or even a few less
than the signed case.  Timing the two variants supports this, with the
unsigned case being slightly faster.  Not by much, but consistently
slightly better.

> In x86-64 CCL, the inner part of a loop that copies between two
> vectors of type (SIMPLE-ARRAY (UNSIGNED-BYTE 64) (*)) A and B looks
> like:
> 
> ;;; (aref b i)
> L29
>    [29]    (movq (@ -5 (% save2) (% save0)) (% imm0))
> 
> ;;; (setf (aref a i) (aref b i))
>    [34]    (movq (% imm0) (@ -5 (% save1) (% save0)))
> 
> which is fairly reasonable; the analogous case with vectors

Quite.  That looks very good.

> of element type (SIGNED-BYTE 64) is considerably less so: there's
> some completely unnecesary boxing and unboxing between those two
> instructions.  I'd expect the x86-32 code be roughly equivalent
> to the code above in the (UNSIGNED-BYTE 32) case and don't know
> why it isn't.  Matt's goofing off rather than working on a Friday
> night, so we'll have to wait for the answer.)

Interesting.  As you can see from the generated code, the unsigned case
is very similar to the signed case.  It would be very nice if it were
closer to the x86-64 code.  But then, should probably be using 64 bit
machines anyway.

> Since FIXNUM and (SIGNED-BYTE 32) aren't the same thing, I wouldn't
> expect the results to be the same ... Is the result corrupted/bogus
> in some other way ?

OK, but since the only "operations" here are just load and store, I
would have thought the fixnum case would have simply moved, unchanged,
four bytes per whack.  It seems as if the value is changed either on
load or store.

> > If you change word-type-spec to (UNsigned-byte 32), this also works, but
> > generates quite a bit more code than the signed version.  I think it was
> > about another 50 bytes or so in the code vector.
> >
> 
> It shouldn't be doing boxing/unboxing in either case, but code to handle
> an intermediate BIGNUM that may have been unnecessarily created is more
> complicated if the integer's unsigned.

I don't see where bignums show up in the code (unless that is what
the .SPMAKES32/.SPMAKEU32 are about).  So, is the boxing/unboxing you
mention the business about maintaining integrity of whether a "node" or
immediate is in a given register?  Or am I still missing something?

> > In Allegro, all the lying here pretty much does exactly what you (or at
> > least I) would expect.  No calls to anything, and just a handful of movl
> > instructions.  And, of course, it also works.  Actually here's what they
> > generate:
> 
> I remember things like "polling for interrupts on function entry".
> (Not fondly, but I remember them.)  Putting arbitrary values in arbitrary
> registers at arbitrary times because of a belief that the GC can only
> occur on certain instruction boundaries is something that I hope to have
> completely forgotten.

And just to drive the point home with further insult:  those system
interrupt checks must be _very_ expensive.  Despite the much tighter
loop body here, the ACL version is actually about 3.5 times _slower_
than the CCL variants (either signed or unsigned cases).  Ha!

> There are two issues here:
> 
> 1) CCL's support for dealing with immediate word-sized integers (that aren't
>     known to be FIXNUMs) is poor in general.  It can be improved, but:
> 2) There are and always will be constraints on register usage in CCL (imposed
>     by the combination of native threads and precise GC) that will likely
>     limit some of those improvements.  (If there were a choice where all other
>     things were equal and the options were between generating better code for
>     unboxed arithmetic and reliable support for precise GC/native threads -
>     and there were no way to have both - then I can't see that there's a real
>     choice here.)

Understood now, and I agree about the choice here.

> In your example, we're a ways from the kind of choice described above: it's
> certainly -possible- to avoid the boxing here without violating GC constraints.
> It may not be possible to avoid some kind of runtime manipulation of the
> node-regs-mask on x8632 (or GC or other costs associated with other schemes:
> any sort of dynamic register partitioning scheme is going to cost -something-
> somewhere.)  In this case, if the CCL code avoided the boxing but needed
> to do a lot of dynamic register partitioning on x8632, I wouldn't feel too bad
> about that.

OTOH, 32 bit machines are likely going the way of 16 bit machines for
any "serious" applications!  So, maybe the right view here is something
like, "yeah, the x8632 code isn't quite as good as it could be, but who
cares".

/Jon