[Openmcl-devel] Why does this "cheat"/"lie" not work?...

Gary Byers gb at clozure.com
Fri Feb 5 17:48:18 PST 2010



On Fri, 5 Feb 2010, Jon S. Anthony wrote:

> Hi,
>
> I'm trying to cheat a bit in some low level sequence copying.  The
> sequences are at base simple arrays of unsigned byte 8.  But in copying
> them around I have the following which copies them in "word" size
> chunks, where word-size is machine dependent (basically 32 or 64):
>
> (defmacro word-type-spec ()
>  `(signed-byte ,+word-size+))
>
> (defmacro the-wda (x)
>  `(the (simple-array (signed-byte ,+word-size+) (*)) ,x))
>
> (defmacro wdref (arr index)
>  `(the (signed-byte ,+word-size+)
>     (aref (the-wda ,arr) (the fixnum ,index))))
>
>
> (defun replace-int-vec (vec1 vec2 start1 end1 start2)
>  (declare (optimize (speed 3) (safety 0) (space 0))
>           (type fixnum start1 end1 start2)
>           (type (simple-array (word-type-spec) (*)) vec1)
>           (type (simple-array (word-type-spec) (*)) vec2))
>  (while (fix< start1 end1)
>    (setf (wdref vec1 start1) (wdref vec2 start2)
>          start1 (fix+ start1 1)
>          start2 (fix+ start2 1)))
>  vec1)
>
> fix+, fix<, are just what you think and not of issue.
>
> This seems to work, but generates some "odd" (to me) code.  Also
> generates quite a bit more than expected:
>
> ? (disassemble (symbol-function 'replace-int-vec))
>  [0]     (recover-fn)
>  [5]     (movl (% ebp) (@ 16 (% esp)))
>  [9]     (leal (@ 16 (% esp)) (% ebp))
>  [13]    (popl (@ 4 (% ebp)))
>  [16]    (pushl (% arg_y))
>  [17]    (pushl (% arg_z))
>  [18]    (jmp L133)
> L20
>  [20]    (movl (@ -8 (% ebp)) (% arg_y))
>  [23]    (movl (@ -20 (% ebp)) (% arg_z))
>  [26]    (movl (% arg_z) (% imm0))
>  [28]    (movl (@ -2 (% arg_y) (% imm0)) (% imm0))
>  [32]    (calll (@ .SPMAKES32))
>  [39]    (recover-fn)
>  [44]    (movl (% arg_z) (% temp1))
>  [46]    (movl (% temp1) (% imm0))
>  [48]    (sarl ($ 2) (% imm0))
>  [51]    (testl ($ 3) (% temp1))
>  [57]    (je L81)
>  [59]    (movl (% temp1) (% imm0))
>  [61]    (andl ($ 3) (% imm0))
>  [64]    (cmpl ($ 2) (% imm0))
>  [67]    (jne L154)
>  [69]    (cmpl ($ 263) (@ -6 (% temp1)))
>  [76]    (movl (@ -2 (% temp1)) (% imm0))
>  [79]    (jne L154)
> L81
>  [81]    (movl (@ -12 (% ebp)) (% arg_y))
>  [84]    (movl (@ -4 (% ebp)) (% temp0))
>  [87]    (btrl ($ 2) (@ (% fs) 8))
>  [97]    (movl (% arg_y) (% temp1))
>  [99]    (movl (% imm0) (@ -2 (% temp0) (% temp1)))
>  [103]   (xorl (% temp1) (% temp1))
>  [105]   (btsl ($ 2) (@ (% fs) 8))
>  [115]   (movl (@ -12 (% ebp)) (% arg_z))
>  [118]   (addl ($ 4) (% arg_z))
>  [121]   (movl (% arg_z) (@ -12 (% ebp)))
>  [124]   (movl (@ -20 (% ebp)) (% arg_z))
>  [127]   (addl ($ 4) (% arg_z))
>  [130]   (movl (% arg_z) (@ -20 (% ebp)))
> L133
>  [133]   (movl (@ -12 (% ebp)) (% arg_y))
>  [136]   (movl (@ -16 (% ebp)) (% arg_z))
>  [139]   (cmpl (% arg_z) (% arg_y))
>  [141]   (jl L20)
>  [143]   (movl (@ -4 (% ebp)) (% arg_z))
>  [146]   (leavel)
>  [147]   (retl)
> L154
>  [154]   (uuo-error-reg-not-type (% temp1) ($ 157))
>
> [20] through [105] are the "interesting" bits.  There seems to be a fair
> amount of shifting (sarl) and bit testing (testl, andl, compl, btrl,
> btsl, xorl) going on.  I would have thought this chunk of code would
> basically be a handful of movl (four to six or so) and that's it.  I
> mean I think I lied pretty good here (what with (speed 3) (safety 0) and
> loads of type annotation).  Is some (all) of this associated with thread
> issues (conditional store in arrays or some such)?

Matt gave a lightning talk at last year's ILC explaining what the bit-setting
and clearing are all about.

<http://www.thoughtstuff.com/rme/weblog/?p=17>

used to link to his materials from that talk; we changed servers a few
months ago, and that stuff seems to not have been copied over.

A short version is that we have to assume that the GC can run on any
instruction boundary (because of activity in other threads.)  In order
for the GC to be tell whether a register contains pointer contains
a tagged lisp object (a "node" in the connectivity graph) or just
some random immediate value, some conventions have to be adopted
and very rigorously followed.  (You can be conservative and treat
things that might be nodes as if they were, if you do so carefully,
but the GC can't move things and reliably update references to them
unless the GC can assume that it can precisely distinguish between
nodes and immediates.)  On most architectures, the conventions that
allow this involve partitioning registers into "those that always
contain nodes" and "those that always contain immediates"; on 32-bit
x86, there just aren't enough registers to make a static partitioning
possible, so the partitioning is dynamic: there's a bitmask in thread-
local storage, and the convention is that a register contains a node
if and only if the bit associated with that register is set in the
bitmask.  Transitioning a register from "immediate" to "node" generally
involves ensuring that the register contains a harmless node value 
(rather than random immediate bits), so see things like:

>  [103]   (xorl (% temp1) (% temp1)) ; harmless value = 0
>  [105]   (btsl ($ 2) (@ (% fs) 8))  ; temp1 is a node again.

Matt's paper/slides discuss some other issues.

>
> Also, what is the (calll (@ .SPMAKES32)) for?  Again, given the context
> of all the lying.

Um, "lack of support for immediate operations on signed integers of
the native word size" ?

CCL's support for operations on unboxed integers that fit in a machine
word  in general is poor, but what support exists is oriented towards
unsigned integers (there's no good reason for excluding signed integers;
that support just isn't there.)

In x86-64 CCL, the inner part of a loop that copies between two
vectors of type (SIMPLE-ARRAY (UNSIGNED-BYTE 64) (*)) A and B looks
like:

;;; (aref b i)
L29
   [29]    (movq (@ -5 (% save2) (% save0)) (% imm0))

;;; (setf (aref a i) (aref b i))
   [34]    (movq (% imm0) (@ -5 (% save1) (% save0)))

which is fairly reasonable; the analogous case with vectors
of element type (SIGNED-BYTE 64) is considerably less so: there's
some completely unnecesary boxing and unboxing between those two
instructions.  I'd expect the x86-32 code be roughly equivalent
to the code above in the (UNSIGNED-BYTE 32) case and don't know
why it isn't.  Matt's goofing off rather than working on a Friday
night, so we'll have to wait for the answer.)


>
> If you change word-type-spec to be fixnum (and change accessors
> accordingly) the result is shorter with less testing (though the same
> call is there) but no longer works.  By that I mean it corrupts the
> destination array with bogus content.
>

Since FIXNUM and (SIGNED-BYTE 32) aren't the same thing, I wouldn't
expect the results to be the same ... Is the result corrupted/bogus
in some other way ?


> If you change word-type-spec to (UNsigned-byte 32), this also works, but
> generates quite a bit more code than the signed version.  I think it was
> about another 50 bytes or so in the code vector.
>

It shouldn't be doing boxing/unboxing in either case, but code to handle
an intermediate BIGNUM that may have been unnecessarily created is more
complicated if the integer's unsigned.

> In Allegro, all the lying here pretty much does exactly what you (or at
> least I) would expect.  No calls to anything, and just a handful of movl
> instructions.  And, of course, it also works.  Actually here's what they
> generate:

I remember things like "polling for interrupts on function entry".
(Not fondly, but I remember them.)  Putting arbitrary values in arbitrary
registers at arbitrary times because of a belief that the GC can only
occur on certain instruction boundaries is something that I hope to have
completely forgotten.

There are two issues here:

1) CCL's support for dealing with immediate word-sized integers (that aren't
    known to be FIXNUMs) is poor in general.  It can be improved, but:
2) There are and always will be constraints on register usage in CCL (imposed
    by the combination of native threads and precise GC) that will likely
    limit some of those improvements.  (If there were a choice where all other
    things were equal and the options were between generating better code for
    unboxed arithmetic and reliable support for precise GC/native threads -
    and there were no way to have both - then I can't see that there's a real
    choice here.)

In your example, we're a ways from the kind of choice described above: it's
certainly -possible- to avoid the boxing here without violating GC constraints.
It may not be possible to avoid some kind of runtime manipulation of the
node-regs-mask on x8632 (or GC or other costs associated with other schemes:
any sort of dynamic register partitioning scheme is going to cost -something-
somewhere.)  In this case, if the CCL code avoided the boxing but needed
to do a lot of dynamic register partitioning on x8632, I wouldn't feel too bad
about that.


>
> (disassemble (symbol-function 'replace-int-vec))
> ;; disassembly of #<Function REPLACE-INT-VEC>
> ;; formals: vec1 vec2 start1 end1 start2
>
> ;; code start: #x48851be4:
>   0: 55          pushl	ebp
>   1: 8b ec       movl	ebp,esp
>   3: 56          pushl	esi
>   4: 83 ec 2c    subl	esp,$44
>   7: 83 f9 05    cmpl	ecx,$5
>  10: 74 02       jz	14
>  12: cd 61       int	$97             ; sys::trap-argerr
>  14: 80 7f cb 00 cmpb	[edi-53],$0     ; sys::c_interrupt-pending
>  18: 74 02       jz	22
>  20: cd 64       int	$100            ; sys::trap-signal-hit
>  22: eb 31       jmp	73
>  24: 89 45 e4    movl	[ebp-28],eax    ; vec1
>  27: 8b 45 18    movl	eax,[ebp+24]
>  30: 8b 5c 02 f6 movl	ebx,[edx+eax-10]
>  34: 8b 45 e4    movl	eax,[ebp-28]    ; vec1
>  37: 8b 4d 10    movl	ecx,[ebp+16]
>  40: 89 5c 08 f6 movl	[eax+ecx-10],ebx
>  44: 8b 5d 10    movl	ebx,[ebp+16]
>  47: 83 c3 04    addl	ebx,$4
>  50: 8b 45 18    movl	eax,[ebp+24]
>  53: 83 c0 04    addl	eax,$4
>  56: 80 7f cb 00 cmpb	[edi-53],$0     ; sys::c_interrupt-pending
>  60: 74 02       jz	64
>  62: cd 64       int	$100            ; sys::trap-signal-hit
>  64: 89 45 18    movl	[ebp+24],eax
>  67: 8b 45 e4    movl	eax,[ebp-28]    ; vec1
>  70: 89 5d 10    movl	[ebp+16],ebx
>  73: 8b 5d 10    movl	ebx,[ebp+16]
>  76: 3b 5d 14    cmpl	ebx,[ebp+20]
>  79: 7c c7       jl	24
>  81: f8          clc
>  82: c9          leave
>  83: 8b 75 fc    movl	esi,[ebp-4]
>  86: c3          ret
>  87: 90          nop
>
> The bits corresponding to [2]-[105] in CCL are 24-40.
>
> Aside from some edification on why the extra code and the call out to
> some routine, I guess I'm wondering if there is any extra lying that
> could get CCL to generate something closer to the ACL code.  Note, this
> ACL is not using native threads...
>
> Thanks in advance!
>
> /Jon
>
>
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel
>
>



More information about the Openmcl-devel mailing list