[Openmcl-devel] Why does this "cheat"/"lie" not work?...

Mon Feb 8 17:44:18 PST 2010

On Feb 5, 2010, at 8:48 PM, Gary Byers wrote:

> On Fri, 5 Feb 2010, Jon S. Anthony wrote:
> 
>> L20
>> [20]    (movl (@ -8 (% ebp)) (% arg_y))
>> [23]    (movl (@ -20 (% ebp)) (% arg_z))
>> [26]    (movl (% arg_z) (% imm0))
>> [28]    (movl (@ -2 (% arg_y) (% imm0)) (% imm0))
>> [32]    (calll (@ .SPMAKES32))
>> [39]    (recover-fn)
>> [44]    (movl (% arg_z) (% temp1))
>> [46]    (movl (% temp1) (% imm0))
>> [48]    (sarl ($ 2) (% imm0))
>> [51]    (testl ($ 3) (% temp1))
>> [57]    (je L81)
>> [59]    (movl (% temp1) (% imm0))
>> [61]    (andl ($ 3) (% imm0))
>> [64]    (cmpl ($ 2) (% imm0))
>> [67]    (jne L154)
>> [69]    (cmpl ($ 263) (@ -6 (% temp1)))
>> [76]    (movl (@ -2 (% temp1)) (% imm0))
>> [79]    (jne L154)
>> L81
>> [81]    (movl (@ -12 (% ebp)) (% arg_y))
>> [84]    (movl (@ -4 (% ebp)) (% temp0))
>> [87]    (btrl ($ 2) (@ (% fs) 8))
>> [97]    (movl (% arg_y) (% temp1))
>> [99]    (movl (% imm0) (@ -2 (% temp0) (% temp1)))
>> [103]   (xorl (% temp1) (% temp1))
>> [105]   (btsl ($ 2) (@ (% fs) 8))
>> [115]   (movl (@ -12 (% ebp)) (% arg_z))
>> [118]   (addl ($ 4) (% arg_z))
>> [121]   (movl (% arg_z) (@ -12 (% ebp)))
>> [124]   (movl (@ -20 (% ebp)) (% arg_z))
>> [127]   (addl ($ 4) (% arg_z))
>> [130]   (movl (% arg_z) (@ -20 (% ebp)))
>> L133
>> [133]   (movl (@ -12 (% ebp)) (% arg_y))
>> [136]   (movl (@ -16 (% ebp)) (% arg_z))
>> [139]   (cmpl (% arg_z) (% arg_y))
>> [141]   (jl L20)
>> 
>> [20] through [105] are the "interesting" bits.  There seems to be a fair
>> amount of shifting (sarl) and bit testing (testl, andl, cmpl, btrl,
>> btsl, xorl) going on.  I would have thought this chunk of code would
>> basically be a handful of movl (four to six or so) and that's it.  I
>> mean I think I lied pretty good here (what with (speed 3) (safety 0) and
>> loads of type annotation).  Is some (all) of this associated with thread
>> issues (conditional store in arrays or some such)?
> 
> Matt gave a lightning talk at last year's ILC explaining what the bit-setting
> and clearing are all about.
> 
> <http://www.thoughtstuff.com/rme/weblog/?p=17>
> 
> used to link to his materials from that talk; we changed servers a few
> months ago, and that stuff seems to not have been copied over.

I copied over that information this past weekend, so it should now be available at http://www.clozure.com/~rme/

>> Also, what is the (calll (@ .SPMAKES32)) for?  Again, given the context
>> of all the lying.
> 
> Um, "lack of support for immediate operations on signed integers of
> the native word size" ?
> 
> CCL's support for operations on unboxed integers that fit in a machine
> word in general is poor, but what support exists is oriented towards
> unsigned integers (there's no good reason for excluding signed integers;
> that support just isn't there.)
> 
> In x86-64 CCL, the inner part of a loop that copies between two
> vectors of type (SIMPLE-ARRAY (UNSIGNED-BYTE 64) (*)) A and B looks
> like:
> 
> ;;; (aref b i)
> L29
>   [29]    (movq (@ -5 (% save2) (% save0)) (% imm0))
> 
> ;;; (setf (aref a i) (aref b i))
>   [34]    (movq (% imm0) (@ -5 (% save1) (% save0)))
> 
> which is fairly reasonable; the analogous case with vectors
> of element type (SIGNED-BYTE 64) is considerably less so: there's
> some completely unnecessary boxing and unboxing between those two
> instructions.  I'd expect the x86-32 code to be roughly equivalent
> to the code above in the (UNSIGNED-BYTE 32) case and don't know
> why it isn't.  (Matt's goofing off rather than working on a Friday
> night, so we'll have to wait for the answer.)

The x86-64 version of CCL existed before the 32-bit x86 version, so I was adding 32-bit support to an existing 64-bit backend rather than the (probably more usual) other way around.

There is some compiler support for dealing with elements of type (unsigned-byte 64), which is the native word size on x86-64.  The 64-bit backend doesn't do anything special with (unsigned-byte 32) elements---it just boxes the result (which is fairly cheap, since we know it will fit in a fixnum).  The code that dealt with (unsigned-byte 32) elements "just worked" on 32-bit systems too (although on 32-bit x86, the boxing isn't necessarily cheap---see the calls to .SPmakeu32).  So, there's no deep reason for all the boxing and unboxing on x8632.  What can I say?  Bad hacker, no cookie.
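
The fixnum remark can be checked at the REPL.  What follows is a small sketch, not code from the CCL sources; the name u32-fits-in-fixnum-p is made up for illustration.  It uses only standard Common Lisp, and the answer depends on the fixnum width of whatever Lisp it runs in.

```lisp
;; Hypothetical helper, for illustration only: does the largest
;; (unsigned-byte 32) value fit in a fixnum in this Lisp image?
(defun u32-fits-in-fixnum-p ()
  (typep (1- (expt 2 32)) 'fixnum))

;; Print the fixnum width (magnitude bits plus the sign bit)
;; next to the answer.
(format t "~&fixnum bits: ~D, u32 fits: ~A~%"
        (1+ (integer-length most-positive-fixnum))
        (u32-fits-in-fixnum-p))
```

Where this reports that the value fits, boxing an (unsigned-byte 32) element never needs heap allocation.  Where it does not, the largest element values cannot be fixnums, which is why the boxing can be expensive there.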

We can do a little better:

;;; using a slightly patched compiler

(defun copy-u32-vector (src dest)
  (declare (type (simple-array (unsigned-byte 32)) src dest)
           (optimize (speed 3) (safety 0)))
  (dotimes (i (length dest))
    (setf (aref dest i) (aref src i))))

;;; (aref src i)
L33
  [33]    (movl (@ -4 (% ebp)) (% arg_y))
  [36]    (movl (@ -16 (% ebp)) (% arg_z))
  [39]    (movl (% arg_z) (% imm0))
  [41]    (movl (@ -2 (% arg_y) (% imm0)) (% imm0))

;;; (setf (aref dest i) (aref src i))
  [45]    (movl (@ -16 (% ebp)) (% arg_y))
  [48]    (movl (@ -8 (% ebp)) (% temp0))
  [51]    (btrl ($ 2) (@ (% fs) 8))
  [61]    (movl (% arg_y) (% temp1))
  [63]    (movl (% imm0) (@ -2 (% temp0) (% temp1)))
  [67]    (xorl (% temp1) (% temp1))
  [69]    (btsl ($ 2) (@ (% fs) 8))

;;; (dotimes (i (length dest)) (setf (aref dest i) (aref src i)))
  [79]    (movl (@ -16 (% ebp)) (% arg_z))
  [82]    (addl ($ 4) (% arg_z))
  [85]    (movl (% arg_z) (@ -16 (% ebp)))
L88
  [88]    (movl (@ -16 (% ebp)) (% arg_y))
  [91]    (movl (@ -12 (% ebp)) (% arg_z))
  [94]    (cmpl (% arg_z) (% arg_y))
  [96]    (jl L33)

There's a lot of stack traffic (can we have some registers here, please?), and we still end up doing the mark-as-imm/mark-as-node dance on %temp1 (the btrl/btsl pair around the store) within the loop, but it's less dreadful than the original.

I'll clean these changes up and commit them soon.
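
To get a listing like the one above for comparison, something along these lines should work in any Common Lisp.  It is a sketch, not the patched compiler: the name copy-u32-vector-demo, the explicit (*) dimension spec, and the returned dest are additions for this example, and the disassembly will differ by implementation, version, and platform.

```lisp
;; Illustrative variant of the copy loop; not the code that was committed.
(defun copy-u32-vector-demo (src dest)
  (declare (type (simple-array (unsigned-byte 32) (*)) src dest)
           (optimize (speed 3) (safety 0)))
  (dotimes (i (length dest))
    (setf (aref dest i) (aref src i)))
  dest)

;; Quick sanity check before trusting (safety 0): copy a few values,
;; including the largest (unsigned-byte 32), and print the result.
(let ((src (make-array 4 :element-type '(unsigned-byte 32)
                         :initial-contents '(1 2 3 4294967295)))
      (dest (make-array 4 :element-type '(unsigned-byte 32)
                          :initial-element 0)))
  (copy-u32-vector-demo src dest)
  (format t "~&~S~%" dest))

;; Print the compiled code for the loop.
(disassemble 'copy-u32-vector-demo)
```

Since the function is compiled with (safety 0), both arguments have to really be vectors of that element type; passing anything else is undefined behavior.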