[Openmcl-devel] Is this a bug?

Wed Jan 25 03:59:56 PST 2012

On Tue, 24 Jan 2012, Ron Garret wrote:

>
> On Jan 24, 2012, at 10:00 PM, Gary Byers wrote:
>
>> I can't reproduce this
>
> Hm, it does seem to be different on the trunk.  I'm encountering the "problem" on the App Store version (1.7-store-r15140).
>
> FWIW, tweaking the code thusly:
>
> (let (p)
>  (setf p (MAKE-POPULATION :INITIAL-CONTENTS (LIST (COPY-LIST '(foo)))))
>  (setf x 1)  ; Clobber volatile registers
>  (gc)
>  (POPULATION-CONTENTS p))
>
> results in the code returning NIL on the app store version as well, which confirms your explanation.
>
> Interestingly, the disassembly of the unmodified version seems to be identical on both the app store version and the trunk.  I say "seems to be" because one of the things that apparently changed between these two versions is the disassembly output format (!) so I can't just run the output through diff, I have to eyeball it.  But it's mostly a moot point.

The difference is in how the volatile registers are used in MAKE-POPULATION
and whatever it calls; this affects their values on return from the call and
that (and what the SETQ of P does) affects their values when the GC is called.

In something very close to the app store version, I used CCL::DBG to enter
the kernel debugger and the kernel debugger's L command to print lisp values
of the registers.  That output included

%rdi (arg_y) = (FOO)

and there's the strong reference that affects what the GC does with the
population's contents.

>
> Thanks for taking the time to write up that detailed explanation!
>
> BTW, I notice a lot of NOPs in the disassembly.  Are those there for a reason?  (Feel free to treat that as a rhetorical question if you don't feel like spending more time on this.)

The NOPs usually precede CALL instructions; in some cases, they're used before JMP
instructions that actually implement CALLs but pass the return address in a register:

   ...
   (lea (@ lab (% fn)) (% temp2))
   [some NOPs]
   (jmp somewhere) ; usually to a kernel "subprimitive" that manipulates
                   ; the stack
lab
   (lea (l0 (% rip)) (% fn)) ; make the FN register point to the start of the function

and the case where CALL can be used is just:

   [some NOPs]
   (call somewhere)
lab
   (lea (l0 (% rip)) (% fn)) ; make the FN register point to the start of the function

This likely isn't exactly the syntax that the disassembler would use, but it's
close.

The disassembler will show the relative (to the start of the function) address
of each instruction.  On x8664, a function's code always starts at an address
whose low 4 bits are #xf; if we add #xf to the relative addresses of the return
addresses following the CALLs, we'll see that those return addresses always have
#x4 or #xc in their low 4 bits.  In other words, return addresses can be identified
and distinguished from other kinds of things on the stack by the values in their
low 4 bits (their "tag bits").  This is desirable in lots of contexts, but it's
really critical to the GC: the GC has to be able to reliably recognize return
addresses and treat them as such.  (A pointer "into" a function may be the only
thing that references an anonymous function and it's critical that the function
be retained if such a reference exists; the GC may want to move a function around
in memory, and it's critical that return addresses associated with the function
move with it.)
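
The tag-bit arithmetic described above can be checked by hand or with a few
lines of portable Common Lisp. This is only a sketch of the arithmetic: the
names *FUNCTION-START-TAG*, ABSOLUTE-LOW-BITS and RETURN-ADDRESS-TAG-P are
made up for illustration and are not part of CCL. The argument is assumed to
be the relative address that the disassembler prints for the instruction
following a CALL.

```lisp
;;; Hypothetical helpers; only the arithmetic comes from the text above.

(defparameter *function-start-tag* #xf
  "Low 4 bits of the address at which an x8664 function's code starts.")

(defun absolute-low-bits (relative-address)
  "Low 4 bits of the absolute address that corresponds to RELATIVE-ADDRESS."
  (logand (+ *function-start-tag* relative-address) #xf))

(defun return-address-tag-p (relative-address)
  "True if the absolute address would have #x4 or #xc in its low 4 bits."
  (and (member (absolute-low-bits relative-address) '(#x4 #xc)) t))
```

For example, (ABSOLUTE-LOW-BITS 5) is #x4 and (ABSOLUTE-LOW-BITS 13) is #xc,
so a return address at relative offset 5, 13, 21, ... carries the expected
tag, while one at relative offset 8 does not.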

So, the NOPs that precede various flavors of CALL instructions exist to ensure
that return addresses are properly aligned/tagged.  (On x86, CALL pushes the
address of the next instruction on the stack.)  The x86 assembler has a
directive that performs this "tail alignment": ensures that an instruction
is preceded by enough NOPs to ensure that the instruction -ends- on a specified
alignment boundary.  The NOPs are sometimes preceded by instruction prefixes
that the disassembler doesn't show; they obviously take up space in general
and do so in the instruction cache as well as in memory, but AFAIK their
actual execution cost is very low.
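
As a rough illustration of what "tail alignment" has to compute, here is a
sketch in portable Common Lisp. It is not the assembler's actual directive or
code; TAIL-ALIGNMENT-PADDING is a hypothetical name. It assumes that function
code starts at an address whose low 4 bits are #xf, so a return address gets
#x4 or #xc in its low 4 bits exactly when its relative address is congruent
to 5 modulo 8.

```lisp
;;; Hypothetical helper, not CCL's implementation.

(defun tail-alignment-padding (offset instruction-length)
  "Number of pad bytes to emit at relative OFFSET so that an instruction of
INSTRUCTION-LENGTH bytes placed after the padding ends at a relative address
congruent to 5 modulo 8."
  ;; MOD with a positive divisor never returns a negative result in
  ;; Common Lisp, so the answer is always between 0 and 7.
  (mod (- 5 (+ offset instruction-length)) 8))
```

With a 5-byte CALL at relative offset 10, (TAIL-ALIGNMENT-PADDING 10 5) is 6:
the CALL then occupies offsets 16 through 20 and the return address is at
relative offset 21, i.e. low 4 bits #x4 once #xf is added.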

True Fact And Confession:  Early versions of CCL/OpenMCL for x8664 didn't do
this tail-alignment of CALLs and in fact avoided the use of CALL instructions
altogether; a function call was compiled as something like:

    ...
    (lea (@ lab (% fn)) (% temp2))
    (jmp somewhere)
    (:align 3) ; align PC to 8-byte boundary
    (:long offset-back-to-start-of-function)
lab
    ;;; IIRC, there was actually an LEA that reestablished the FN
    ;;; register here much as there is now.

Andi Kleen (a Linux kernel developer and someone who'd used OpenMCL for several
years at that point) wrote to me that avoiding the use of CALL like this had
serious performance problems.  On modern x86 hardware, there are history and
prediction mechanisms that try to match CALL and RET instructions so that
the effect of RET on the instruction pipeline is greatly reduced.  (This is
somewhat similar to how branch history and prediction mechanisms can reduce
the cost of conditional branches.)  Andi pointed out that avoiding CALL gave
these mechanisms no chance to kick in and made pipeline stalls much more common.

I expressed skepticism; Andi generated test cases and measured their performance
in the oprofile profiler; that output indeed showed the negative effects that Andi
had predicted.  We discussed the reason for my having avoided the use of CALL,
and I'm fairly sure that the idea of tail-aligning CALL instructions was something
that Andi suggested.  I changed the implementation and many things got notably
faster.

When it came time to describe this change in release notes, I realized that
I wasn't sure what pronoun to use:  I was familiar with "Andi" (with that
spelling) as a woman's name but I wasn't really sure if that applied here.
I didn't know if Andi was a man or a woman; it wasn't really relevant, but
I wasn't really sure how to refer to them when acknowledging their help.
I realized that I could just ask, but I never got around to doing that and
I don't think that I ever wrote anything for public consumption that explained
why CALL instructions were often preceded by NOPs.

It's about 5 years late, but now that I've finally written that explanation
(and Googled a bit), I want to thank Mr. Andi Kleen for making me aware of
the problem, convincing me of its severity, and suggesting viable workarounds.
Had he not done so, it might have taken a long time for me to recognize the
issue (and that might not have happened at all).

>
> rg
>
>