[Openmcl-devel] Is this a bug?

Gary Byers gb at clozure.com
Tue Jan 24 22:00:28 PST 2012

You're missing (or not considering) something.

I can't reproduce this (though it'd be entirely believable if someone
else can).  The GC won't remove something from a POPULATION's contents
list if there's a strong reference to that thing from somewhere.  In
the case where P is lexically bound, between the point where we return
from MAKE-POPULATION and the time that the GC is invoked we probably
execute a single instruction (a MOV to the stack location or register
used for P); in the case where P is freely referenced (and no two
people can agree on what P "obviously" is, though many people have
opinions on this), whatever happens is more complicated.  (Part of
what happens may involve storing a "young" object - the
freshly-allocated POPULATION - into an "old" object - the value cell
of P - and this sort of intergenerational reference has to be detected
and handled carefully to support EGC constraints.)  Whatever exactly
happens likely involves several instructions, and those instructions
will use several volatile registers (clobbering whatever values were
in those registers on return from MAKE-POPULATION).  When all of that
register-clobbering is finished, the GC is invoked; it finds no strong
reference to the list in this case and removes the list from the
POPULATION's contents.

In other words: when the values that're in the volatile registers on
return from MAKE-POPULATION are clobbered (replaced with values having
to do with enforcing the EGC write barrier, or determining whether or
not P is dynamically bound in the current thread, or whatever happens
in the not-just-a-simple-lexical-SETQ case), the GC finds no strong
reference to the object on the POPULATION's contents.  When those
volatile registers (mostly) have the same values on entry to the GC as
they had on return from MAKE-POPULATION, it does find one or more strong
references.  That pretty much indicates that any strong references that
cause the object to be retained are coming from the volatile registers.

The GC pretty much has to assume that any register value that a thread
could reference will be referenced.  (That's a conservative
assumption, but the term "conservative GC" means something a bit
different.)  This generally means that any invocation of the GC will
retain everything that will be referenced as well as some things that
won't be but look like they could be.  Things that're retained because
of this conservative assumption will almost always be unreferenced by
the next GC (when some other things will be retained instead), and unless the
compiler zeroes out registers whenever "it's through with them" or
annotates the machine code to say "at relative PC 17 in the current
function, the value in R9 has ceased to be interesting.  Oops, at
relative PC 21, it's interesting again ..." there's always going to be
some fuzziness involved.  (I think that CCL's actually fairly good about
minimizing some sources of fuzziness, but I don't think that it's practical
to try to completely eliminate it.)

So: the GC will remove things that're only weakly referenced, but there
can be non-obvious (and usually short-lived) strong references to things
- usually from volatile machine registers - that can cause the GC to have
a different view of when this should happen than you do.
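
As a rough illustration of the strong-vs-weak distinction (a sketch only,
assuming Clozure CL and that MAKE-POPULATION, POPULATION-CONTENTS and GC
are reachable as exported symbols of the CCL package; the variable and
function names are made up for this example):

```lisp
;; Hypothetical names: *STRONG*, *POP* and POPULATION-SIZE are invented here.
(defvar *strong* (list 'foo))          ; an obvious strong reference to (FOO)
(defvar *pop*
  (ccl:make-population :type :list
                       :initial-contents (list *strong*)))

(defun population-size ()
  "Return how many objects are still in *POP*.  Returning a count rather
than the contents avoids handing the objects to the REPL, whose *, ** and
*** variables would themselves be strong references."
  (length (ccl:population-contents *pop*)))

(ccl:gc)
(population-size)    ; (FOO) is still strongly referenced via *STRONG*

(setf *strong* nil)  ; drop the obvious strong reference
(ccl:gc)
(population-size)    ; usually 0 now, but a stale register value may delay that
```

The first call is the predictable case; the second is the one where the
fuzziness described above can show up.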

There have been a lot of changes to how the compiler uses volatile registers
since 1.7.  As I said, I couldn't reproduce this.

The function CCL::DBG takes a single optional argument and enters the kernel
debugger; ordinarily, that argument is returned when the debugger is exited.
If I changed your code (either version) to do:

... (setf p (ccl::dbg (make-population ...))) ...

then I could use the kernel debugger's "l" command to see lisp
register values at the point where the call to MAKE-POPULATION has
returned.  In my case (on a couple of machines running a recent trunk
version of the x8664 port) the list that's the value of the population's
contents was in the temp0 register.  The calling convention says that
the symbol used to call a global function (GC in this case) should be
in temp0, so this particular strong reference gets clobbered on the way
to the GC (and the GC removes the weak reference from the population).
A slightly different version of the compiler might have used another
volatile register (for lots of complicated reasons) and that register
might not have been clobbered by the time the GC actually runs.

This is a very long, overly detailed explanation for something that I
can't reproduce, but I think that it's a likely explanation.  I suppose that
it instead could be an outright GC bug, but GC bugs usually involve some
amount of pyrotechnics.

(If you do use CCL::DBG to look at things, the L command will identify
the registers that I'm calling "volatile" with names that start with "arg_"
or "temp".)
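
Spelled out a little more, the CCL::DBG experiment might look something
like this.  It's only a sketch: the function name and the population's
initial contents are assumptions (the contents are picked to match the
((FOO)) result quoted below), and CCL::DBG is an internal function, so
details may differ between versions.

```lisp
;; Hypothetical wrapper; INSPECT-AFTER-MAKE-POPULATION is not part of CCL.
(defun inspect-after-make-population ()
  (let (p)
    ;; CCL::DBG enters the kernel debugger right after MAKE-POPULATION
    ;; returns; when the debugger is exited it ordinarily returns its
    ;; argument, so P ends up with the same value as in the original code.
    (setf p (ccl::dbg
             (ccl:make-population :type :list
                                  :initial-contents (list (list 'foo)))))
    (ccl:gc)
    (ccl:population-contents p)))
```

Calling (inspect-after-make-population) drops into the kernel debugger;
the "l" command there shows which lisp registers, if any, still hold the
contents list at that point.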

On Tue, 24 Jan 2012, Ron Garret wrote:

> (progn
>  (gc)
> Returns NIL as expected, but:
> (let (p)
>  (gc)
> returns ((FOO)).  Is this a bug or am I missing something?
> rg
