[Openmcl-devel] apple invents tagging?

Gary Byers gb at clozure.com
Mon Jul 25 01:59:34 PDT 2011

Now It Can Be Told!  (Perhaps more concisely than this.)

Many lisp implementations use the low few bits of a machine word to indicate
at least something about a lisp object's type; this generally works because
heap-allocated objects are aligned in memory in such a way that the low few
bits of the object's address are basically redundant.

If a CONS cell is two words, then a 32-bit implementation might plausibly
decide that all heap-allocated lisp objects should be aligned on 8-byte
boundaries.  A CONS cell allocated at address #x100000 can be identified
by any of the values #x100000, #x100001, #x100002, ... #x100007, at least.
If we only need the top 29 bits to identify the address of an aligned
heap-allocated object such as a CONS, we can use different values of the
low 3 bits to indicate the object's (primitive) type.  We obviously can't
encode all lisp types this way, but we can encode a few of the most important
ones (and use some other patterns to say "more type information is stored
in memory.")  The "low few bits" in a scheme like this are sometimes called
"tag bits", but it may be more appropriate to call them "low tag bits".
(Schemes that encode type information in the high bits of a pointer are
probably less common; they used to depend on the hardware not decoding the
high bits of addresses - why would anyone ever want more than 16MB of memory?)

A good story (which may or may not be true, and apologies to those
named if it isn't) is that JonL White was reading a VAX manual while
considering how to implement VAX NIL ("New Implementation of Lisp", a
CL precursor) and made some comment to the effect that a high-tagging
scheme seemed impractical on the VAX; RMS allegedly overheard, thought
for a very short time, and basically invented the low-tagging scheme
described above on the spot.  Whether that story's true (or whether I've
garbled some of the details) or not, I think that it's true that low-tagging
schemes have been used since at least the early days of the NIL project (mid-
to-late 1970s, I think.)

There are all kinds of variations in how concrete low-tag values map to
primitive types; these variations often attempt to exploit machine idioms
or comply with machine restrictions.  (CCL uses schemes that're at least
slightly different on each of the 5 architectures that it runs on.)  One
very common approach is to use tag values whose low few bits are 0 to represent
FIXNUMs (so that FIXNUMs are essentially integer values shifted left a few bits);
this scheme allows common operations like addition/subtraction to be performed
on the boxed FIXNUM objects themselves.  Some other tag values might be used
to represent other kinds of immediate (non-heap-allocated) objects (CHARACTERs,
for example).

Suppose in a 32-bit implementation CONS cells had a tag of 1 and that
the CAR preceded the CDR in memory.  If P is a tagged pointer to a
CONS (with that tag of 1), then (CAR P) references P-1 and (CDR P)
references P+3.  Most architectures support addressing modes
consisting of a register +/- a signed integer displacement, so this
also means that we don't have to "mask out the tag bits" in order to
use a tagged pointer directly.  (I don't think that the Itanium
supports a register+displacement addressing mode, and the Motorola
88000 - which was the intended successor to the 68000 before Apple
switched to the PPC - originally only had "register+unsigned
displacement".  Another one of those May Or May Not Be True Stories is
that Franz supposedly convinced Motorola to change this; if this is true,
it suggests that no one thought of having a tagged pointer point a few
bytes before an object instead of a few bytes after it ...)

Anyway, back to 2011 and Lion ...

Historically, an ObjC object has been a pointer (allocated as if by #_malloc
and with enough of its low bits clear that vector registers can be stored
at natural alignment relative to that pointer, e.g., on a 16-byte memory
boundary) whose first word is a pointer to the object's class; subsequent
words may be concrete representations of/abstract info about the object's
instance variables.  This means that an instance of the NSNumber class
representing an "int" with value 1 might look something like:

pointer to aligned memory ->  class pointer, some subclass of NSNumber
                               maybe some Core Foundation info
                               some bits indicating that the value is an int
                               the integer 1
                               <padding for alignment, possible malloc overhead>

That's simple and consistent, but if this kind of "simple NSNumber" is
heavily used, the costs of allocating/deallocating/dereferencing these
things just to access a 32-bit value might be pretty high.  (Compare
this to the effort that lisp implementations put into trying to make
operations on FIXNUMs as cheap as they can, either by representing
them as immediate objects or by trying to avoid naively allocating
them in memory.)

In at least the 64-bit ObjC runtime on Lion, some instances of
NSNumber are represented as immediate objects with some of their low
bits (tag bits) having a distinguished non-zero value, a few other
bits used to distinguish the type of the value being represented, and
the remaining bits representing the value.  That isn't a bad idea
(it's at the very least a step in the right direction), but it came as
a bit of a surprise (CCL's ObjC bridge assumed that all ObjC instances
had the canonical heap-allocated representation) and I wish that the
scheme was a little friendlier to introspection (e.g., that there was
a supported way of finding out what tag values mapped to what classes;
as far as I know - just from reverse-engineering a late beta - this
is only used for certain concrete internal subclasses of NSNumber.)
On the one hand, I'd like to see this better documented, but on the
other I have to admit that I really don't want to hear about how
Apple's invented tagging ...

On Sun, 24 Jul 2011, Jason E. Aten wrote:

> 'Apple Invents Tagging' has been mentioned a couple of times now, so I'm
> curious--
> By tagging, do you mean providing runtime type information in the first few
> bytes of an object as laid out in memory? Or something else completely?
> Thanks!
> Jason
> On Sun, Jul 24, 2011 at 1:07 AM, Gary Byers <gb at clozure.com> wrote:
>       surprise but is intentional and isn't a bad idea: "Apple
>       Invents Tagging!").
