[Openmcl-devel] Address-based hashing considered harmful. Maybe.

Gary Byers gb at clozure.com
Sun Mar 2 22:16:28 PST 2003



On Sat, 1 Mar 2003, Stonewall Ballard wrote:

> On 3/1/03 8:06 AM, "Gary Byers" <gb at clozure.com> wrote:
>
> >...
> > That seems to leave arrays and cons cells as objects without handy
> > immutable attributes that can be used to derive good hash values.
> > If we were to get rid of the concept of address-based hashing
> > in OpenMCL and chose not to add "unique ID" fields to cons cells
> > and arrays, EQ-based hash lookup on those types of keys would
> > degenerate to linear search.
>
> At least in the case of cons cells, how hard would it be to allocate them in
> their own zone and never move them? Since they're fixed-size, the only
> reason to move them is so that a scavenger can work uniformly on all
> objects.
>

There are other reasons to want to move things (improving locality and
paging behavior where relevant)  Compacting or copying (making allocated
objects contiguous) has the side effect of making free space contiguous
and makes allocation much simpler.

(Compaction/copying have costs as well, of course.)

> Arrays could have the extra ID field without that much overhead, I expect.

It doesn't sound like the space overhead would be too bad in the vector
case.

Adding an extra word to every cons cell means adding two words (because
of alignment constraints).  Doubling the size of every cons cell seems
like a heavy price to pay.

I'd asked if people used conses/strings/arrays as keys in EQ hash tables;
I got one answer in the affirmative, but could have answered my own
question.  COMPILE-FILE keeps all constants it encounters in an EQ hash
table, the *PRINT-CIRCLE* mechanism uses a hash table ... so I guess that
this isn't quite so esoteric after all.

>
> What if you marked hash tables that contain cons cells or arrays, and
> rehashed them as you do now only in those cases? "Pay as you go".

That's pretty much what happens now: there's a bit in the hash table
that (conservatively) indicates whether or not address-hashed keys have
been used.  Since address-based hashing has been used pervasively in
MCL and OpenMCL (in the belief that it's cheap), that bit's almost
set.  Functions like GETHASH start out by inhibiting the GC and worrying
about the need to rehash due to address-based key movement, and that's
not that unreasonable an assumption in the current state of the world.

If address-based hashing was the exception rather than the rule, a lot
of important cases (hash tables used by CLOS and the compiler) could be
made simpler and faster; I'm not sure that synchronization issues in fact
become simpler, but they certainly seem easier to think about.

At the moment, that sounds like a reasonable short-term approach (don't
eliminate address-based hashing, but don't use it where alternatives
exist and let the hash table code concentrate on the case where it's
not involved.)

Here's a (half-thought-out) idea: suppose that "support" for
address-based hashing was an attribute of the hash table - set via an
argument to MAKE-HASH-TABLE and not subsequently changeable.  If an
address-based key (something for which no other good hash
transformation exists) was used in a hash table that didn't provide
such support, lookup would devolve into linear search : it would still
"work", but would be grossly inefficient.  On the other hand, when
keys that had good hashing functions were used in such a table, the
code could safely ignore all of the issues related to address-based
hashing.

There's room for argument as to what the default value of :ADDRESS-BASED
should be, but if it was in effect things would work as they do now:
there'd be some significant paranoia and locking involved when such a
hash table was accessed, but the lookup should still be efficient and
the overhead might be less significant.

The difference between this scheme and the current one (where there is
a bit that indicates whether an address-based key might be present) is
that you need some degree of exclusive access to the hash table (locking)
before you can even trust that bit not to change out from under you.

If :ADDRESS-BASED defaulted to T, things would be no worse than they are
now; if it was appropriate to say :ADDRESS-BASED NIL explicitly, things
could be made a lot better in that case.  (The :ADDRESS-BASED T case could
be made better than it is, but it will always cost more to deal with that
than to not do so.)

>
>  - Stoney
>
>
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/cgi-bin/mailman/listinfo/openmcl-devel
>
>


_______________________________________________
Openmcl-devel mailing list
Openmcl-devel at clozure.com
http://clozure.com/cgi-bin/mailman/listinfo/openmcl-devel



More information about the Openmcl-devel mailing list