[Openmcl-devel] Address-based hashing considered harmful. Maybe.

Sat Mar 1 05:06:08 PST 2003

The term "address-based hashing" refers to the practive of using a
hash table key's (current) address to derive an integer "hash" value
that can be used to probe a hash table for that key.  It's desirable
that different objects hash to different hash values and imperative
that the same object consistently hash to the "same" hash value when
the hash-table test depends on the identity of the key or some component
of it.

The key transformation is very simple (just scrambling the bits in
the key's address) and the low bits of object addresses tend to be
fairly evenly distributed, so the actual lookup in cases where
address-based hashing is used tends to be very efficient.

Since addresses can change (as a result of GC activity), this
efficient probing tends to be outweighed by associated overhead:
it's necessary to inhibit GC activity during critical sections
of hash-table code, it's necessary to deal with the case where
a hash table becomes invalid because of key movement (and therefore
has to be "rehashed", which can be a very expensive operation.)
Further, these concerns make it difficult to efficiently synchronize
concurrent access to hash tables.

In cases where the identity of the key or a component of the key
determines the value of the hash-table-test, it isn't correct
to derive the hash value from some mutable attribute of the key
(those mutable attributes don't determine the object's identity
and identity determines equivalence.)  In many interesting cases,
lisp objects have immutable attributes that could be used to
hash them into well-distributed hash values.

Numbers are immutable, so the bits that make up a number can
be scrambled.

Symbols have immutable pnames, so the bits in the characters in
those pnames can be used to derive a hash value.

It would not be unreasonable to add a (hidden) field to structure
instances, CLOS objects, and built-in structure-like objects (packages,
hash-tables, ...); this field could be initialized when the object's
created to something relatively unique and never subsequently modified.

-Most- functions have unique code vectors associated with them; the
exceptions (generic functions, interpreted functions) could  have
an extra "unique id" field.

That seems to leave arrays and cons cells as objects without handy
immutable attributes that can be used to derive good hash values.
If we were to get rid of the concept of address-based hashing
in OpenMCL and chose not to add "unique ID" fields to cons cells
and arrays, EQ-based hash lookup on those types of keys would
degenerate to linear search.

Does anyone use cons cells or arrays/strings as keys in EQ/EQL
hash tables (or as components of keys in other other types of
tables where the test-function depends on the identity of those
components) ?  Does anyone know of any such code in common use ?

At this point, I tend to see substantial gains in performance in
the general case as being worth horrible performance in the case
where conses/arrays are used as EQ/EQL hash table keys, but that's
based on the assumption that the latter case is exceedingly rare
and seldom useful.

_______________________________________________
Openmcl-devel mailing list
Openmcl-devel at clozure.com
http://clozure.com/cgi-bin/mailman/listinfo/openmcl-devel