[Openmcl-devel] eql and equal hash tables help please

Gary Byers gb at clozure.com
Thu Jan 17 09:57:33 PST 2013

Hashing a pathname (sanely) involves hashing its components (its name, type, host, directory, ...),
most of which are strings, symbols, or a list of strings and symbols.)  The function NAMESTRING
creates a string from those components (sortof, kindof); hashing a pathname's namestring isn't
(shouldn't be) a substantially different amount of work than hashing the components and combining
those hashes somehow should be.

Anyhow ...

;;; LIST-OF-RANDOM-PATHNAMES generates a list of pathnames with "random" string
;;; components that're supposed to be of typical lengths.
(defvar *random-pathnames* (list-of-random-pathnames 125000))

(defun hash-random-pathnames (&optional (size 30))
   (let* ((hash (make-hash-table :test #'equal :size size)))
     (dolist (elt *random-pathnames* hash)
       (setf (gethash elt hash) elt))))

Last night, I found that HASH-RANDOM-PATHNAMES took a very long time.  I don't think
that I ever waited for it to finish; when I interrupted it, I saw essentially the
same thing that George Khouri reported (excruciatingly slow progress that was
practically indistinguishable from being hung.)  If pathname hashing does a poor
job of generating relatively unique hashes, then GETHASH and (SETF GETHASH) devolve
to linear search: if there are 10000 entries in the table, those functions may need
to compare all 10000 existing entries to their key (with EQUAL on pathnames) before
they find that the key isn't present.  (SETF GETHASH) would add the key, and the
next iteration would have to do 10001 comparisons.

Today's a bright new day, and after a refereshing nap (and hopefully fixing that
brain-damage in the hashing code):

? (time (hash-random-pathnames))
took 270,022 microseconds (0.270022 seconds) to run.
        2,695 microseconds (0.002695 seconds, 1.00%) of which was spent in GC.
During that period, and with 8 available CPU cores,
      268,017 microseconds (0.268017 seconds) were spent in user mode
            0 microseconds (0.000000 seconds) were spent in system mode
  8,548,448 bytes of memory allocated.
  1,530 minor page faults, 0 major page faults, 0 swaps.
#<HASH-TABLE :TEST EQUAL size 124149/151316 #x302002CFD62D>

Note that only 124149 of my 125000 random pathnames are apparently unique under EQUAL.

Dave Cooper suggested that performance might be better if the hash table was created
with a larger :size argument.

? (time (hash-random-pathnames 150000))
took 77,101 microseconds (0.077101 seconds) to run.
During that period, and with 8 available CPU cores,
      76,005 microseconds (0.076005 seconds) were spent in user mode
           0 microseconds (0.000000 seconds) were spent in system mode
  2,823,952 bytes of memory allocated.
#<HASH-TABLE :TEST EQUAL size 124149/150000 #x302002FEE8FD>

Indeed it is.  I got those numbers on a fairly fast machine; YMMV, but I'd expect
a proportional speedup (of several orders of magnitude) on low-end machines, too.
I checked that change into the trunk and if it survives a few days if smoke-testing
will backport it to 1.8.

On Thu, 17 Jan 2013, Robert Goldman wrote:

> On 1/17/13 Jan 17 -9:30 AM, R. Matthew Emerson wrote:
>> On Jan 17, 2013, at 3:32 AM, George Khouri <gk1 at four-four.com> wrote:
>>> namestrings do serve as a workaround, and so does an EQL hash table with pathnames - I just wanted to clarify whether  EQUAL hash tables should work with a large number of pathnames.  CCL stores the first 10k entries in a second or two on my system. It's after that that things aren't pretty.
>> I think that what Gary was trying to point out is that CCL seems to be doing a poor to terrible job of hashing pathname objects.  He gave an example where two (slightly) different pathnames hash to the same code.  At the risk of stating the obvious, hash tables don't tend to work well if there are a lot of hash collisions like this...
>> So, the bad performance that you're seeing may be due to that defect in CCL.  There's nothing wrong in principle with using an equal hash table with a lot of pathname keys.
>> _______________________________________________
>> Openmcl-devel mailing list
>> Openmcl-devel at clozure.com
>> http://clozure.com/mailman/listinfo/openmcl-devel
> Aren't EQUAL and EQUALP hash-tables prone to behave badly in CL in
> general?  It seems like we don't give the compiler much help in figuring
> out how to hash the contents of such tables....
> Would hashing the namestrings work better, because most CL
> implementations have hash functions that work well on strings?
> best,
> r
> _______________________________________________
> Openmcl-devel mailing list
> Openmcl-devel at clozure.com
> http://clozure.com/mailman/listinfo/openmcl-devel

More information about the Openmcl-devel mailing list