[Openmcl-devel] Fun with Measurements

Sun Oct 29 20:29:51 PST 2006

On Sun, 29 Oct 2006, Brent Fulgham wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> On Oct 29, 2006, at 11:45 AM, Brent Fulgham wrote:
>
>> I'm not qualified to assess whether Karatsuba multiplication would
>> provide significant gain.  It might be useful to see if there is a
>> more full-featured BIGNUM test suite we could use to assess the full
>> set of operations.  I'll see what I can find.
>
> I didn't find a good set of BIGNUM tests, but I did find an interesting 
> comparison of Common Lisp floating point implementations:
>
> http://www.cs.berkeley.edu/~mhoemmen/matlisp/floating-point.html
>
> They do a bit of poking and prodding to the system, and reach the conclusion 
> that OpenMCL boxes floating point numbers.
>
> - -Brent

I assume that they reached the conclusion that it "often boxes floating point
numbers unnecessarily" (e.g., that it often boxes an intermediate result only
to turn around and access the unboxed value a few instructions later.)  In
general, all lisp objects are "boxed" : they're machine words that contain
a few bits of type information and either an immediate value or a pointer
to a memory-allocated value.  On a 32-bit machine, there's no room for a
32-bit IEEE-single-float value (and certainly not a 64-bit double float)
and a few type (tag) bits that say "this lisp object is a FLOAT of some kind"
in a single machine word, so boxing either a SINGLE or DOUBLE-FLOAT involves
allocating a few words of memory and returning a tagged pointer to that memory
address.  If this happens a lot, the cost of memory allocation/reference and
of GC can be very significant.)

(One of the benefits of a 64-bit platform is that there's plenty of room in
a 64-bit machine word for a 32-bit single-float and a few tag bits, making
the cost of boxing a single float much smaller.)

In general, a lisp function has to accept boxed arguments and return boxed
result(s).  (You might be able to construct private calling sequences and
return conventions for some functions, but the general case still has to
be supported.)  This is as true of builtin arithmetic functions (+, =, ...)
as it is of user-defined functions, but (unless conservative OPTIMIZE or
NOTINLINE declarations inhibit its doing so) the compiler is licensed to
generate equivalent arithmetic code inline in the former case.

In order to be able to do any better than a function call for an arithmetic
primitive, the compiler generally needs to know something about the types
of the arguments. (In the absence of any knowledge about argument types,
it may be faster to try to exploit fixnum-only cases inline and handle the
general case out-of-line; that works - when it does - because fixnum arithmetic
is often (a) common and (b) cheap.)  If the compiler doesn't know anything
about the types of the arguments to an arithmetic primitive, it generally
can't do anything much better than a function call to a "generic" arithmetic
function (something that probably isn't implemented as a GENERIC-FUNCTION
but conceptually could have been, something that's effectively a large, nested
CASE statement.)  That primitive pretty much has to take boxed arguments and
return boxed result(s), and for some argument/result types that introduces
memory-management overhead.  (It also involves a trip through some nested
CASEs and CONDs, and sometimes involves COERCion of some arguments.)

In a function like FFT (see below), many of the calls to arithmetic
primitives involve operands (and therefore results) of type SINGLE-FLOAT,
and - depending on how calls to things like COS and SIN and AREF and
(SETF AREF) are compiled - very few of those results need to be boxed.
In some cases, determining that an intermediate result doesn't have to
be boxed - when things happen between the point when the result is produced
and the point when it's used - requires at least some framework for flow-
and lifetime-analysis in the compiler and may also require some runtime
support.  OpenMCL's compiler doesn't do a good job of this, and that's
a long-standing (> 20 years) and medium-deep problem that sometimes leeds
to poor FP performance.

I think that much more often poor FP performance in OpenMCL has had to
do with the compiler's failure to recognize operand types (and therefore
fail to inline FP operations and pay both the boxing/boxing overhead and
the "generic" arithmetic overhead.)  Unnecessary boxing certainly costs
something, but it's basically just "creating garbage at a high rate",
and the EGC and large fast caches and other factors probably make this
cost a lot lower than it used to be; in a lot of cases that I've seen,
the cost of wading through a few layers of function call to eventually
get to some sort of FADD instruction - as opposed to simply exectuting
that FADD inline - seems to be much more significant than the memory-
management overhead associated with boxing.

Paying a little bit of attention (as opposed to paying almost no attention)
sometimes yields large dividends.  People who're interested and are able
to use ppc64 versions of OpenMCL might find it interesting to compare the
FFT times for 1.0 and the recent 1.1 snapshots.  There's clearly still a
lot of room for improvement there (more stuff is inlined, but sometimes
not inlined very well ...), and the fact that boxing SINGLE-FLOATs on
ppc64 doesn't involve memory-management overhead makes it easier to see
how signigicant other factors can be.