[Openmcl-devel] Fun with Measurements

Sun Oct 29 17:54:39 PST 2006

I looked at this a bit further.  One of the things worth noting is that
the 64-bit versions cons about 10X more garbage then the ppc32 version
does, and there seem to be several culprits:

  - Karatsuba multiplication doesn't cons less, but is often faster thane
    the traditional approach when the arguments are large enough (larger
    than 512 bits each)

  - the ppc32 function which computes the GCD of two bignums makes
    (hopefully stack-allocated) copies of its arguments and destructively
    shifts the bits in those copies in order to implement the binary GCD
    algorithm; the 64-bit versions use the same algorithm, but just use
    ASH (and therefore often conses a lot of intermediate garbage).  The
    destructive version's often quite a bit faster, but it's certainly
    trickier and has been buggy in the past.

To see the difference here, try:

(time (funcall 'gcd 280569676808644557387337788167777653560971888059084158006822626289592295693503468318839239774860901429523738381260905084515263599090701258830281252873096413028081756120863383325257129029917197540937593465326855740635522307754507353652159778101666912259223768820708398085193137786364785044883735847499940079214529769690690867674131976007262950800353219277610038947231578008350182868327007804092805001063990729974223722244260595259464049936627106560616122585111115383088800687974832973908174643062724032495013779819334040275301281649863909976429332274826632666715021507635201987436769268263715272135666802064508114166145396680603126746288342889454743022590805381911974457971352578441243060123830478477200943349837945910921400436415340216320000 14083340311588299327612393445140751809526468780098428770219464072573195107596994221691613394719952407300282086123182338852187176676354645936540309295277222180031260104640292998325634687550131514240164105249849666382075796179266937013538038284753567744))

(the weird "(funcall 'gcd ...) idiom is to discourage the compiler from doing any
constant-folding which might lead us to conclude that this never takes  any time ...)

The ppc32 version claimed to take about a millisecond and consed 112 bytes; the
ppc64 version on the same machine took about 15x longer and consed about 670 K bytes.

I was going to say that there wasn't likely to be a single smoking gun here; there
may not be, but this may be pretty close to it.

  - the ppc32 version sometimes goes to great lengths to avoid calculating both
    the quotient and remainder in things like TRUNCATE when it can tell that only
    one of them will be used; the 64-bit versions tend to compute both results
    and one of them is often ignored.

I did a 64-bit version of CCL::%BIGNUM-BIGNUM-GCD that does destructive shifts
on copies of its arguments; the GCD call above went from 15 milliseconds to
less than 1, and the (COMPUTE-PI-DECIMAL 200) test went from about 1.5 seconds
to about .12 seconds, so (at least in this case) there seems to have been a
smoking gun.

On Sun, 29 Oct 2006, Brent Fulgham wrote:

[...]

> ;; this can be 1e-6 on most compilers, but for COMPUTE-PI-DECIMAL on
> ;; OpenMCL we lose lotsa precision
> (defun fuzzy-eql (a b)
>  (< (abs (/ (- a b) b)) 1e-4))

I don't remember having heard about that; I have no idea whether it's
still true or in what cases it is or was true.

(I do know of one case exposed by the GCL test suite where coercing a
float to a rational and back again loses a bit on x86-64; I imagine
that this comment was written quite a while ago and doubt if that
case is what it's talking about.)