[Openmcl-devel] Bignum speed
Gary Byers
gb at clozure.com
Thu Oct 16 22:23:01 PDT 2008
On Fri, 17 Oct 2008, Waldek Hebisch wrote:
> Gary Byers wrote:
>> Sorry for not responding sooner.
>>
>> "Plans" in the sense of "in N weeks, someone plans to do X, Y, and Z
>> and therefore speed things up by M percent" ? I don't know of any
>> such plans, no.
>>
>> "Plans" in the sense of many people thinking "someday, someone should
>> take the time to do X, Y, and Z, and therefore speed things up ..." ?
>> Sure.
>>
>> Designing and implementing really good bignum code takes significant
>> time. Taking bignum code that's often pretty bad (worse than it seems
>> that it should be ...) and making it "not so bad" or "not bad at all"
>> or "maybe kind of good" takes far less time. In some cases, one would
>> hit a wall where there's no more low-hanging fruit left and further
>> improvements involve better algorithms (or better support for doing
>> unboxed arithmetic, or ...), but I don't think that we're particularly
>> close to that wall, or that anyone's put any effort/time into getting
>> there.
>>
>> I don't know exactly how to quantify what it would mean to go from
>> where things are now to where we could get with a moderate amount of
>> effort; I've poked at some of the code in the last few days and seem
>> to have gotten some improvement (maybe 2X-3X in some cases, though
>> it's hard to know if those cases are representative, and - since memory
>> allocation/GC costs factor into this - I should note that those results
>> weren't obtained in a controlled environment.) (The kinds of stuff
>> that I did just involved the obvious first steps one might take in
>> trying to improve code: removing function call overhead, declaring the
>> types of things that're heavily used and non-obvious, removing
>> unnecessary typechecking, etc. Again, there's a limit to how far this
>> approach can take you; again, it's not surprising that doing obvious
>> and simple things to improve performance ... improves performance, at
>> least some.)
>>
>> Two things that someone interested in improving CCL's bignum
>> performance could provide that would help are:
>>
>> 1) A fistful of cash. (Hey, I had to mention it.) We -do- do work
>> on CCL that isn't directly funded, but a significant amount of work
>> (multiple person-months) is much more likely to happen if it's funded
>> than if it isn't.
>>
>
> FriCAS is an unpaid project, so no cash from here... OTOH I do not
> think multiple person-months are needed. The current performance
> leader for large integer calculations is the GMP library -- it gets
> its performance from a combination of advanced algorithms and tuned
> assembler code (for several popular architectures). Independently
> getting comparable performance is a big effort, but calling GMP via
> the FFI should not be that hard. Alternatively, reaching some
> intermediate level of performance would be a huge improvement and is
> easier.
>
>> 2) Concrete examples of things where performance is bad. (It may
>> be the case that things are just bad across the board; I don't know.)
>> The simpler it is to reproduce a performance problem (or an outright
>> bug, for that matter), the more time someone can put into trying to
>> improve/fix the problem.
>>
>
> I mean multiplication of large numbers. Here size matters. The first
> range is just above fixnums: it is important because such numbers
> are relatively frequent. The second is between tens and thousands
> of digits. Such numbers are less frequent but routinely appear
> in intermediate steps of some problems where all input and output is
> in the fixnum range. Finally, there are huge numbers, a million
> digits or more. Huge numbers are more specialized than merely large
> ones: natural problems with small-number inputs that need
> million-digit numbers in intermediate steps can be expected to need
> a lot of them, and so to run out of memory anyway.
>
> A little benchmark is below. To make the time easier to measure,
> dot-product adds all products of elements of two lists. To avoid the
> possibility of trivial optimizations all numbers are different, the
> products are added, and the result is assigned to a variable. I make
> sure that the result is not printed, because printing is usually
> quite time consuming and could change the timing in an undesired way.
>
>
> (setf a (expt 9 1000))
> (defun dot-product (l1 l2)
>   (let ((s 0)
>         (ll1 l1))
>     (loop
>       (cond
>         ((null l2)
>          (return-from dot-product s))
>         ((null ll1)
>          (setf ll1 l1)
>          (setf l2 (cdr l2)))
>         (t
>          (setf s (+ s (* (car ll1) (car l2))))
>          (setf ll1 (cdr ll1)))))))
>
> (compile 'dot-product)
>
> (setf l1 nil)
> (dotimes (i 100) (setf l1 (cons (- a i) l1)))
> (setf l2 nil)
> (dotimes (i 100) (setf l2 (cons (+ i a) l2)))
>
> (time (progn (setf ress (dot-product l1 l2)) nil))
>
> Using Clozure CL 1.2-r10552 on a 2.4GHz Core Duo the time is
> 2.200137s. On the same machine sbcl needs 0.104s, clisp 0.297694s,
> and gcl 0.09s.
>
> For larger numbers the ratio gets even worse.
For what it's worth, 32-bit CCL on a 2.1GHz PPC takes around .5
seconds. Still not good, but it makes the x86-64 result look
especially bad.
The 32-bit x86 CCL (not released yet) takes .295 seconds on the same
machine (similar to yours.) (That's enough of a difference to
make me wonder whether the x86-64 version isn't grossly broken
as well as naively/casually implemented.)
Calling out to GMP's relatively high-level "mpz" integer routines
(which involves a lot of copying in CCL because of GC issues, though
probably not as much copying as I did ...) takes around .180 seconds.
I didn't try using the mpn functions, which might be a little harder
to use but would likely be faster still. (There'd still be some
copying required and some foreign function call overhead involved.)
I don't know how much faster that would be than using the mpz layer.
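
In case anyone wants to play with this, here's a rough, untested
sketch of what the mpz route looks like through CCL's FFI. It is not
what I actually timed: GMP-MUL and the decimal-string round trip are
made up for illustration, the __gmpz_* names are just the symbols that
GMP's mpz_* macros expand to, the 16-byte figure is an upper bound on
sizeof(mpz_t) on 32- and 64-bit systems, and the library name will
vary by platform.

(ccl:open-shared-library "libgmp.so") ; adjust to whatever your system uses

(defun gmp-mul (x y)
  ;; Multiply two Lisp integers by round-tripping through GMP's mpz
  ;; layer.  The decimal-string conversion is deliberately dumb; it's
  ;; just the easiest way to move digits across the FFI boundary, and
  ;; it shows the kind of copying cost described above.
  (ccl:%stack-block ((gx 16) (gy 16) (gz 16)) ; room for three mpz_t structs
    (ccl:external-call "__gmpz_init" :address gx :void)
    (ccl:external-call "__gmpz_init" :address gy :void)
    (ccl:external-call "__gmpz_init" :address gz :void)
    (ccl:with-cstrs ((cx (write-to-string x :base 10))
                     (cy (write-to-string y :base 10)))
      (ccl:external-call "__gmpz_set_str" :address gx :address cx :int 10 :int)
      (ccl:external-call "__gmpz_set_str" :address gy :address cy :int 10 :int))
    (ccl:external-call "__gmpz_mul" :address gz :address gx :address gy :void)
    (let* ((p (ccl:external-call "__gmpz_get_str"
                                 :address (ccl:%null-ptr) :int 10 :address gz
                                 :address))
           (result (parse-integer (ccl:%get-cstring p))))
      ;; GMP allocated the digit string with its allocator (malloc by
      ;; default), so hand it back to free; this assumes nobody has
      ;; installed custom GMP allocators.
      (ccl:external-call "free" :address p :void)
      (ccl:external-call "__gmpz_clear" :address gx :void)
      (ccl:external-call "__gmpz_clear" :address gy :void)
      (ccl:external-call "__gmpz_clear" :address gz :void)
      result)))

A real version would move the bignum's limbs directly with
mpz_import/mpz_export (or use the mpn functions) instead of going
through decimal strings, which is where most of the remaining copying
cost would go away -- but even this crude version shows the shape of
the thing, e.g. (gmp-mul (expt 9 1000) (1+ (expt 9 1000))).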
As a short-term fix, calling out to GMP is undoubtedly the best way of
getting things >10X faster (and presumably getting them to scale
better.) We can do something to make this easy somehow - offer some
way for people who have GMP installed to persuade the lisp to call
out to GMP when there's a benefit to doing so, as there clearly is in
cases like your example. Making the lisp depend on GMP outright isn't
very attractive, though: CCL can build itself in well under a minute
on most machines and has few external dependencies, and we'd rather
not change that.
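
The opt-in hook could be as simple as a size-based dispatch; the toy
version below is only meant to show the shape of it (the threshold
value, the variable name, and MAYBE-GMP-* are all invented for the
example, and GMP-MUL is the sketch earlier in this message).

(defvar *gmp-multiply-threshold* 4096
  "Only hand a multiplication to GMP when both operands are at least
this many bits wide; below that, FFI and copying overhead swamps any
algorithmic win.  The right value would have to be measured.")

(defun maybe-gmp-* (x y)
  ;; Use the native * unless both arguments are wide enough that the
  ;; GMP-backed multiply should pay for its overhead.
  (if (and (>= (integer-length x) *gmp-multiply-threshold*)
           (>= (integer-length y) *gmp-multiply-threshold*))
      (gmp-mul x y)  ; GMP-backed multiply sketched above
      (* x y)))      ; native fixnum/bignum multiply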
Right now, calling out to GMP wins despite the copying and FFI
overhead. Even though I'm sure some of that copying can be
eliminated, it's always going to be necessary to do -some-, and in the
longer run that's probably a limiting factor.
Long ago, whenever I was chastised for sleeping through high school
math class, I asked "Why not? Why would anyone need to learn this
stuff?" I never got a convincing answer at the time, so it's a
little weird to be finding out now ...
>
> --
> Waldek Hebisch
> hebisch at math.uni.wroc.pl
>
>