[Openmcl-devel] Bignum speed

Thu Oct 16 23:17:38 PDT 2008

Some comments:

1. It would be interesting to know why the 64 bit version is so slow. It makes you wonder what else might be affected besides bignums? That 10X differnce in speed is very scary. Fell like memory system or cache issues to me.

2. If GMP is really a 10X speed-up with the copying and there could be a switch to dynamic bind it, that would be a big win for CCL IMHO. However I wouldn't do it until I understood the 10X issue?

3. I tried playing around with the Scheme-Chicken (some nice work there) and tried to build some of the packages which they call "eggs". I had lots of problems getting it to work right with either the 32 bit or 64 bit version of GMP. I also found it hard to build 32 bit GMP on a 64 bit system.

4. While it's nice to have the 32 bit version of CCL, all the CPUs for the last few years and going forward are 64 bit and all the optimization should be for the 64 bit machines.

5. Anyone have any idea of the distribution of ranges of bignums? As previously mentioned I suspect there are many cases where the range isn't much bigger than 128 or 256 bits.

6. You can write some very fast code using SSE for 128 bits and it should dispatch and run in parallel with other non-SSE instructions.

7. Intel might have some very fast code that could do some of this, that could also be dynamically linked. Intel Intrinsics come to mind, but they might be tied to the Intel C compiler (which is available for X-Code now). They might have there own extended precision stuff. It's worth asking.

8. Intel is such a big place it can be hard to navigate. Kind of like they'd be happy to give it to you if they knew they had it or could find it. If Clozure isn't an Intel developer, I'd suggest it. It gives you a way to to access stuff like this.

9. Has anyone every run CCL under Intel's VTune?

Best,

leb

At 11:23 PM -0600 10/16/08, Gary Byers wrote:
>On Fri, 17 Oct 2008, Waldek Hebisch wrote:
>
>> Gary Byers wrote:
>>> Sorry for not responding sooner.
>>>
>>> "Plans" in the sense of "in N weeks, someone plans to do X, Y, and Z
>>> and therefore speed things up by M percent" ?  I don't know of any
>>> such plans, no.
>>>
>>> "Plans" in the sense of many people thinking "someday, someone should
>>> take the time to do X, Y, and Z, and therefore speed things up ..." ?
>>> Sure.
>>>
>>> Designing and implementing really good bignum code takes significant
>>> time.  Taking bignum code that's often pretty bad (worse than it seems
>>> that it should be ...) and making it "not so bad" or "not bad at all"
>>> or "maybe kind of good" takes far less time.  In some cases, one would
>>> hit a wall where there's no more low-hanging fruit left and further
>>> improvements involve better algorithms (or better support for doing
>>> unboxed arithmetic, or ...), but I don't think that we're particularly
>>> close to that wall, or that anyone's put any effort/time into getting
>>> there.
>>>
>>> I don't know exactly how to quantify what it would mean to go from
>>> where things are now to where we could get with a moderate amount of
>>> effort; I've poked at some of the code in the last few days and seem
>>> to have gotten some improvement (maybe 2X-3X in some cases, though
>>> it's hard to know if those cases are representative and - since memory
>>> allocation/GC costs factor into this - should note that those results
>>> weren't obtained in a controlled environnent.  (The kinds of stuff
>>> that I did just involved the obvious first steps one might take in
>>> trying to improve code: removing function call overhead, declaring the
>>> types of things that're heavily used and non-obvious, removing
>>> unnecesary typechecking, etc.  Again, there's a limit to how far this
>>> approach can take you; again, it's not surprising that doing obvious
>>> and simple things to improve performance ...  improves performance, at
>>> least some.)
>>>
>>> Two things that someone interested in helping CCL's bignum performance
> >> could provide that'd help are:
>>>
>>> 1) A fistful of cash.  (Hey, I had to mention it.)  We -do- do work
>>> on CCL that isn't directly funded, but a significant amount of work
>>> (multiple person-months) is much more likely to happen if it's funded
>>> than if it isn't.
>>>
>>
>> FriCAS is an unpaid project, so no cash from here... OTOH I do not
>> think multiple person-months are needed.  Current performance leader
>> for large integer calculations is GMP library -- it gets performance
>> via combination of advanced algorithms and tuned assembler code
>> (for several popular architectures).  Independently getting
>> comparable performance is a big effort.  But calling GMP via
>> FFI should not be that hard.  Alternatively, getting some
>> intermediate level of performance would make huge improvement
>> and is easier.
>>
>>> 2) Concrete examples of things where performance is bad.  (It may
>>> be the case that things are just bad across the board; I don't know.)
> >> The simpler it is to reproduce a performance problem (or an outright
>>> bug, for that matter), the more time someone can put into trying to
>>> improve/fix the problem.
>>>
>>
>> I mean multiplication of large numbers.  Here size matters.  First
>> range is just above fixnums: it is important because such numbers
>> are relatively frequent.  The second is between tens and thousends
>> of digits.  Such numberes are less frequent but routinely appear
>> in intermediate steps in some problems where all input and output is
>> in fixnum range.  Finally, there are huge numbers, million digits
>> or more.  Huge numers are more specialized than merely large ones:
>> natural problems which small numbers as input that need million digit
>> numbers in intermediate steps are expected to need lot of them, so
>> to run out memory anyway.
>>
>> A little benchmark below.  To make time easier to measure
>> dot-product adds all products of elements of two lists.  To
>> avoid possiblity of trivial optimizations all numbers are
>> different, products are added, result is assigned to a variable.
>> I make sure that result is not printed, because printing is
>> usually quite time consuming and could change timing in undesired way.
>>
>>
>> (setf a (expt 9 1000))
>> (defun dot-product (l1 l2)
>>    (let ((s 0)
>>          (ll1 l1))
>>        (loop
>>            (cond
>>                ((null l2)
>>                    (return-from dot-product s))
>>                ((null ll1)
>>                    (setf ll1 l1)
>>                    (setf l2 (cdr l2)))
>>                (t
>>                    (setf s (+ s (* (car ll1) (car l2))))
>>                    (setf ll1 (cdr ll1)))))))
>>
>> (compile 'dot-product)
>>
>> (setf l1 nil)
>> (dotimes (i 100) (setf l1 (cons (- a i) l1)))
>> (setf l2 nil)
>> (dotimes (i 100) (setf l2 (cons (+ i a)  l2)))
>>
>> (time (progn (setf ress (dot-product l1 l2)) nil))
>>
>> Using Clozure CL 1.2-r10552 on 2.4GHz Core Duo time is  2.200137s
>> On the same machine sbcl needs 0.104s, clisp 0.297694 s, gcl 0.09s
>>
>> For larger numbers ratio goes even worse.
>
>
>For what it's worth, 32-bit CCL on a 2.1GHz PPC taks around .5
>seconds.  Still not good, but it makes the x86-64 result look
>espcially bad.
>
>The 32-bit x86 CCL (not released yet) takes .295 seconds on the same
>machine (similar to yours.)  (That's enough of  a difference to
>make me wonder whether the x86-64 version isn't grossly broken
>as well as naively/casually implemented.)
>
>Calling out to GMP's relatively high-level "mpz" integer routines
>(which involves a lot of copying in CCL because GC issues, though
>probably not as much copying as I did ...) takes around .180 seconds
>
>I didn't try using the mpn functions, which might be a little harder
>to use but would likely be faster still.  (There'd still be some
>copying required and some foreign function call overhead involved.)
>I don't know how much faster it would be than using the mpz layer
>is.
>
>As a short-term fix, calling out to GMP is undoubtedly the best way of
>getting things >10X faster (and presumably getting them to scale
>better.)  We can do something to make this easy somehow - offer some
>way nfor people who have GMP installed to persuade the lisp to call
>out to GMP when there's a benefit to doing so, as there clearly is in
>cases like your example.  It's not very attractive to make the lisp
>depend on GMP: CCL can build itself in well under a minute on most
>machines and has few external dependencies, and it's not very
>attractive to change that.
>
>Right now, calling out to GMP wins despite the copying and FFI
>overhead.  Even though I'm sure some of that copying can be
>eliminated, it's always going to be necesary to do -some-, and in the
>longer run that's probably a limiting factor.
>
>Long ago, whenever I was chastized for sleeping through high school
>math class, I asked "why not ? Why would anyone need to learn this
>stuff ?"  I never got a convincing answer at the time, so it's a
>little weird to be finding out now ...
>
>
>
>>
>> --
>>                              Waldek Hebisch
>> hebisch at math.uni.wroc.pl
>>
>>
>_______________________________________________
>Openmcl-devel mailing list
>Openmcl-devel at clozure.com
>http://clozure.com/mailman/listinfo/openmcl-devel