[Openmcl-devel] OpenGL Performance PPC vs Intel-Mac 32/64

Wed Mar 18 16:55:58 PDT 2009

I hope that most people will find this all totally uninteresting; sorry
that it's so long.

On Wed, 18 Mar 2009, Alexander Repenning wrote:

>
> On Mar 18, 2009, at 4:09 AM, Gary Byers wrote:
>
>> <http://clozure.com/pipermail/openmcl-devel/2009-January/008832.html>
>> tried to answer the same question when it was asked a few weeks later.
>
>
> this one mostly talks about the difference between PPC and 86-64.

No, it doesn't.  A lot of time and effort went into explaining where
the foreign function call overhead on x86-64 Darwin/MacOSX CCL comes
from and what the tradeoffs were and are in that message, after having
explained that more tersely once before.

If neither Matt nor I can explain this to you in a way that makes
sense, it's possible that it just never will.  I'll try one last
time, but if that still doesn't work, please accept that:

1) we understand that this is the case (i.e., that there's more FF-call
    overhead on x86-64 Darwin than on other platforms.)
2) we (I at least) view the root cause of this as being an OS limitation;
    the Linux, FreeBSD, Solaris, and Win64 x86-64 ports do not have this ff-call
    overhead (though the Win64 port is likely impacted in other ways by
    a similar OS limitation.)
3) we (I) believe that the workaround for that OS limitation is the
    better of two bad choices.

Even if you don't understand all of the issues, I'm sure that you
understand that no one would deliberately introduce overhead to
ff-call unless they believed that the alternative to doing so was
worse.

> I understand the "one register short" argument between 86-64 and PPC. Do I
> get this right: in essence the 86-64 ff call requires 2 additional system
> calls to compensate for missing PPC RISC registers, right? But how come the
> 86-32 is faster? How does the register versus need for syscalls tradeoff
> change in 32bit assuming there are no additional registers available in
> 32bit? Does the 32bit call use 64bit register to store 2 x 32bit info and
> therefore double in essence the number of registers, which in turn allows it
> to avoid the syscalls?

No.

Here (indented slightly) are what I think are the relevant parts
of that message.  None of this has anything to do with the PPC.

  OSX on x8664 doesn't allow a user-mode process to use an
  otherwise-unused segment register for its own purposes (e.g., keeping
  track of thread-local data.)  Linux, FreeBSD, and Solaris all provide
  this functionality; Win64 doesn't ("Windows sucks, and everyone
  understands how.")  Because of this deficiency, the choices are
  basically:

  - keep lisp thread data in a general-purpose register, which might
    negatively affect the performance of lots of things.
  - "share" the segment register that the OS uses for C thread data,
    switching it between C data and Lisp data on foreign function calls
    and callbacks (and therefore slowing down foreign function calls and
    callbacks by the cost of 2 system calls.)

  I'd rather not do either of these things.  If one assumes that many
  foreign function calls are themselves fairly expensive operations,
  then adding additional overhead to foreign function calls seemed more
  attractive (er, um, "less unattractive") than adding some (variable)
  amount of overhead to lots of lisp functions.  It's true that some
  foreign function calls don't do much of anything, and the syscall
  overhead on foreign function calls might dwarf the actual cost of the
  operation (this is likely true of #_glVertex3f; if eliminating
  function call overhead yields visible results, then it seems likely
  that #_glVertex3f isn't "computationally expensive" in any sense.)

  One could reasonably argue that it would have been better to make the
  other unattractive choice (this was done on win64, where it wasn't
  clear that there was even something as ugly as the Darwin hack.)
  That might be correct (I'm a little skeptical, and the win64 machine
  that I use is slower than other machines, making comparisons
  difficult; I don't know.)  In any case, the fact that foreign
  function calls have more overhead on x8664 Darwin than they do on
  other platforms isn't some Big Unsolved Mystery; it has something to
  do with the fact that the OS is effectively leaving us a register
  short (and with what I thought was the best way to deal with that.)
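
The tradeoff in the quoted text can be put into a small cost model.
This is only a sketch: the 3% penalty for dedicating a GPR, the
1 microsecond system call cost, and all of the function names are
assumptions chosen for illustration, not measurements from CCL.

```python
# Rough cost model for the two workarounds described above.
# Every number here is an illustrative assumption, not a measurement.

def cost_dedicated_gpr(lisp_time_us, gpr_penalty):
    """Option 1: keep lisp thread data in a GPR; all lisp code slows down."""
    return lisp_time_us * (1.0 + gpr_penalty)

def cost_shared_segment(lisp_time_us, n_ff_calls, syscall_us):
    """Option 2: swap the segment register around each ff-call (2 syscalls)."""
    return lisp_time_us + n_ff_calls * 2 * syscall_us

def break_even_ff_calls(lisp_time_us, gpr_penalty, syscall_us):
    """Number of ff-calls at which both options cost the same."""
    return lisp_time_us * gpr_penalty / (2 * syscall_us)

lisp_time_us = 1_000_000.0  # one second of lisp execution (assumed)
gpr_penalty = 0.03          # "a few percent", assumed to be 3%
syscall_us = 1.0            # assumed cost of one system call

n = break_even_ff_calls(lisp_time_us, gpr_penalty, syscall_us)
print(f"break-even at about {n:.0f} ff-calls per second of lisp code")
```

Under these assumed numbers, a program making fewer ff-calls than the
break-even count is better off with the shared segment register, and
a program making many more trivial ff-calls is not.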

Implicit in all of this is a basic understanding of x86-64 architecture
(doesn't everyone have such an understanding ?)

- the x86-64 provides 16 general-purpose registers.   Some are more
   general-purpose than others.  That's not a huge number (though
   I'll get no sympathy from rme for saying that), and if some
   random OS said something stupid and implausible - like "we've
   decided that context switch would be even faster if we didn't
   save and restore r15, so don't use r15 in application code" -
   then having one less register to work with would likely have
   some measurable negative effect on lots of code.

   A 64-bit architecture consumes memory (including cache memory)
   roughly twice as fast as a 32-bit one does, so - all other
   things being equal - 64-bit programs often run slower than
   their 32-bit equivalents (on the PPC, SPARC, MIPS, etc.)
   The 16 GPRs that the x86-64 provides is twice
   as many as the 32-bit x86 offered, and compilers' ability to
   utilize more registers (relative to the 8 offered by 32-bit x86)
   often overcomes these cache effects; one often sees claims that
   64-bit x86 programs run ~15% faster than the 32-bit equivalents
   on average.  If that number's reasonably accurate, then we might
   assume that the cost of losing one of the 8 new GPRs to Imaginary
   Dumb OS (IDOS) would be a few percent on average (roughly 1/8
   of the difference between how much slower the 64-bit machine
   would be because it eats cache lines twice as quickly and how
   much faster it is because it has more registers available.)
   The difference between 9 GPRs and 8 is probably greater than
   the difference between 16 and 15; I don't really know how to
   realistically calculate the cost of losing a GPR to IDOS,
   but saying "a few percent" is probably vague enough to be correct.

- for historical reasons, it also provides 6 "segment registers".
   4 of these segment registers have architecturally imposed meaning;
   even though segmented addressing isn't really used, the 4
   legacy segment registers - %cs, %ds, %es, and %ss - can be used by the
   OS to provide some forms of memory protection (and can be used to call
   privileged code from non-privileged code in some cases.)  On modern
   x86 OSes, linear (32- or 64-bit) addresses are still (at some level)
   relative to some segment register, but all segment registers (or
   at least the code, data, and stack segment registers) reference
   the same (linear) addresses, possibly with different protection
   attributes.

- Since everyone hated segment registers so much, when Intel introduced
   the i386 (which allowed linear 32-bit addressing) it introduced two
   new segment registers (%fs and %gs).  These new segment registers
   didn't have any architecturally-dictated semantics associated with
   them; they could be used to address memory.  If it was arranged
   (sometimes with the help of the OS) that segment register %fs
   pointed to linear address #x1000, then the segmented address %fs:0
   was equivalent to the linear address #x1000.  Some applications may
   have used the new segment registers to access their data (and on
   machines that offered few GPRs dedicating a segment register for
   that purpose was preferable to dedicating a GPR).  People who
   mocked the x86 viewed this whole development as an excuse to do
   so more actively, since the machine clearly needed more GPRs and
   the extra segment registers were the wrong thing.
   As OS-level threads became more widely used, x86-based OSes began
   using one of the "new" extra segment registers to address
   thread-specific data.  (So if an OS used %fs for thread-local storage,
   the value of a thread-local variable - "errno", the C library error
   return value - would be accessible at a fixed location relative
   to %fs, and each thread's code could sanely set and test the value
   of its "errno" variable without worrying about what value that variable
   had in other threads.)
   Whichever of %fs or %gs wasn't used by the threads runtime was generally
   made available to the application. Getting a segment register to point
   to a particular linear address is a privileged operation, and every
   32-bit OS that I can think of provided some (possibly obscure/baroque/
   undocumented) way of doing that.  (The mechanism that Windows uses
   is notorious for being casual about checking its arguments and can
   and has been (ab)used to do malicious things.)

- According to legend, when AMD first introduced the x86-64 architecture,
   they'd planned to get rid of %fs and %gs segment registers (since they'd
   now be offering a somewhat more reasonable number of GPRs.)  Supposedly,
   Microsoft persuaded them not to, presumably because they (and possibly
   their customers) had grown used to the idea of using the segment registers
   to access thread-local data, and didn't want to burn/dedicate GPRs
   for this purpose (since there is real reason to believe that that
   would lead to an overall performance decrease.)

- Back in the present (at least 2-3 years ago) and back in context.
   Lisp thread-local data isn't the same as C thread-local storage;
   fortunately, most 64-bit OSes provide a documented, supported
   way of setting up whichever of %fs/%gs isn't being used by the
   threads library for application-specific thread-local storage,
   and CCL uses the provided mechanism on x86-64 versions on Linux,
   FreeBSD, and Solaris.

   Win64 doesn't seem to provide any means of doing this (at the
   very least, we couldn't find any way of doing it.)  We had no
   choice but to have each thread keep thread-local data in a GPR
   and lose that "few percent" of performance across the board.
   I have no idea how Win64 sets up the segment register that
   it uses for its thread APIs' use; as far as I can tell, this
   happens during thread creation in the OS kernel.

   OSX on x86-64 doesn't provide any way of setting up the unused
   segment register (which happens to be %fs) so that it can be
   used to address application-specific thread-local data.  I've
   reported this as a bug and I'm sure that other people have as
   well.  I don't know why they haven't provided one: whether it's something
   they've thought about and decided not to address, whether they
   haven't really thought about it, whether there are technical limitations
   in the current (32-bit) OS kernel that make it prohibitively difficult,
   whether all of the people that could implement it are busy working
   on the iPhone, or what the issue is.

   Apple does expose (slightly and in some sense) the mechanism
   that the threads library uses to set up the segment register
   that it uses.  That means that (since lisp code has no interest
   in quickly addressing C thread library data but
   C code expects the %gs register to address that data) we have
   an alternative to simply accepting the "few percent" across-the-board
   performance hit we'd take if we just burned a GPR: we can do a system
   call to change where %gs points whenever we transition between
   lisp and C code (foreign function calls, callbacks, exception
   processing.)  Those system calls are somewhat expensive (anywhere
   from hundreds of nanoseconds to a few microseconds); that overhead
   is likely negligible in many contexts (doing disk or network I/O)
   and is likely pronounced in other contexts (#_glVertex3f, surely;
   perhaps other things.)
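
The "few percent" estimate for losing a GPR (from the first item in
the list above) can be written out as arithmetic.  This is one
possible reading of that reasoning, not a measurement: the ~15% net
speedup is the commonly quoted figure mentioned there, while the cache
penalty values and the function name are assumptions.

```python
# One reading of the "cost of losing a GPR" estimate.
# net_speedup: how much faster 64-bit x86 code runs overall (~15% claimed).
# cache_penalty: assumed slowdown from larger pointers (unknown, so we vary it).

def cost_of_one_gpr(net_speedup, cache_penalty, extra_gprs=8):
    """Spread the total benefit of the extra registers evenly across them."""
    # The registers must pay back the cache penalty and still
    # deliver the observed net speedup.
    register_benefit = net_speedup + cache_penalty
    return register_benefit / extra_gprs

for cache_penalty in (0.0, 0.05, 0.10):
    cost = cost_of_one_gpr(0.15, cache_penalty)
    print(f"cache penalty {cache_penalty:.0%}: one GPR costs about {cost:.1%}")
```

Spreading the benefit evenly is itself a simplification, since the
ninth register is probably worth more than the sixteenth.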

So, the fact that #_glVertex3f is slower than it could/should be is
not desirable but is not surprising.
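
As a rough sanity check, the cube timings quoted further down in this
message (33us for 32-bit CCL and 92us for 64-bit CCL on the same
machine) can be turned into an implied per-syscall cost.  The number
of ff-calls needed to draw the cube is not stated anywhere in the
thread, so the call counts below are assumptions, and the sketch
attributes the whole difference to the two extra system calls, which
is a simplification.

```python
# Implied per-syscall cost from the timings reported in this thread.
t64_us = 92.0              # CCL 1.3 64-bit, 2.6 GHz Intel Mac (from the thread)
t32_us = 33.0              # CCL 1.3 32-bit, same machine (from the thread)
extra_us = t64_us - t32_us # time attributed to the workaround (simplification)

def per_syscall_us(extra_us, n_ff_calls, syscalls_per_call=2):
    """Average cost of one system call if all extra time is syscall overhead."""
    return extra_us / (n_ff_calls * syscalls_per_call)

# n_ff_calls is a guess: e.g. 18 would be glBegin + 16 glVertex3f + glEnd.
for n_ff_calls in (10, 18, 30):
    cost = per_syscall_us(extra_us, n_ff_calls)
    print(f"{n_ff_calls} ff-calls: about {cost:.2f} us per system call")
```

For any of these assumed call counts the result lands between roughly
one and three microseconds per system call.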

That's my last attempt to explain this.

> alex
>
>> 
>> On Wed, 18 Mar 2009, R. Matthew Emerson wrote:
>> 
>>> 
>>> On Mar 18, 2009, at 2:16 AM, Alexander Repenning wrote:
>>> 
>>>> Playing around with some direct mode (i.e., not super efficient)
>>>> OpenGL code we notice some big speed differences:
>>>> 
>>>> The time to draw a cube (4 sides only actually):
>>>> 
>>>> 1.66 GHz PPC Mac:   MCL 88 us,  CCL 1.3 60 us
>>>> 
>>>> 2.6 GHz Intel Mac: CCL 1.3 32bit: 33 us  CCL 1.3 64bit 92 us
>>>> 
>>>> 
>>>> I remember we had a discussion on PPC versus Intel call overhead but I
>>>> am a bit surprised about 64bit Intel Mac CCL on a much faster machine,
>>>> even ignoring the faster GPU, is slower than MCL on the old PPC.
>>>> Notice, the OpenGL commands are pretty trivial. Most of the time is
>>>> spent in the foreign function call overhead. How can CCL 32bit Intel
>>>> Mac be  3x faster than 64bit ?
>>> 
>>> The first several paragraphs of the following message talk about this.
>>> 
>>> http://clozure.com/pipermail/openmcl-devel/2008-December/008766.html
>>> 
>>> In summary, due to missing functionality on Darwin/x8664 (no way to
>>> set the fsbase MSR) we resort to a workaround that involves performing
>>> a system call before and after each ff-call.  When the foreign
>>> function doesn't do a lot of work, this overhead is significant.
>>> (When the foreign code does something expensive, like I/O, the
>>> overhead matters less.)
>>> 
>>> On Darwin/x8632, we don't have to do the workaround, so ff-calls are
>>> cheaper.
>>> 
>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>> 
>>> 
>> 
>
> Prof. Alexander Repenning
>
> University of Colorado
> Computer Science Department
> Boulder, CO 80309-430
>
> vCard: http://www.cs.colorado.edu/~ralex/AlexanderRepenning.vcf
>
>