[Openmcl-devel] OpenGL Performance PPC vs Intel-Mac 32/64
Gary Byers
gb at clozure.com
Wed Mar 18 16:55:58 PDT 2009
I hope that most people will find this all totally uninteresting; sorry
that it's so long.
On Wed, 18 Mar 2009, Alexander Repenning wrote:
>
> On Mar 18, 2009, at 4:09 AM, Gary Byers wrote:
>
>> <http://clozure.com/pipermail/openmcl-devel/2009-January/008832.html>
>> tried to answer the same question when it was asked a few weeks later.
>
>
> this one mostly talks about the difference between PPC and 86-64.
No, it doesn't. A lot of time and effort went into explaining in
that message where the foreign function call overhead on x86-64
Darwin/MacOSX CCL comes from and what the tradeoffs were and are,
after having explained that more tersely once before.
If neither Matt nor I can explain this to you in a way that makes
sense, it's possible that it just never will. I'll try one last
time, but if that still doesn't work, please accept that:
1) we understand that this is the case (i.e., that there's more FF-call
overhead on x86-64 Darwin than on other platforms.)
2) we (I at least) view the root cause of this as being an OS limitation;
the Linux, FreeBSD, Solaris, and Win64 x86-64 ports do not have this ff-call
overhead (though the Win64 port is likely impacted in other ways by
a similar OS limitation.)
3) we (I) believe that the workaround for that OS limitation is the
better of two bad choices.
Even if you don't understand all of the issues, I'm sure that you
understand that no one would deliberately introduce overhead to
ff-call unless they believed that the alternative to doing so was
worse.
> I
> understand the "one register short" argument between 86-64 and PPC. Do I get
> this right: in essence the 86-64 ff call requires 2 additional system calls
> to compensate for missing PPC RISC registers, right? But how come the 86-32
> is faster? How does the register versus need for syscalls tradeoff change in
> 32bit assuming there are no additional registers available in 32bit? Does the
> 32bit call use 64bit register to store 2 x 32bit info and therefore double in
> essence the number of registers, which in turn allows it to avoid the
> syscalls?
No.
Here (indented slightly) are what I think are the relevant parts
of that message. None of this has anything to do with the PPC.
OSX on x8664 doesn't allow a user-mode process to use an
otherwise-unused segment register for its own purposes (e.g., keeping
track of thread-local data.) Linux, FreeBSD, and Solaris all provide
this functionality; Win64 doesn't ("Windows sucks, and everyone
understands how.") Because of this deficiency, the choices are
basically:
- keep lisp thread data in a general-purpose register, which might
  negatively affect the performance of lots of things.
- "share" the segment register that the OS uses for C thread data,
  switching it between C data and Lisp data on foreign function calls
  and callbacks (and therefore slowing down foreign function calls and
  callbacks by the cost of 2 system calls.)
I'd rather not do either of these things. If one assumes that many
foreign function calls are themselves fairly expensive operations,
then adding additional overhead to foreign function calls seemed more
attractive (er, um, "less unattractive") than adding some (variable)
amount of overhead to lots of lisp functions. It's true that some
foreign function calls don't do much of anything, and the syscall
overhead on foreign function calls might dwarf the actual cost of the
operation. (This is likely true of #_glVertex3f: if eliminating
function call overhead yields visible results, then it seems likely
that #_glVertex3f isn't "computationally expensive" in any sense.)
One could reasonably argue that it would have been better to make the
other unattractive choice (this was done on win64, where it wasn't
clear that there was even something as ugly as the Darwin hack.)
That might be correct (I'm a little skeptical, and the win64 machine
that I use is slower than other machines, making comparisons
difficult; I don't know.) In any case, the fact that foreign
function calls have more overhead on x8664 Darwin than they do on
other platforms isn't some Big Unsolved Mystery; it has something to
do with the fact that the OS is effectively leaving us a register
short (and with what I thought was the best way to deal with that.)
Implicit in all of this is a basic understanding of x86-64 architecture
(doesn't everyone have such an understanding?)
- the x86-64 provides 16 general-purpose registers. Some are more
  general-purpose than others. That's not a huge number (though
  I'll get no sympathy from rme for saying that), and if some
  random OS said something stupid and implausible - like "we've
  decided that context switch would be even faster if we didn't
  save and restore r15, so don't use r15 in application code" -
  then having one less register to work with would likely have
  some measurable negative effect on lots of code.
A 64-bit architecture consumes memory (including cache memory)
roughly twice as fast as a 32-bit processor does, so it's
often the case that - all other things being equal - 64-bit
programs run slower than 32-bit equivalents (on the PPC,
SPARC, MIPS, etc.) The 16 GPRs that the x86-64 provides are twice
as many as the 32-bit x86 offered, and compilers' ability to
utilize more registers (relative to the 8 offered by 32-bit x86)
often overcomes these cache effects; one often sees claims that
64-bit x86 programs run ~15% faster than the 32-bit equivalents
on average. If that number's reasonably accurate, then we might
assume that the cost of losing one of the 8 new GPRs to Imaginary
Dumb OS (IDOS) would be a few percent on average (roughly 1/8
of the difference between how much slower the 64-bit machine
would be because it eats cache lines twice as quickly and how
much faster it is because it has more registers available.)
The difference between 9 GPRs and 8 is probably greater than
the difference between 16 and 15; I don't really know how to
realistically calculate the cost of losing a GPR to IDOS,
but saying "a few percent" is probably vague enough to be correct.
- for historical reasons, it also provides 6 "segment registers".
4 of these segment registers have architecturally imposed meaning;
even though segmented addressing isn't really used, the 4
legacy segment registers - %cs, %ds, %es, and %ss - can be used by the
OS to provide some forms of memory protection (and can be used to call
privileged code from non-privileged code in some cases.) On modern
x86 OSes, linear (32- or 64-bit) addresses are still (at some level)
relative to some segment register, but all segment registers (or
at least the code, data, and stack segment registers) reference
the same (linear) addresses, possibly with different protection
attributes.
- Since everyone hated segment registers so much, when Intel introduced
  the i386 (which allowed linear 32-bit addressing) they added two
  new segment registers (%fs and %gs). These new segment registers
didn't have any architecturally-dictated semantics associated with
them; they could be used to address memory. If it was arranged
(sometimes with the help of the OS) that segment register %fs
pointed to linear address #x1000, then the segmented address %fs:0
was equivalent to the linear address #x1000. Some applications may
have used the new segment registers to access their data (and on
machines that offered few GPRs dedicating a segment register for
that purpose was preferable to dedicating a GPR). People who
mocked the x86 viewed this whole development as an excuse to do
so more actively, since the machine clearly needed more GPRs and
the extra segment registers were the wrong thing.
As OS-level threads became more widely used, x86-based OSes began
using one of the "new" extra segment registers to address
thread-specific data. (So if an OS used %fs for thread-local storage,
the value of a thread-local variable - "errno", the C library error
return value - would be accessible at a fixed location relative
to %fs, and each thread's code could sanely set and test the value
of its "errno" variable without worrying about what value that variable
had in other threads.)
Whichever of %fs or %gs wasn't used by the threads runtime was generally
made available to the application. Getting a segment register to point
to a particular linear address is a privileged operation, and every
32-bit OS that I can think of provided some (possibly obscure/baroque/
undocumented) way of doing that. (The mechanism that Windows uses
is notorious for being casual about checking its arguments and can
and has been (ab)used to do malicious things.)
- According to legend, when AMD first introduced the x86-64 architecture,
  they'd planned to get rid of the %fs and %gs segment registers (since they'd
  now be offering a somewhat more reasonable number of GPRs). Supposedly,
  Microsoft persuaded them not to, presumably because they (and possibly
  their customers) had grown used to the idea of using the segment registers
  to access thread-local data, and didn't want to burn/dedicate GPRs
  for this purpose (since there is real reason to believe that that
  would lead to an overall performance decrease.)
- Back in the present (at least 2-3 years ago) and back in context.
Lisp thread-local data isn't the same as C thread-local storage;
fortunately, most 64-bit OSes provide a documented, supported
way of setting up whichever of %fs/%gs isn't being used by the
threads library for application-specific thread-local storage,
and CCL uses the provided mechanism on x86-64 versions on Linux,
FreeBSD, and Solaris.
Win64 doesn't seem to provide any means of doing this (at the
very least, we couldn't find any way of doing it.) We had no
choice but to have each thread keep thread-local data in a GPR
and lose that "few percent" of performance across the board.
I have no idea how Win64 sets up the segment register that
it uses for its thread APIs' use; as far as I can tell, this
happens during thread creation in the OS kernel.
OSX on x86-64 doesn't provide any way of setting up the unused
segment register (which happens to be %fs) so that it can be
used to address application-specific thread-local data. I've
reported this as a bug and I'm sure that other people have as
well. I don't know why they haven't: whether it's something
they've thought about and decided not to address, whether they
haven't really thought about it, whether there are technical limitations
in the current (32-bit) OS kernel that make it prohibitively difficult,
whether all of the people that could implement it are busy working
on the iPhone, or what the issue is.
Apple does expose (slightly and in some sense) the mechanism
that the threads library uses to set up the segment register
that it uses. That means that (since lisp code has no interest
in quickly addressing C thread library data but
C code expects the %gs register to address that data) we have
an alternative to simply accepting the "few percent" across-the-board
performance hit we'd take if we just burned a GPR: we can do a system
call to change where %gs points whenever we transition between
lisp and C code (foreign function calls, callbacks, during exception
processing.) Those system calls are somewhat expensive (anywhere
from hundreds of nanoseconds to a few microseconds); that overhead
is likely negligible in many contexts (doing disk or network I/O)
and is likely pronounced in other contexts (#_glVertex3f, surely;
perhaps other things.)
So, the fact that #_glVertex3f is slower than it could/should be is
not desirable but is not surprising.
That's my last attempt to explain this.
> alex
>
>>
>> On Wed, 18 Mar 2009, R. Matthew Emerson wrote:
>>
>>>
>>> On Mar 18, 2009, at 2:16 AM, Alexander Repenning wrote:
>>>
>>>> Playing around with some direct mode (i.e., not super efficient)
>>>> OpenGL code we notice some big speed differences:
>>>>
>>>> The time to draw a cube (4 sides only actually):
>>>>
>>>> 1.66 GHz PPC Mac: MCL 88 us, CCL 1.3 60 us
>>>>
>>>> 2.6 GHz Intel Mac: CCL 1.3 32-bit 33 us, CCL 1.3 64-bit 92 us
>>>>
>>>>
>>>> I remember we had a discussion on PPC versus Intel call overhead but I
>>>> am a bit surprised about 64bit Intel Mac CCL on a much faster machine,
>>>> even ignoring the faster GPU, is slower than MCL on the old PPC.
>>>> Notice, the OpenGL commands are pretty trivial. Most of the time is
>>>> spent in the foreign function call overhead. How can CCL 32bit Intel
>>>> Mac be 3x faster than 64bit ?
>>>
>>> The first several paragraphs of the following message talk about this.
>>>
>>> http://clozure.com/pipermail/openmcl-devel/2008-December/008766.html
>>>
>>> In summary, due to missing functionality on Darwin/x8664 (no way to
>>> set the fsbase MSR) we resort to a workaround that involves performing
>>> a system call before and after each ff-call. When the foreign
>>> function doesn't do a lot of work, this overhead is significant.
>>> (When the foreign code does something expensive, like I/O, the
>>> overhead matters less.)
>>>
>>> On Darwin/x8632, we don't have to do the workaround, so ff-calls are
>>> cheaper.
>>>
>>> _______________________________________________
>>> Openmcl-devel mailing list
>>> Openmcl-devel at clozure.com
>>> http://clozure.com/mailman/listinfo/openmcl-devel
>>>
>>>
>>
>
> Prof. Alexander Repenning
>
> University of Colorado
> Computer Science Department
> Boulder, CO 80309-430
>
> vCard: http://www.cs.colorado.edu/~ralex/AlexanderRepenning.vcf
>
>