[Openmcl-devel] high res timer for CCL

Gary Byers gb at clozure.com
Wed Dec 17 11:23:32 PST 2008





On Tue, 16 Dec 2008, Alexander Repenning wrote:

> thanks Gary. Great feedback and code. I used the code to make a Hemlock 
> interface integrated timer:
>
> http://www.agentsheets.com/lisp/Clozure_CL/ns%20Timer.lisp
>
>
> One curious thing you will find in the code is that the timer function is 
> called 4 times. This is because it takes quite a bit of time to call the 
> timer itself (~3us on my intel mac), which needs to be compensated for.  Further, 
> calling the timer function, or any foreign function, appears to take more 
> time the first time around. You can run this test:
>
>
> (without-interrupts
> (let (t1 t2 t3 t4)
> (setq t1 (#_mach_absolute_time))
> (setq t2 (#_mach_absolute_time))
> (setq t3 (#_mach_absolute_time))
> (setq t4 (#_mach_absolute_time))
> (format t "~%t2-t1=~A, t3-t2=~A, t4-t3=~A" (- t2 t1) (- t3 t2) (- t4 t3))))
>
> I am getting t2-t1 to be significantly higher on average than t3 - t2 or t4 - 
> t3
>
> by inserting the code to be measured between t3 and t4 one can estimate the
> time it takes to run the code to be (t4 - t3) - (t3 - t2), ignoring t1
>
> Just wondering: why would it take more time the first time around? Is there 
> some kind of FFI call caching going on?


There's a performance hit on the first call to a foreign symbol whose
address couldn't be determined (because the defining library wasn't
loaded when the symbol was first referenced).  The defining library
in the case of #_mach_absolute_time is the standard C library 
(libSystem on OSX), so that wouldn't be a factor.  That's the only
kind of FFI caching that I can think of that -could- be involved,
and it pretty clearly isn't a factor here.
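
If it ever were a factor, the easy workaround would be to make one
throwaway call before the timed region, so that any one-time cost gets
paid up front.  A minimal sketch (nothing CCL-specific beyond the #_
reader macro used above):

;; warm-up sketch: one discarded call before timing
(#_mach_absolute_time)
(let ((t1 (#_mach_absolute_time))
      (t2 (#_mach_absolute_time)))
  (- t2 t1))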

Doing a foreign-function call in CCL involves saving the thread's
lisp state in thread-specific memory, loading some foreign state
from thread-specific memory, doing the call, saving a little foreign
state, restoring the lisp state, and returning.  The few dozen
instructions on either side of the call keep the GC happy and are
pretty much unavoidable; the cost of imprecise/partially conservative
GC can be much higher.
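
You can see that save/restore code for yourself by wrapping the call
in a function and disassembling it; a quick sketch (the exact output
depends on the platform and CCL version):

;; the instructions around the call are the lisp<->foreign transition
(defun just-mach-absolute-time ()
  (#_mach_absolute_time))

(disassemble 'just-mach-absolute-time)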

On x86-64 Darwin, there's an additional source of overhead related to
the fact that Darwin's missing basic functionality that most other
x86-64 OSes (that're less ... functionality-challenged ... than Darwin)
recognize to be necessary.  Working around that involves
doing a Mach system call before and after each ff-call, and the cost
of each syscall and whatever context-switching is involved is probably
comparable to that of a few hundred instructions.  In the context of
"most" ff-calls - doing I/O, memory management, IPC, other
generally-expensive stuff - the extra syscall overhead seems to get
lost in the noise; in the context of calling #_mach_absolute_time,
the extra syscall overhead is pretty pronounced:

Your code (as the manual notes, WITHOUT-INTERRUPTS doesn't do what
you likely think it does):

(let (t1 t2 t3 t4)
  (setq t1 (#_mach_absolute_time))
  (setq t2 (#_mach_absolute_time))
  (setq t3 (#_mach_absolute_time))
  (setq t4 (#_mach_absolute_time))
  (format t "~%t2-t1=~A, t3-t2=~A, t4-t3=~A" (- t2 t1) (- t3 t2) (- t4 t3)))

generates results like:

t2-t1=341, t3-t2=146, t4-t3=140

in 32-bit Darwin x86 CCL, and generates results like:

t2-t1=1995, t3-t2=1809, t4-t3=1795

on x86-64 Darwin.  That's pretty pronounced - roughly an order of
magnitude - and it's directly attributable to those extra syscalls.
(Getting rid of the syscalls would speed up ff-call on x86-64 Darwin,
but would make one less register available for general use
and would likely slow down many other things.)
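
For what it's worth, a steadier way to estimate the per-call cost than
back-to-back samples is to amortize it over many calls.  A sketch (the
function name is mine; the result is in mach_absolute_time units):

(defun average-mach-absolute-time-cost (&optional (n 1000000))
  ;; average per-call cost, amortized over N calls
  (let ((start (#_mach_absolute_time)))
    (dotimes (i n)
      (declare (ignorable i))
      (#_mach_absolute_time))
    (/ (- (#_mach_absolute_time) start) (float n))))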

----------
Dear Apple:

Boy, those iPhones are really neat.  I hear that next year they'll be
available in a wide range of exciting colors. I can't wait to enhance
my digital lifestyle!

P.S.  Please provide a way to set the fs.base MSR on x86-64 Darwin,
damn it.  Sheesh.
---------

In C, where there's no ff-call overhead involved, the numbers
are generally smaller still, as one would expect:

[~] gb at nachtwacht> ./machtime32
t2-t1=35, t3-t2=20, t4-t3=25

In all cases (though it seems to vary a bit), the t2-t1 number is
generally larger than the other numbers, as you note.  The best
guess that I have as to why this is true can be explained by looking
at the source to the function that implements mach_absolute_time.
In the sources that I have (which probably correspond to an early
OSX 10.5 release), the i386 version of that function is in 
.../osfmk/i386/commpage/commpage_mach_absolute_time.s.  Apple's
'commpage' functions - I don't know whether that stands for "common
page", "communication page", or something else - live at a fixed
address, are mapped into the address space of all processes, and
access data that can be modified by the OS kernel.  The routine
that returns the current time in nanoseconds looks like:

Lnanotime:
         pushl   %ebp
         movl    %esp,%ebp
         pushl   %esi
         pushl   %ebx

0:
         movl    _COMM_PAGE_NT_GENERATION,%esi   /* get generation (0 if being changed) */
         testl   %esi,%esi                       /* if being updated, loop until stable */
         jz      0b

         rdtsc                                   /* get TSC in %edx:%eax */
         subl    _COMM_PAGE_NT_TSC_BASE,%eax
         sbbl    _COMM_PAGE_NT_TSC_BASE+4,%edx

         movl    _COMM_PAGE_NT_SCALE,%ecx

         movl    %edx,%ebx
         mull    %ecx
         movl    %ebx,%eax
         movl    %edx,%ebx
         mull    %ecx
         addl    %ebx,%eax
         adcl    $0,%edx

         addl    _COMM_PAGE_NT_NS_BASE,%eax
         adcl    _COMM_PAGE_NT_NS_BASE+4,%edx

         cmpl    _COMM_PAGE_NT_GENERATION,%esi   /* have the parameters changed? */
         jne     0b                              /* yes, loop until stable */

         popl    %ebx
         popl    %esi
         popl    %ebp
         ret

For those who don't read x86 assembler: the code loops, busy-waiting
for the value of _COMM_PAGE_NT_GENERATION to be non-zero, executes
an 'rdtsc' instruction to get the current cycle count, uses the values
of some other COMM_PAGE variables to try to scale the cycle count
to a nanosecond count, computes a candidate value, then checks to
see if the value of _COMM_PAGE_NT_GENERATION has changed since it
was last sampled and does the whole calculation over again if so.
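
In Lisp terms - and assuming, from the 64x32-bit multiply above, that
_COMM_PAGE_NT_SCALE is a 32.32 fixed-point cycles->nanoseconds factor
(the names below are mine) - the conversion amounts to roughly:

(defun tsc->nanoseconds (tsc tsc-base scale ns-base)
  ;; ns-base + ((tsc - tsc-base) * scale) >> 32
  (+ ns-base (ash (* (- tsc tsc-base) scale) -32)))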

What this basically means is that the effective clock rate (number
of cycles per wall-clock time unit) isn't constant.  Most consumer-
oriented CPUs these days can run at reduced clock rates (i.e., they
offer 'frequency scaling').  On some platforms, this is exposed
to the user (in a variety of ways).  A "2 GHz" Core 2 Duo laptop
can actually run at 5 different clock frequencies:

[~] gb at sensei> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 
2001000 2000000 1600000 1200000 800000

Obviously, the lower frequencies run cooler and consume less power.  Most
Linux desktop environments provide control panels that display the current
clock rate and allow the user to exercise some control over if and how
frequency scaling is done.  My general impression (I may be wrong) is
that there's often quite a bit of latency: if the system is mostly idle
and running at a low clock rate, it takes "a bit" (for some value of
"a bit") of activity for system software to notice and to crank up
the clock rate, and that it may take "a bit" of inactivity for the 
clock rate to scale back down.  On Linux, I think that this is just
done by having a background process sample the CPU load every so often,
and if it did that too often, it'd be artificially increasing the system
load.

I'm not sure that OSX exploits CPU frequency scaling as such, but they
clearly do things that cause those COMM_PAGE_NT variables (that control
the cycles->nanoseconds mapping) to change, and from what I can tell,
they change frequently.  (Those other things might include putting
a CPU into some sort of short-duration sleep, until a timer goes off
or something important happens.)  I think that the PPC version of the
CPU system preferences panel that came with CHUD used to have a checkbox
that controlled whether or not the CPUs were allowed to "nap"; I don't
think that I've seen anything on x86 Darwin that controls this.

If we knew exactly what policies governed this and a few other things,
we could probably come up with a specific answer as to why the 
t2-t1 time is almost always greater than the others.  I've just written
a few pages that effectively say "uh, I don't know", but it's pretty
clear that it'll take longer to determine elapsed time in nanoseconds
when the CPU frequency is changing than it would when it's stable, and
I think that it's likely a good thing (in terms of battery life, thermal
considerations, ...) if things that make CPU frequency variable happen
with low latency.

The performance of real-world code is affected by lots of things
that aren't entirely controllable (cache and TLB misses, context
switches, dynamic branch prediction mechanisms) and whose costs
aren't always predictable.  Even if high FF-CALL overhead and variable
sampling costs weren't factors, I'm not sure that I'd consider
nanosecond-resolution times for user-mode code on a non-real-time OS
to be particularly meaningful (e.g., the variability due to those other
things is likely a lot greater than that shown by your code above.)
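
If you do want somewhat more stable numbers despite all that, one
common trick (a sketch of my own, not something from your code) is to
repeat the measurement and keep the minimum, which is less sensitive
to cache misses, context switches, and clock-rate changes than the mean:

(defun min-elapsed-ticks (thunk &optional (trials 100))
  ;; smallest elapsed time, in mach_absolute_time units, over TRIALS runs
  (loop repeat trials
        minimize (let ((start (#_mach_absolute_time)))
                   (funcall thunk)
                   (- (#_mach_absolute_time) start))))

;; e.g. (min-elapsed-ticks (lambda () (expt 2 1000)))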



  >
> Alex
>


