[Openmcl-devel] Quick HW question...

Tue Nov 16 07:16:39 PST 2010

This is some good information.  Thanks for the pointers.  But it also
highlights an issue I've thought about from time to time: with modern
processor architectures (especially pipelines, caches, and now cores)
how does one _not_ write naive code for these things?  Sure, 90+% of the
worry on this goes to the compiler writers, but it can be easy to
accidentally write something that defeats their efforts.

/Jon

On Mon, 2010-11-15 at 17:01 -0500, Daniel Weinreb wrote:
> I agree with what you're saying, and will even amplify it.
> 
> In fact, I was just at a talk at MIT by Nir Shavit, who
> does a lot of research into concurrency control
> mechanisms on real, current processors. 
> 
> http://www.math.tau.ac.il/~shanir/
> 
> He says that the cost of CAS (compare-and-swap)
> instructions is very high compared to what
> you might think, on a multi-core system,
> and worse as the number of cores goes
> up (and the number of cache levels therefore
> increases).  The effect on caches is really bad,
> and hurting the caching these days really
> slows things down.
> 
> Dave Moon said to me several years ago that
> the entire concept of looking at main memory
> as if it were an addressable array of words
> is entirely out of date if you're looking for
> high performance (in certain situations).
> You must think of it as a sequence of cache
> lines.  And it gets more complicated once
> you're dealing with both L2 and L3 caches,
> which have different line sizes, and different
> sets of cores accessing them.  When you have
> an L3 cache, you really have a NUMA architecture,
> and if you want speed, you have to write your
> code accordingly, i.e., a core should not read
> and write data from some other L2 cache
> than its own and expect that to be fast.
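
The "sequence of cache lines" view above can be made concrete with a small
layout sketch. This is only an illustration, not anything from CCL: the
64-byte line size, the thread count, and every name below are assumptions
chosen for the example, and C is used because the point is about memory
layout rather than Lisp. The naive struct packs per-thread counters into
one line, so cores updating different counters still contend for the same
line; the padded struct gives each counter a line of its own.

```c
#include <assert.h>     /* static_assert */
#include <stdalign.h>   /* alignas */
#include <stdatomic.h>

/* Assumption: 64-byte cache lines. Common on x86, but not universal. */
#define CACHE_LINE_SIZE 64
#define NUM_THREADS 4   /* hypothetical number of worker threads */

/* Naive layout: all counters fit in one cache line, so writes from
   different cores keep pulling that line away from each other. */
struct packed_counters {
    atomic_long count[NUM_THREADS];
};

/* Padded layout: alignment rounds each element up to a full line. */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) atomic_long count;
};

struct padded_counters {
    struct padded_counter per_thread[NUM_THREADS];
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "each counter should occupy exactly one cache line");

/* Each thread increments only its own slot. */
static void bump(struct padded_counters *c, int thread_index)
{
    atomic_fetch_add_explicit(&c->per_thread[thread_index].count, 1,
                              memory_order_relaxed);
}
```

The padded form trades memory for isolation; whether that pays off depends
on how often the counters are actually written from different cores.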
> 
> -- Dan
> 
> 
> 
> Your result about getting rid of the spin locks is
> less paradoxical than you might think, or even
> not paradoxical at all once you take a look at
> the data that guys like Shavit are producing.
> 
> Gary Byers wrote:
> > Someone asked about the i7, and I remember professing ignorance (several
> > paragraphs of it.)
> >
> > Mac Pros (at least some models) use/have used Intel XEON processors
> > which in turn use HTT; it's reasonable to assume that the OSX scheduler's
> > been HTT-aware for some time.  (I don't know if it's true, but it's a
> > reasonable assumption.)
> >
> > CCL uses spinlocks on most platforms; acquiring a spinlock involves a
> > loop like:
> >
> > (block got-lock
> >   (loop
> >     (dotimes (i *spin-lock-tries*)
> >       (if (%store-conditional something somewhere)
> >         (return-from got-lock t)))
> >     (give-up-timeslice)))
> >
> > where *spin-lock-tries* is 1 on a uniprocessor system and maybe a few
> > thousand on a multiprocessor system.  On a system that uses
> > hyperthreading, that sort of loop interferes with the hardware's 
> > ability to schedule another hardware thread, and it's necessary to
> > use a special "pause" instruction (a kind of NOP) in the loop to
> > advise the hardware that the current thread wasn't really making
> > progress.
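
The loop described above, with the "pause" hint placed inline, might look
roughly like the following in C11. This is a minimal sketch under stated
assumptions, not CCL's implementation: demo_spinlock, spin_try,
spin_acquire, spin_release, cpu_relax and SPIN_LOCK_TRIES are hypothetical
names, the try count is an arbitrary stand-in for *spin-lock-tries*, and
sched_yield() stands in for give-up-timeslice.

```c
#include <assert.h>
#include <sched.h>       /* sched_yield (POSIX) */
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the x86 PAUSE instruction */
#else
#define cpu_relax() ((void)0)     /* no spin hint on other targets here */
#endif

/* Hypothetical stand-in for *spin-lock-tries* on a multiprocessor. */
enum { SPIN_LOCK_TRIES = 4000 };

typedef struct { atomic_int held; } demo_spinlock;

/* One attempt: compare-and-swap 0 -> 1. Returns true if we got the lock. */
static bool spin_try(demo_spinlock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->held, &expected, 1);
}

static void spin_acquire(demo_spinlock *l)
{
    for (;;) {
        for (int i = 0; i < SPIN_LOCK_TRIES; i++) {
            if (spin_try(l))
                return;
            cpu_relax();   /* tell the core this thread is only waiting */
        }
        sched_yield();     /* give up the timeslice, then try again */
    }
}

static void spin_release(demo_spinlock *l)
{
    int was = atomic_exchange(&l->held, 0);
    assert(was == 1);      /* releasing an unheld lock is a caller bug */
    (void)was;
}
```

Every failed attempt here is still a CAS on a shared cache line, so under
contention the loop generates coherence traffic regardless of the pause.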
> >
> > While profiling a customer's application a few years ago, we found
> > that - when run on XEONs - a lot of time was reported as being spent
> > looping around waiting for spinlocks.  I said "D'oh!" and added a
> > "pause" instruction on the execution path, but that didn't make things
> > better; on Linux, we replaced the use of spinlocks with a slightly
> > more expensive mechanism, and (paradoxically) things improved, at least
> > on the XEONs.
> >
> > That never made sense to me, and I always suspected that something was
> > just wrong in the implementation (the "pause" instruction happened
> > out-of-line) or that we were seeing profiling artifacts.  At the time,
> > there wasn't time to explore this issue more fully, but those suspicions
> > have lingered.
> >
> > CCL generally does a lot less locking (of hash-tables, streams, ...) 
> > than it did a few years ago.  We still use spinlocks (some) on
> > non-Linux platforms, so if there was a bad interaction between spinlock
> > loops and hyperthreading it's likely still there but may not show up
> > as often.
> >
> > Other than that issue, I'm not aware of any way in which HTT is directly
> > visible to non-OS code.
> >
> >
> >
> > On Mon, 15 Nov 2010, Jon Anthony wrote:
> >
> >> Hi,
> >>
> >> I know the Wiki page for SysReq states no known issues with X86_64, but
> >> I seem to recall something passing through here about some "issue" with
> >> Intel Core i7 on Mac, or maybe generally, or am I just plain
> >> misremembering?  On a related note, does MacOS understand HTT (does it
> >> know the diff between physical and logical cores)?  Thanks for any info!
> >>
> >> /Jon
> >>
> >>
> >> _______________________________________________
> >> Openmcl-devel mailing list
> >> Openmcl-devel at clozure.com
> >> http://clozure.com/mailman/listinfo/openmcl-devel
> >>
> >>