[Openmcl-devel] Quick HW question...

Wed Nov 17 11:24:27 PST 2010

Thank you very much for all the specific information
about the current Intel processor product line!

I don't really think that what I was saying is "wrong".
It's just that some of what I said is or could be
true in general, even if it's not true of current
Intel processors.  The cache line size COULD
be different for an L2 and an L3, even if
the current Intel processors don't work
that way.

As you say, the L3 has to snoop the L2.
So, if you arrange for core 1 to use certain
parts of memory, and core 2 to use other
non-overlapping parts of memory, just
as a NUMA machine is programmed to do
in order to be fast, then it's all fast.
But if you share memory, and core 1
has to reach over to core 2's L2 cache,
it is slower, just as in a NUMA architecture.
NUMA means non-uniform.  The Intel
architecture you describe is non-uniform.
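
As a rough illustration of keeping each core on its own
non-overlapping memory, the Python sketch below works out which
cache line each core's counter would fall on. The 64-byte line
size is the figure given in the quoted message below; the helper
names, the base address, and the 8-byte counter size are
assumptions made up for this example.

```python
LINE_SIZE = 64  # bytes; assumed, per the figure quoted for modern Intel parts


def line_index(addr, line_size=LINE_SIZE):
    """Return the index of the cache line containing byte address addr."""
    return addr // line_size


def counter_lines(base, n_cores, stride, line_size=LINE_SIZE):
    """Cache line holding each core's counter when counters sit stride bytes apart."""
    return [line_index(base + core * stride, line_size) for core in range(n_cores)]


# Hypothetical layout 1: four 8-byte counters packed back to back.
packed = counter_lines(base=0x1000, n_cores=4, stride=8)
# Hypothetical layout 2: each counter padded out to a full line.
padded = counter_lines(base=0x1000, n_cores=4, stride=LINE_SIZE)

print("packed:", packed)  # every core writes to the same line
print("padded:", padded)  # every core has a line to itself
```

In the packed layout all four cores write to one line, so that
line has to move between caches even though no counter is
logically shared. In the padded layout it does not.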

Now, if you take NUMA to mean not just
non-uniform, but a specific class of
architectures that are characterized
as NUMA just as shorthand because
they have other common attributes
besides being non-uniform, then, OK,
I'm not using the term the same way you are.

I don't see how I am wrong here.  It may
be that what you are saying is that the
performance consequences of hitting
the other L2 cache are so small that
nobody should worry about them.
Maybe.  It depends, I suppose, on
how many cycles it takes to reference
your own L2 cache when you know that
it's valid, versus how much time it takes
to reach out to the L3 cache and have it
snoop the other L2 cache to get your
data.  And, most important, whether
this happens frequently ("inside inner
loops").
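
To put rough numbers on that comparison, here is a
back-of-the-envelope sketch in Python. It uses the 10-cycle L2
hit figure from the quoted message below. The cost of a reference
that goes through the L3 and snoops the other core's L2 is not
given anywhere in this thread, so remote_cycles is purely an
assumed parameter (the 41-cycle L3 hit figure is used as a
plausible floor), and average_cycles is a hypothetical helper.

```python
def average_cycles(shared_fraction, own_l2_cycles=10, remote_cycles=41):
    """Expected cycles per reference at the L2 level.

    shared_fraction: fraction of references (0.0 to 1.0) that must be
                     satisfied by the L3 snooping another core's L2.
    own_l2_cycles:   latency of a hit in the core's own L2 (from the thread).
    remote_cycles:   ASSUMED latency of the snooped path; not a measurement.
    """
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be between 0 and 1")
    return (1.0 - shared_fraction) * own_l2_cycles + shared_fraction * remote_cycles


for f in (0.0, 0.01, 0.1, 0.5):
    print(f"{f:5.2f} of references remote -> {average_cycles(f):5.1f} cycles on average")
```

Under these assumed numbers a 1% remote rate barely moves the
average, while a 50% rate more than doubles it, which is why the
frequency inside inner loops matters more than the per-reference
penalty.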

So I certainly concede that I do not
know what the overall performance effect
will be on any particular benchmark.
They're probably all different.

But I think what I said is right in concept.

Again, thanks for the data and your analysis!

-- Dan

Lawrence E. Bakst wrote:
> At 5:01 PM -0500 11/15/10, Daniel Weinreb wrote:
>   
>> I agree with what you're saying, and will even amplify it.
>> ...
>>     
>
>
>   
>> Dave Moon said to me several years ago that
>> the entire concept of looking at main memory
>> as if it were an addressable array of words
>> is entirely out of date if you're looking for
>> high performance (in certain situations).
>> You must think of it as a sequence of cache
>> lines.  And it gets more complicated once
>> you're dealing with both L2 and L3 caches,
>> which have different line sizes, and different
>> sets of cores accessing them.  When you have
>> a L3 cache, you really have a NUMA architecture
>> and if you want speed, you have to write your
>> code accordingly, i.e., a core should not read
>> and write data from some other L2 cache
>> than its own and expect that to be fast.
>>     
>
> Moon is right, and your overall point is a good one, but what you are saying about cores reading from other L2 caches, differing cache line sizes, and NUMA is mostly wrong.
>
> First it really depends on the processor. Most modern Intel X86 processors have a 64 KB L1 cache, split 32 KB of data and 32 KB for instructions. Hit latency is 3 cycles for Core 2 and 4 cycles for Nehalem. For Intel, an aligned  block of memory residing in L1 cache is almost as fast as a register.
>
> 1. A Core 2 Duo has a large *shared* L2 (my T7800/MBP has 4 MB) cache and no L3. In this case it doesn't make any difference at all which core accesses the L2.
>
> A Nehalem/Core i7 processor has a small (256 KB) per-processor L2 and a large (~ 8 MB) inclusive shared L3. An L1 hit is 4 cycles, an L2 hit is 10 cycles, and an L3 hit is ~41 cycles. A complete miss is 100s->1000s->10000s (ouch) of cycles depending upon the state of the memory controller and RAMs at the time of the miss. On this processor an L3 hit has to snoop L2 for the data, but at least it has valid bits for each L2 now. Next gen Intel (Sandy Bridge) will have L3 hit latency closer to ~ 25 cycles, which should really improve performance for L3 limited codes.
>
> 2. An L2 hit latency of 10 cycles compared to an L3 hit latency of about 40 cycles is a 4:1 difference. Next gen it will be closer to 2.5:1. I would not worry about data not fitting in a Nehalem L2; it's too small anyway. Nice if you can do it for small data structures, but not required.
>
> After considering L1, one way to think about L2/L3 caches is that you probably have either a large L2 or a large L3, but probably not both. It's important to try to make the rest of your data fit in the large cache, either L2 or L3. Being smart about data alignment and sizes can really increase the space utilization of your cache and increase your hit ratio. If you fit in the large L2/L3 then let the system worry about moving data from L3 to L1/L2. Missing L3 is awful. Missing a small L2, not that bad.
>
> 3. I believe the L1/L2/L3 cache line sizes are the same for modern Intel architectures. The cache line size is 64 bytes.
>
> 4. I don't really think that having an L3 cache qualifies for NUMA status. That term is more appropriate for compute clusters running memory traffic across links and not on-chip interconnects.
>
> However, it is true that "blocking" data to fit and align to cache lines is critical to performance, and that has been true since at least the 1980s. There are many other issues as well, such as TLB coverage. Intel VTune is your friend here.
>
> I have seen MIT CS PhDs struggle over optimizing instructions to speed up a CODEC when in reality the problem was the memory system and caches. Everyone wants to optimize instructions, and in my experience almost no one thinks about designing a data architecture that works well with caches.
>
> I worked at Alliant back in the 1980s, where we built the world's first mini-supercomputer. I would say at least 50% of the analysts' time (their job was to optimize customers' code) was spent blocking data for the caches.
>
> Best,
>
> leb
>   
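
The "blocking" that the quoted message describes can be sketched
as a tiled loop nest. The Python example below transposes a
square row-major matrix one tile at a time, so that each tile's
source rows and destination rows are both revisited while they
would still be cached. The function names are hypothetical, and
the default tile edge of 8 is an assumption (a 64-byte line
divided by 8-byte elements). Python lists hold references rather
than packed values, so this shows only the loop structure, not a
measurable speedup.

```python
def transpose_naive(a, n):
    """Transpose an n x n row-major matrix held in a flat list."""
    out = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]  # writes stride through out by n
    return out


def transpose_blocked(a, n, block=8):
    """Same result as transpose_naive, but visiting block x block tiles."""
    out = [0] * (n * n)
    for i0 in range(0, n, block):          # tile row origin
        for j0 in range(0, n, block):      # tile column origin
            for i in range(i0, min(i0 + block, n)):
                for j in range(j0, min(j0 + block, n)):
                    out[j * n + i] = a[i * n + j]
    return out


n = 5
matrix = list(range(n * n))
print(transpose_blocked(matrix, n, block=2) == transpose_naive(matrix, n))
```

In a language with packed arrays the same tiling keeps the working
set of each tile within a handful of cache lines, instead of
touching a new line on every write as the naive version does.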