<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body text="#000000" bgcolor="#ffffff">
Thank you very much for all the specific information<br>
about the current Intel processor product line!<br>
<br>
I don't really think that what I was saying is "wrong".<br>
It's just that some of what I said is or could be<br>
true in general, even if it's not true of current<br>
Intel processors. The cache line size COULD<br>
be different for an L2 and an L3, even if<br>
the current Intel processors don't work<br>
that way.<br>
<br>
As you say, the L3 has to snoop the L2.<br>
So, if you arrange for core 1 to use certain<br>
parts of memory, and core 2 to use other<br>
non-overlapping parts of memory, just<br>
as a NUMA machine is programmed to do<br>
in order to be fast, then it's all fast.<br>
But if you share memory, and core 1<br>
has to reach over to core 2's L2 cache,<br>
it is slower, just as in a NUMA architecture.<br>
NUMA means non-uniform. The Intel<br>
architecture you describe is non-uniform.<br>
<br>
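Just to make that concrete, here is a rough C sketch (mine,<br>
not anything from your mail): two threads, each bumping its<br>
own counter. With the padding in place, the two counters land<br>
on different 64-byte lines (the line size you quote below);<br>
delete the padding and they share one line, which then<br>
ping-pongs between the two cores' caches. The iteration count<br>
and layout are just assumptions for illustration.<br>
<pre>
/* Sketch: per-core counters on separate cache lines vs. one shared line. */
#include &lt;pthread.h>
#include &lt;stdio.h>

#define LINE  64                 /* cache line size, per the numbers below */
#define ITERS 100000000UL

struct counters {
    volatile unsigned long a;
    char pad[LINE - sizeof(unsigned long)];  /* remove to share one line */
    volatile unsigned long b;
};

static struct counters c;

static void *bump_a(void *unused) {
    (void)unused;
    for (unsigned long i = 0; i &lt; ITERS; i++) c.a++;
    return NULL;
}

static void *bump_b(void *unused) {
    (void)unused;
    for (unsigned long i = 0; i &lt; ITERS; i++) c.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&amp;t1, NULL, bump_a, NULL);
    pthread_create(&amp;t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%lu b=%lu\n", c.a, c.b);
    return 0;
}
</pre>
Build with "cc -O2 -pthread" and time it both ways; the padded<br>
layout is the moral equivalent of giving each core its own<br>
non-overlapping part of memory.<br>
<br>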
Now, if you take NUMA to mean not just<br>
non-uniform, but a specific class of<br>
architectures that are characterized<br>
as NUMA just as shorthand because<br>
they have other common attributes<br>
besides being non-uniform, then, OK,<br>
I'm not using the term the same way you are.<br>
<br>
I don't see how I am wrong here. It may<br>
be that what you are saying is that the<br>
performance consequences of hitting<br>
the other L2 cache are so small that<br>
nobody should worry about them.<br>
Maybe. It depends, I suppose, on<br>
how many cycles it takes to reference<br>
your own L2 cache when you know that<br>
it's valid, versus how much time it takes<br>
to reach out to the L3 cache and have it<br>
snoop the other L2 cache to get your<br>
data. And, most important, whether<br>
this happens frequently ("inside inner<br>
loops").<br>
<br>
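As a back-of-the-envelope illustration, using the Nehalem<br>
numbers you give below (roughly 10 cycles for a hit in your<br>
own L2, roughly 41 cycles for an L3 hit that has to snoop the<br>
other L2) plus an assumed 5% cross-core fraction, which is<br>
purely made up:<br>
<pre>
/* Back-of-the-envelope average load latency; the 5% fraction is assumed. */
#include &lt;stdio.h>

int main(void) {
    double own_l2   = 10.0;   /* cycles: hit in your own L2 (quoted below) */
    double l3_snoop = 41.0;   /* cycles: L3 hit that snoops the other L2   */
    double cross    = 0.05;   /* assumed fraction of loads served that way */

    double avg = (1.0 - cross) * own_l2 + cross * l3_snoop;
    printf("average load latency: %.1f cycles (vs %.1f if all local)\n",
           avg, own_l2);
    return 0;
}
</pre>
On those made-up numbers the average load only rises from 10<br>
to about 11.6 cycles, which is why the "inside inner loops"<br>
question matters more than the raw latencies.<br>
<br>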
So I certainly concede that I do not<br>
know what the overall performance effect<br>
will be on any particular benchmark.<br>
They're probably all different.<br>
<br>
But I think what I said is right in concept.<br>
<br>
Again, thanks for the data and your analysis!<br>
<br>
-- Dan<br>
<br>
Lawrence E. Bakst wrote:
<blockquote type="cite" cite="mid:p0624081ac9095936be2c@%5B192.168.0.7%5D">
<pre wrap="">At 5:01 PM -0500 11/15/10, Daniel Weinreb wrote:
</pre>
<blockquote type="cite">
<pre wrap="">I agree with what your saying, and will even amplify it.
...
</pre>
</blockquote>
<pre wrap=""><!---->
</pre>
<blockquote type="cite">
<pre wrap="">Dave Moon said to me serveral years ago that
the entire concept of looking at main memory
as if it were an addressable array of words
is entirely out of date if you're looking for
high performance (in certain situations).
You must think of it as a sequence of cache
lines. And it gets more complicated once
you're dealing with both L2 and L3 caches,
which have different line sizes, and different
sets of cores accessing them. When you have
an L3 cache, you really have a NUMA architecture
and if you want speed, you have to write your
code accordingly, i.e., a core should not read
and write data from some other L2 cache
than its own and expect that to be fast.
</pre>
</blockquote>
<pre wrap=""><!---->
Moon is right, and your overall point is a good one, but what you are saying about cores reading from other L2 caches, differing cache line sizes, and NUMA is mostly wrong.
First, it really depends on the processor. Most modern Intel X86 processors have a 64 KB L1 cache, split into 32 KB for data and 32 KB for instructions. Hit latency is 3 cycles for Core 2 and 4 cycles for Nehalem. For Intel, an aligned block of memory residing in L1 cache is almost as fast as a register.
1. A Core 2 Duo has a large *shared* L2 cache (my T7800/MBP has 4 MB) and no L3. In this case it doesn't make any difference at all which core accesses the L2.
A Nehalem/Core i7 processor has a small (256 KB) per-processor L2 and a large (~ 8 MB) inclusive shared L3. An L1 hit is 4 cycles, an L2 hit is 10 cycles, and an L3 hit is ~41 cycles. A complete miss is 100s->1000s->10000s (ouch) of cycles depending upon the state of the memory controller and RAMs at the time of the miss. On this processor an L3 hit has to snoop L2 for the data, but at least it has valid bits for each L2 now. Next gen Intel (Sandy Bridge) will have L3 hit latency closer to ~ 25 cycles, which should really improve performance for L3 limited codes.
2. An L2 hit latency of 10 cycles compared to an L3 hit latency of about 40 cycles. A 4:1 difference. Next gen it will be closer to 2.5:1. I would not worry about data not fitting in a Nehalem L2, it's too small anyway. Nice if you can do it for small data structures, but not required.
After considering L1, one way to think about L2/L3 caches is that you probably have either a large L2 or a large L3, but probably not both. It's important to try and make the rest of your data fit in the large cache, either L2 or L3. Being smart about data alignment and sizes can really increase the space utilization of your cache and increase your hit ratio. If you fit in the large L2/L3 then let the system worry about moving data from L3 to L1/L2. Missing L3 is awful. Missing a small L2, not that bad.
3. I believe the L1/L2/L3 cache line sizes are the same for modern Intel architectures. The cache line size is 64 bytes.
4. I don't really think that having an L3 cache qualifies for NUMA status. That term is more appropriate for compute clusters running memory traffic across links and not on-chip interconnects.
However, it is true that "blocking" data to fit and align to cache lines is critical to performance, and that has been true since at least the 1980s. There are many other issues as well, such as TLB coverage. Intel Vtune is your friend here.
I have seen MIT CS PhDs struggle over optimizing instructions to speed up a CODEC when in reality the problem was the memory system and caches. Everyone wants to optimize instructions, and in my experience almost no one thinks about designing a data architecture that works well with caches.
I worked at Alliant back in the 1980s, where we built the world's first mini-supercomputer. I would say at least 50% of the analysts' time (their job was to optimize customers' code) was spent blocking data for the caches.
Best,
leb
</pre>
</blockquote>
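<br>
P.S. For anyone on the list who hasn't run into "blocking"<br>
before, here is its usual shape: a matrix multiply tiled so<br>
that each block is reused while it is still sitting in cache.<br>
N and the 64-element tile edge are arbitrary choices of mine,<br>
not numbers from the mail above; the right tile size depends<br>
on which cache (the big L2 or the L3) you are trying to fit.<br>
<pre>
/* Sketch of cache blocking: a tiled matrix multiply; sizes are arbitrary. */
#include &lt;stdio.h>

#define N 512
#define B 64                        /* tile edge -- tune it to your cache */

static double a[N][N], b[N][N], c[N][N];

int main(void) {
    for (int i = 0; i &lt; N; i++)
        for (int j = 0; j &lt; N; j++) { a[i][j] = 1.0; b[i][j] = 2.0; }

    /* walk the matrices one tile at a time, then the elements of each tile */
    for (int ii = 0; ii &lt; N; ii += B)
        for (int kk = 0; kk &lt; N; kk += B)
            for (int jj = 0; jj &lt; N; jj += B)
                for (int i = ii; i &lt; ii + B; i++)
                    for (int k = kk; k &lt; kk + B; k++)
                        for (int j = jj; j &lt; jj + B; j++)
                            c[i][j] += a[i][k] * b[k][j];

    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}
</pre>
The untiled version streams through all of b for every row of<br>
a; the tiled version keeps a B-by-B working set hot instead.<br>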
</body>
</html>