[Openmcl-devel] Quick HW question...

Wed Nov 17 03:35:24 PST 2010

At 5:01 PM -0500 11/15/10, Daniel Weinreb wrote:
>I agree with what your saying, and will even amplify it.
>...

>Dave Moon said to me serveral years ago that
>the entire concept of looking at main memory
>as if it were an addressible array of words
>is entirely out of date if you're looking for
>high performance (in certain situations).
>You must think of it as a sequence of cache
>lines.  And it gets more complicated once
>you're dealing with both L2 and L3 caches,
>which have different line sizes, and different
>sets of cores accessing them.  When you have
>a L3 cache, you really have a NUMA architecture
>and if you want speed, you have to write your
>code accordingly, i.e., a core should not read
>and write data from some other L2 cache
>than its own and expect that to be fast.

Moon is right, and your overall point is a good one, but what you are saying about cores reading from other L2 caches, differing cache line sizes, and NUMA is mostly wrong.

First it really depends on the processor. Most modern Intel X86 processors have a 64 KB L1 cache, split 32 KB of data and 32 KB for instructions. Hit latency is 3 cycles for Core 2 and 4 cycles for Nehalem. For Intel, an aligned  block of memory residing in L1 cache is almost as fast as a register.

1. A Core 2 Duo has a large *shared* L2 (my T7800/MBP has 4 MB) cache and no L3. In this case it doesn't make any difference at all which core accesses the L2.

A Nehalem/Core i7 processor has a small (256 KB) per-processor L2 and a large (~ 8 MB) inclusive shared L3. An L1 hit is 4 cycles, an L2 hit is 10 cycles, and an L3 hit is ~41 cycles. A complete miss is 100s->1000s->10000s (ouch) of cycles depending upon the state of the memory controller and RAMs at the time of the miss. On this processor an L3 hit has to snoop L2 for the data, but at least it has valid bits for each L2 now. Next gen Intel (Sandy Bridge) will have L3 hit latency closer to ~ 25 cycles, which should really improve performance for L3 limited codes.

2. An L2 hit latency of 10 cycles compared to an L3 hit latency of about 40 cycles. A 4:1 difference. Next gen it will be closer to 2.5:1. I would not worry about data not fitting in a Nehalem L2, it's too small anyway. Nice if you can do it for small data structures, but not required.

After considering L1, one way to think about L2/L3 caches is that you probably have either a large L2 or large L3, but probably not both. It's important to try an make the rest of your data fit in the large cache, either L2 or L3. Being smart about data alignment and sizes can really increase the space utilization of your cache and increase your hit ratio. If you fit in the large L2/L3 then let the system worry about moving data from L3 to L1/L2. Missing L3 is awful. Missing a small L2, not that bad.

3. I believe the L1/L2/L3 cache line sizes are the same for modern Intel architectures. The cache line size is 64 bytes.

4. I don't really think that having an L3 cache qualifies for NUMA status. That's term is more appropriate for compute clusters running memory traffic across links and not on-chip interconnects.

However, it is true that "blocking" data to fit and align to cache lines is critical to performance and that has been true since at least 1980's. There are other many other issues as well, such as TLB coverage. Intel Vtune is your friend here.

I have seen MIT CS PhD's struggle over optimizing instructions to speed up a CODEC when in reality the problem was really the memory system and caches. Everyone wants to optimize instructions and in my experience almost no one thinks about designing a data architecture that works well with caches.

I worked at Alliant back in the 1980's where we built the world's first mini-supercomputer. I would say at least 50% of the analysts (their job was to optimize customers code) time was spent blocking data for the caches.

Best,

leb
-- 
leb at iridescent.org