Tuesday, March 20, 2007

Must We Multi-thread? (Part 2)

In the past decade, memory access has been the bottleneck in CPU-bound computing. With the x86 family capturing the server market after the desktop market, we are stuck with CPUs that have small register banks (very small, in fact). You get 8 general-purpose registers on x86 (or 6, depending on what you count as general-purpose) and 16 on x86-64. Compare this to the 32-128 GPRs you get on RISC machines. Even Itanium has a whopping 128 registers! And the number of floating-point registers is shamefully low as well: either 8 or 16 (or 24?) depending on the instruction set you use.

All this means we still need a lot of memory access, and memory buses and chips are not advancing as fast as CPU thread counts and clock rates are. You may hear of 800 or 1066 MHz memory bus frequencies, or even 1600 and 2000 (for HyperTransport), but even then, the bandwidth is not enough. For a 128-bit-wide, 1066 MHz memory bus (that's an Extreme Edition CPU with dual-channel RAM!) you get a theoretical maximum throughput of about 17 GB per second. (If you reach half that in best-case real-world scenarios, you're luckier than most!)

If you have a single, plain 3 GHz CPU, and only your CPU accesses the memory (which is absolutely not the case), you'll have 5.7 bytes of memory bandwidth available to each executed instruction. That is, 5.7 bytes for the instruction and any data it might need. When your average instruction length is about 3 to 4 bytes (this is only an uneducated guess; does anyone have more dependable data?) (your instructions can be as long as 17 bytes on x86), you can only read/write one word of data from/to memory every other instruction; otherwise your CPU will be stalled. This is a ridiculously low number.

But that's only a very simplified model. There are many other factors that I did not consider, for example:

(The factors that will worsen the situation)
  • Memories never perform to their theoretical maximums. The real number is a lot lower.
  • I completely ignored the effects of memory latency.
  • Memories and memory subsystems have many kinds of delays and stalls when accessed randomly (or even sequentially) or for mixed read/write access.
  • Other systems in the computer can access memory and block the CPU's access (DMA, video card, ...)
  • Your average instruction length can be longer than my guess above. A simple instruction with a memory address in it will likely be at least 6 bytes long and access at least 4 bytes of data. That's 10 bytes of non-sequential memory access in a typical instruction.
  • CPUs (try to) issue more than one instruction per clock cycle.
  • Branch prediction (when it fails) will mean executing more instructions than are necessary.
  • Some others that I don't know about.
(The factors that will help the situation)
  • There is caching in the memory subsystem.
That's it! The only fighter against the disaster is the cache. And a lot of research goes into caching at all levels. In the above system, we had a single, very typical CPU with (almost) high-end memory, and even there we ran into severe, severe memory throughput limitations.

That's why every programmer needs to be aware of caching and design her algorithms/data structures/programs to be cache-friendly. Let me ask you this: when was the last time you stopped and asked yourself, "Self! What pattern of cache use is this function you've just written going to have? What's the effect of that random access into a 200K-element array in the deepest place of the innermost loop on the CPU cache?"

And that's just a single CPU. Throw in another CPU or three, and watch the memory access slow to a crawl because of all the cache misses. Of course, the cores are going to have separate caches with very good caching and cache-coherency algorithms, so the situation is not that desperate. But it's bad enough. Especially if you work in performance-sensitive areas of programming, like (I hope) I do.
