Comments on the AMD K8/Hammer System Architecture
from Pianoman <john@deater.net>



NOTE: This article was written a looooong time ago. Since then, I have become convinced that AMD -will- provide mechanisms for an OS to learn the memory<->processor mappings, allowing the OS to be as dumb or as smart about placement as it wants to be.


Judging from the AMD Hammer presentation at the Microprocessor Forum this past week, the Hammer is a very exciting system.  If done right, it should be incredibly fast and scalable.  I'm just a little concerned that it won't be done correctly.  The reason for my skepticism is that this architecture seems to be a ccNUMA architecture masquerading as an SMP system.  *IF* the hardware is designed in such a way that it can provide hints to the OS about where certain memory is, then the whole thing will work, and work beautifully.  If the hardware hides these hints, however, then the OS will have no means of making intelligent memory placement decisions, and the end result will be degraded performance in an MP system.

Let me start my argument with an example:

I own a Fujitsu Pentium 133MMX laptop that has 256KB of external L2 cache and the Intel TX chipset.  The machine also has 80MB of SDRAM (16MB onboard, 64MB SODIMM). As any hardware junkie from 5 years ago will tell you, the TX chipset can't cache any memory above 64MB.  I thought it was just the VX chipset that had this limitation, so I went happily along using this laptop for a few years.  It took approximately 50 minutes to compile a recent 2.4 Linux kernel.  Then I stumbled onto a website that informed me of the TX 64MB restriction, and after a collective 'd'oh!', I used a little feature of Linux that allows you to turn the uncached memory into a fast ram drive to use as swap.  With only 64MB of memory, and a really fast high-priority swap device, kernel compile times dropped to approximately 30 minutes.

Kernel compiles are one of the many things that truly benefit from cached data, yet 1/5 of my RAM could never be cached. Since virtually contiguous memory doesn't have to use contiguous physical memory pages, any given memory access had a 1 in 5 chance of -always- going to slow, high-latency main memory.  The result is that the average memory access time with cache enabled was much higher than it should have been.   (Say a cache hit has a latency of 40ns, and main memory has a latency of 200ns.  If 1 out of 5 references goes to main memory, then the average access time is ((40+40+40+40+200)/5)=72ns, nearly twice what it should be.)  The real world performance increase speaks for itself.
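To make that arithmetic concrete, here is the same back-of-the-envelope model as a tiny C program.  The 40ns and 200ns figures are the illustrative guesses above, not measurements:

    /* Model: a fraction of physical pages can never be cached, so that
     * fraction of references always pays full DRAM latency.  The latency
     * numbers are the illustrative guesses from the text, not measured. */
    #include <stdio.h>

    int main(void)
    {
        double cache_ns    = 40.0;         /* assumed cache hit latency        */
        double dram_ns     = 200.0;        /* assumed main memory latency      */
        double uncacheable = 16.0 / 80.0;  /* 16MB of 80MB above the 64MB line */

        double avg = (1.0 - uncacheable) * cache_ns + uncacheable * dram_ns;
        printf("average access time: %.0f ns (vs %.0f ns fully cached)\n",
               avg, cache_ns);
        return 0;
    }

This prints 72ns, the same figure as the hand calculation above.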

Linux treated all memory on my machine as being equal, and was getting suboptimal performance.  Once I taught Linux that all memory was not equal, it was able to get optimal performance.  There is no way for me to teach Windows about the same problem on the same system.  This situation is -very- similar to the one faced on an MP Hammer system.

All memory is created equal, but some is more equal than others.

Each Hammer processor has an integrated high-performance memory controller (PC2700 (DDR333), 64 or 128 bits).  This means that accesses to local memory will have incredibly low latency (as some of the complexity and wire-length problems have been removed), and should deliver high bandwidth.  A single-processor Hammer system will be a real screamer, delivering constant performance for any physical memory reference.

However, now add a second processor, with its own local memory.  AMD has stated in their presentation that the "Software view of memory is SMP" (slide 44).  If this means that from the software point of view all memory is equal, then we could have a problem (that is, IF the latency of going to remote memory is much higher, say twice as high, than going to local memory).  Since the software (and OS) will see all memory as equal, it will happily allocate physical memory from -either- physical address pool.  Thus, as in the case of my little old uncached-RAM notebook, the average memory latency will increase.  If the latency penalty for going to non-local memory is small, this isn't a big deal.  But if the penalty is even 1.5x the latency of local memory, overall memory access latency will increase by 1.25x on a 2-processor system.  This will result in lower overall memory performance from an MP machine than from a uniprocessor.  And, given that memory performance is the -real- bottleneck in today's systems, it does no good to have a second processor, no matter how fast, if it sits idle waiting for memory accesses most of the time.  The situation becomes worse as you add more processors (i.e., 4 processors, 3/4 chance of having to go to remote memory; 8 processors, 7/8 chance; etc.).
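Here is that scaling argument as a small C model.  The latency figures are assumptions for illustration (150ns remote matches the 1.5x penalty case above), not AMD's numbers:

    #include <stdio.h>

    /* Average latency on an N-way NUMA box where the OS scatters pages
     * uniformly: 1/N of references hit local memory, (N-1)/N go remote. */
    double avg_latency(int nprocs, double local_ns, double remote_ns)
    {
        double remote_frac = (double)(nprocs - 1) / nprocs;
        return (1.0 - remote_frac) * local_ns + remote_frac * remote_ns;
    }

    int main(void)
    {
        int n;
        for (n = 1; n <= 8; n *= 2)   /* 1, 2, 4, 8 processors */
            printf("%d-way: %.1f ns average\n",
                   n, avg_latency(n, 100.0, 150.0));
        return 0;
    }

With these assumed numbers it prints 100.0ns for 1-way, 125.0ns for 2-way (the 1.25x case above), 137.5ns for 4-way, and 143.8ns for 8-way: the more processors you add, the closer every reference gets to paying full remote latency.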

This isn't a new problem by any stretch of the imagination.  This is the same problem that's been faced by supercomputer manufacturers for decades.  The difference is that companies like SGI, Cray, and Sun designed the hardware so that it could provide hints to the OS on how to handle memory management intelligently, and then they designed the OS to take advantage of these hints. For example, if the OS knows that you have a single-threaded process, it will try to keep all the RAM that process allocates in the physical RAM local to the processor it's running on.  This allows that process to always have the fastest memory accesses possible, only going "off chip" when absolutely necessary.  If the OS were not aware of the memory layout of the machine, that process's allocated memory could be spread throughout the whole system, causing unnecessary contention on the HT network and higher memory latencies. In this sense, each processor's local memory can almost be thought of as a "level 3" cache.  It's a much more complex OS design, and very difficult to get right, but it does allow for maximum performance in a lot of cases.
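As a concrete illustration of the kind of interface I'm hoping for, here is what node-local allocation can look like from user space on a NUMA-aware Linux box, using the libnuma library.  Take this as a sketch of the idea, not anything AMD has committed to for Hammer:

    /* Sketch of NUMA-aware allocation: pin a buffer to the memory node
     * local to the CPU we're running on.  Uses the Linux libnuma
     * interface (link with -lnuma); the calls here are illustrative of
     * the general approach, not an AMD-specified mechanism.            */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>      /* sched_getcpu()                   */
    #include <numa.h>       /* numa_alloc_onnode(), numa_free() */

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int cpu  = sched_getcpu();         /* which processor are we on?   */
        int node = numa_node_of_cpu(cpu);  /* which memory is local to it? */

        /* Allocate 16MB from that node's local RAM, not a global pool. */
        size_t len = 16 * 1024 * 1024;
        char *buf = numa_alloc_onnode(len, node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }

        printf("cpu %d: %zu bytes allocated on local node %d\n",
               cpu, len, node);
        numa_free(buf, len);
        return 0;
    }

The point is simply that none of this is possible unless the hardware exposes the cpu<->node mapping to the OS in the first place.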

For this reason I implore AMD to provide some sort of mechanism to allow OS developers to take full advantage of the NUMA aspects of their new system design.  Those OSes that choose to take advantage of it can, and those that don't will still work fine.

Even if AMD decides not to provide hints to the OS, there are still a few things they can implement (and quite likely have, for this very reason) that will help alleviate the problem.

Large, correctly managed level 2 caches mean that there are fewer memory references to begin with.  Applications can work with large datasets without even going to local memory.  AMD appears to be spending a lot of time on just this in the Hammer, and I have heard rumors of 1MB of level 2 cache on each Hammer.

As for minimizing the difference in latency between local and remote address reads, AMD has stated numbers of ~140ns for a remote read on an -unloaded- HT bus in an 8-processor system.  This probably includes cache coherency transactions as well.  This is very impressive, but I would like to see some numbers backing it up.  I'd especially like to see numbers on a -loaded- HT network.  AMD dismisses this, saying that latency shouldn't increase much with load because of the large amount of bandwidth available with HT.  In principle, I agree, but I would like to see some hard numbers.

Let's do some calculations.  The lowest-latency SDRAM controllers I've seen run at approximately ~150ns.  Given the increased speed of PC2700 SDRAM and an interleaved 128-bit path to it, average memory latency should be somewhere around 100ns (OK, so this is a complete guess on my part, so take it with a grain of salt).  On a 4-processor system, if remote memory reads take 140ns as AMD indicated, then the average memory latency of a system with physical pages spread throughout the entire system would be ((100+140+140+140)/4)=130ns.  Remember, these numbers are theoretical, but they illustrate the point that even a 40ns speed difference can lead to real performance loss.
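The same arithmetic as a runnable check (both latency figures are the guesses above):

    #include <stdio.h>

    int main(void)
    {
        double local_ns  = 100.0;  /* my guessed local read latency       */
        double remote_ns = 140.0;  /* AMD's quoted unloaded remote read   */
        int    nprocs    = 4;

        /* Pages scattered uniformly: 1 local read per 3 remote reads. */
        double avg = (local_ns + (nprocs - 1) * remote_ns) / nprocs;
        printf("4-way average: %.0f ns (%.2fx local)\n", avg, avg / local_ns);
        return 0;
    }

That's a 1.30x latency penalty before any HT contention enters the picture.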

If the cost of going to remote memory on MP systems becomes too great, it may almost be worthwhile to run a different instance of the OS on each processor, and use the HT network as a high-bandwidth communication mechanism between them.  This solves the remote memory lag problem, and could still allow an increase in throughput for some applications.

Comparing this NUMA design to traditional SMP

The reason this problem has never popped up in the ia32 world before is that on true SMP machines, such as Intel's ia32 implementations, each processor has equal access to a global pool of memory sitting off the processor bus (or, in the case of the AMD 760MP, each processor has equal, DIRECT access).  If you wanted to, you could turn the Hammer architecture into just such a machine by making a "HT to memory controller" bridge, without a CPU, and putting that on the HT network.  Then all memory accesses would be "remote", all would have the same (larger) latency, and OSes would work the same as they do today.  But you lose any chance of using locality to your benefit.

Conclusions

I really do think that the Hammer system architecture is a good, forward-looking design that has a lot of promise.  It is years ahead of similar Intel systems, and has many, many, many excellent design decisions behind it.  It will be the next system that I buy.  I just hope that AMD does add some way for the hardware to tell software about the physical layout of memory so that OS designers can take full advantage of it.  

I welcome any and all comments on my theories.  Please feel free to email me with anything you may wish to talk about, especially if you have corrections or additions.  I'm prone to being very wrong about things sometimes.  I can be reached at john@deater.net .

Other random thoughts:


Gee, isn't this almost the exact same design as the Alpha EV7?  Didn't that have an integrated northbridge and crossbar communications hub?  Long live the AXP.. (not the AthlonXP, although, that too..)

So now we have an open processor bus.. how long until someone builds a PS2 processor with a HT interface to plug into your system bus?  Giving it equal access to memory alongside the CPUs would make multi-architecture systems possible (don't laugh, this already happens on the PCI bus... that GigE card you have probably has 2xMIPS or 2xARM cores in it, churning away on packets and quietly placing them into your memory..)

Hmm.. an OS thread on each processor....

For more info about me, you can check out my poorly maintained homepage here.

Hmm.. I'm probably about to be unemployed; anybody need someone with a mind for system design?  Resume here.