More Comments on AMD's Hammer.
by Pianoman <john@deater_net>


The Opteron is here, and it is fast... and then some.  Faster than an Itanium 2 in some cases, and faster than the Xeon at almost everything, even with hyperthreading. AMD has a real winner on its hands, and yet not many people seem to realise that so far we've just seen the tip of the iceberg performance-wise.

So what makes the Opteron so good, anyway?
Let's go over the areas where the Opteron has improved performance over a traditional ia32 chip. The first category is hardware and architectural improvements that apply in every mode the processor runs in. These improvements do not require any special compiler or OS support.  They are:
If you bought an Opteron today and ran any current version of Windows and any current program on it, you would see the benefits of all of these improvements. And these combine for some very impressive performance increases across the board. A "souped up Athlon". Unfortunately, in some reviews, these are all the improvements you'll see... but I'm getting ahead of myself. More on that later.

The second category of improvements revolves around the x86-64 extensions that the Opteron brings to the table. In order to experience these improvements, you'll need a new OS and a binary recompiled for x86-64. These include such things as:
This is the real performance goldmine, but you'll need to be running Linux/x86-64 and using GCC (or another x86-64 capable compiler) in order to take full advantage of it.

The final category of improvements is in the system architecture.  That would be:
Any current OS will be able to use the memory on an Opteron system (up to 4GB; beyond that, I'm assuming, it will have to fall back to PAE), but a careful OS would be able to glean much more performance by controlling memory placement (see my first comments on the Hammer here).

Ok, ok, so do the second and third categories really matter?
It's fairly obvious that the first set of improvements is a major factor in why the Opteron is performing so well. So how much of a difference will the second and third categories really make? Well, let me highlight some of the important ones with regard to performance.

64-bit addressing/64-bit datapath. This is actually a two-edged sword. On an application level, you get a tremendous advantage when dealing with large data sets, or when dealing with larger-than-32-bit numbers.  Most programs don't fall into this category, however.  The x86-64 architecture is actually rather nice in that if you don't -need- 64-bit data you can use the old 32-bit instructions just fine and only use 64-bit ones as necessary. However, a pointer is always going to be 64 bits in x86-64 mode, so some applications might take a slight performance hit because of the extra strain on the memory bus caused by loading 64 bits for every pointer instead of 32. This is one reason why AMD gave the Opteron such a fast, wide memory bus.
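To put a rough number on the pointer-size cost, here is a minimal sketch in Python. The node layout (one 4-byte value followed by two pointers) and the helper names are hypothetical, and the padding rule simply assumes pointers are aligned to their own size, as typical C ABIs do. It is an illustration, not a measurement of any real program.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def node_size(pointer_bytes: int, payload_bytes: int = 4, n_pointers: int = 2) -> int:
    """Bytes used by a hypothetical tree node: one payload, then n_pointers pointers.

    Assumption: each pointer is aligned to its own size, and the whole
    node is padded out to that same alignment.
    """
    offset = align_up(payload_bytes, pointer_bytes)  # padding before the first pointer
    offset += n_pointers * pointer_bytes             # the pointers themselves
    return align_up(offset, pointer_bytes)           # tail padding


size32 = node_size(pointer_bytes=4)  # 32-bit mode
size64 = node_size(pointer_bytes=8)  # x86-64 mode
print(f"32-bit node: {size32} bytes, 64-bit node: {size64} bytes")
```

Under these assumptions a pointer-heavy structure doubles in size, which means twice the memory traffic to walk it. Structures that are mostly plain data grow far less.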

From an OS point of view, however, this is incredibly useful. A full explanation of VM, page tables, and Intel's PAE hack to support more than 4GB at a time is beyond the scope of this article; suffice it to say that the Opteron can support up to several terabytes of memory directly, and can access all of it much more quickly than Intel's 32-bit chips. This makes every memory reference faster.
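The address-space arithmetic behind this is easy to check. The sketch below is only that arithmetic; the 36-bit figure is PAE's physical address width, while the 40-bit physical and 48-bit virtual widths for the first x86-64 parts are my assumption about the initial implementation and are not stated anywhere above.

```python
GIB = 2 ** 30
TIB = 2 ** 40


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with address_bits bits."""
    return 2 ** address_bits


widths = [
    ("32-bit flat addressing", 32),
    ("PAE physical addressing", 36),
    ("x86-64 physical (assumed 40-bit)", 40),
    ("x86-64 virtual (assumed 48-bit)", 48),
]

for label, bits in widths:
    print(f"{label}: {addressable_bytes(bits) // GIB:,} GiB")
```

Note that PAE only widens physical addresses: each 32-bit process still sees a 4 GiB virtual window, so the OS has to remap memory to reach the rest.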

More registers. The main complaint about the x86 is that it only has 8 general purpose registers. The Opteron has 16. This means that every time the compiler decides that code needs to work with more than 8 values at a time, it can keep up to 8 more of them in registers instead of spilling them to the stack or main memory. I remember reading somewhere that the extra registers cause code compiled for the x86-64 to make up to 1/3 fewer memory references than traditional x86 code. Also look at this email, which shows that, for at least these compiled binaries, the number of stack manipulations required for x86-64 is far, far fewer than the number required for traditional x86.  (For example, the compiler for x86 generated 117723 'push'es to the stack, while the x86-64 code generated just 20264.) So, in short, the extra registers help keep the processor busy instead of waiting for more data.
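For scale, the two push counts quoted above work out as follows. This is plain arithmetic on those two numbers; the function name is just for illustration, and the result says nothing about binaries other than the ones in that email.

```python
X86_PUSHES = 117723     # 'push' instructions in the x86 build (figure quoted above)
X86_64_PUSHES = 20264   # 'push' instructions in the x86-64 build (figure quoted above)


def reduction(before: int, after: int) -> float:
    """Fraction by which a count dropped, e.g. 0.25 means 25% fewer."""
    return 1 - after / before


drop = reduction(X86_PUSHES, X86_64_PUSHES)
ratio = X86_PUSHES / X86_64_PUSHES
print(f"x86-64 build has {drop:.1%} fewer pushes ({ratio:.1f}x fewer)")
```

That is roughly an 83% drop in pushes for these binaries, well beyond the "1/3 fewer memory references" figure, though pushes are only one kind of memory reference.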

NUMA. If the OS is capable of being intelligent about where it places an application's memory, then overall memory bandwidth can be maximised and latency minimised for a specific program.  This requires an intelligent, modern OS. Linux supports this at least to a certain degree, but no current version of Windows does (I have heard rumors Windows Server 2003 will, but I can't verify that). Therefore, when running in an SMP environment, the average latency and bandwidth to memory under most current operating systems will be less than optimal.
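A back-of-the-envelope model shows why placement matters. Everything numeric here is hypothetical: the 100 ns local and 150 ns remote latencies are placeholders and not measured Opteron figures, and the model lumps all remote nodes together regardless of hop count.

```python
def average_latency_ns(local_ns: float, remote_ns: float, local_fraction: float) -> float:
    """Expected latency when local_fraction of accesses hit the local node's memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


LOCAL_NS = 100.0   # hypothetical latency to memory attached to this CPU
REMOTE_NS = 150.0  # hypothetical latency to memory attached to another CPU

# A NUMA-unaware OS on a 4-way box: pages land on any node, so about 1 in 4 is local.
unaware = average_latency_ns(LOCAL_NS, REMOTE_NS, local_fraction=0.25)

# A NUMA-aware OS that keeps most of a process's pages on its own node.
aware = average_latency_ns(LOCAL_NS, REMOTE_NS, local_fraction=0.90)

print(f"unaware: {unaware:.1f} ns, aware: {aware:.1f} ns")
```

The exact numbers don't matter. The point is that average latency is a weighted mix of local and remote accesses, and the OS controls the weights.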

Only when all these performance enhancements are combined will we be able to fully see the true power of the Opteron, and I believe we're in for a huge surprise when we do.

Lies, Damn Lies, and Benchmarks (for once, they're -underestimating-)
So let's take a look at some of the benchmarks people have been using to showcase the Opteron.  Let's start with AMD's own.  You can find them at AMD's page here. First, the TPC-C benchmarks. That's a -highly- impressive number for the 4-way Opteron!  It gives a 4-way Itanium 2 1.0GHz with 16 more GB of memory a run for its money, while handily beating the 4-way Xeons. Very impressive. Now look at the configuration information, and try not to laugh... yep, that's right. It's running on MS Windows 2003, and running MS SQL Server 2000, neither of which is compiled for x86-64! Considering MS won't even have a beta of Win2003 Server for x86-64 until the end of June, it has to be 32-bit. Also consider the fact that the Itanium configuration explicitly spells out "Windows 64-bit edition", while the Opteron configuration does not. This also means, since the Opteron system has 32GB of memory, that the OS is using the old, slow Intel-style PAE hack to address all of its memory instead of the new, fast, direct addressing available on the Opteron. Just imagine the speedup possible when the OS is able to directly access all the memory and the application has to make 1/3 fewer memory references. Right now, you're essentially running 4 "souped up Athlons".

So what we essentially have is an entire benchmark suite that the Opteron is kicking butt in, and it's only flexing some of its muscle. Salivating yet? Just wait, it gets better...

Now look at the MS Exchange benchmarks. It's the same story, only we know for a fact it's not a 64-bit OS because the OS is Windows 2000, and the app is Exchange 2000, which will never be compiled for x86-64.  Again, these Opterons in "souped up Athlon" mode are doing extremely well.

The SPEC scores? Yep. MS Windows 2003 Server. 32-bit OS, 32-bit apps. More scary? Read the SPEC submission PDF.  They had to use Intel's C compiler, so again, you're really looking at the SPEC scores of a processor that is using only about 2/3 of its capabilities. Given that the SPEC scores are so high, that's just downright frightening. I imagine a 20% boost in SpecFP from the extra 8 SSE2 registers alone. This also underscores a major problem with AMD's launch of the Opteron: the lack of a highly optimised compiler for 64-bit mode. But I'll get to that later.

At the bottom of AMD's page, things get very, very interesting. Recall that for every other benchmark listed on this page, the Opteron has been running in souped up Athlon mode. Also note that, while the Opterons are doing very well against Xeons, AMD carefully only shows Itanium scores from the Itanium 2 at 1GHz or 900MHz. Where are the 1.5GHz Itanium 2 scores? Well, maybe they weren't available, I don't know. But in general, the trend is "Opteron is better than the Xeon, a little worse than the Itanium 2".  But the SpecWeb99 scores change that.

For the first time, we see a 64-bit binary running on a 64-bit OS. And suddenly, the Opteron is starting to really flex its muscle, beating out even a 1.5GHz Itanium 2.  Unfortunately, the benchmark chosen is really a very poor showcase for the Opteron as a processor, as SpecWeb relies much more on the size of your memory, disk speed, and OS performance than on processing power.  In the "32-bit vs. 64-bit" slides we can see the speed advantage of 64-bit direct addressing as opposed to PAE addressing.  The machine contains 16GB of memory, so a 32-bit OS would have to use PAE mode.  As Zeus does little more than tell the OS "send this page", I doubt it got much of an advantage from being recompiled, so I'm conjecturing that almost all of the 14% boost in speed comes from the OS being able to address all 16GB of its in-memory buffer cache directly, instead of going through PAE hijinks.

So, AMD doesn't seem to be showing off many benchmarks that show the Opteron operating at its full potential. What about anyone else? Remember, the only 64-bit OS for x86-64 out there right now is Linux (and NetBSD, and FreeBSD...), so that's what you'll have to be running; no Windows platform.  This also means you're pretty much stuck with GCC, which isn't always that good of an optimising compiler.

Well, Tom's Hardware has at least attempted some 64-bit tests using Linux, but didn't indicate whether the apps they used were compiled for 32-bit or 64-bit. While I personally can't stand the site (I'm making no attempts to be unbiased here), they did at least try.

Aces Hardware also did several nice Linux tests, breaking down the x86-64 compiled ones and more.

Every other review I've read that's given performance data has been using the Opterons in 32-bit mode.

Bottom line is, we have very few pure x86-64 environments and test results... and the platform is still impressive. If only AMD could get their act together...

Where AMD Dropped the Ball
AMD has only themselves to blame for why the Opteron is, for the most part, only using part of its impressive ability. As with the P4 and the Itanium, it all comes down to compilers. Intel realised it needed good compilers to show off not only the Itanium, but also the P4 (which re-wrote every x86 optimisation in the book, so years of x86 optimisation theory had to be thrown out and re-done from scratch, not to mention SSE). So Intel went out and bought Kuck and Associates, makers of a very good optimising C and Fortran compiler, and also managed to finagle pretty much the entire Compaq compiler group when HP took them over (no, there were no back-room dealings there... naah... that never happens...). The result? Intel has a very fast compiler now. AMD, until now, has only made x86 clones, so they've just used the Intel compiler.

However, now AMD has its own architecture, which needs its own compilers to take full advantage of it. Realizing this, AMD actively helped the GCC people write an x86-64 backend. However, while GCC is a very good general purpose compiler, it is not nearly as good at high performance optimisation as Intel's compiler. Microsoft's compiler is a very good optimising compiler, but Microsoft apparently has yet to come out with full support for x86-64. The only other optimising compiler out there for the x86-64 that I know of is the Portland Group's PGICC. However, this apparently still isn't as good in 64-bit mode as Intel's is in 32-bit mode, or else one would think AMD would have submitted SPEC scores using this compiler instead of Intel's.

AMD needs to either partner with or buy up a compiler company to get a real, generally available compiler out there optimised for their architecture. As it stands right now, the only compiler generally available that can make full use of all the Opteron's features is GCC, and GCC is not capable (yet) of supporting the types of optimisations that will be needed to be competitive. AMD can no longer rely on Intel.

The only other possible explanation for the current situation is that the 64-bit code is too bulky and thus overshadows any benefit gained by the extra registers available in 64-bit mode, but I just can't fathom that as a likely scenario.

Conclusions
Intel should be scared. The Opteron is beating its flagship products pretty handily, and yet has barely started to use most of its capability. In every test I've seen where a Xeon or P4 "beat" an Opteron, the Opteron has been running in 32-bit mode, using only half its registers and 32-bit addressing. Not to mention the fact that the Opteron is only running at 1.8GHz, and should be able to scale easily to 2.2GHz, and quite probably close to 3GHz before a major redesign is needed. AMD should be able to release faster processors almost at will for the next 6 months at least.

However, it's not the hardware that AMD needs to improve. It's the software. I can't wait to see a SPEC score for an Opteron that's compiled with a decently optimising compiler for the x86-64 architecture. With that in place, AMD could, quite frankly, have the fastest CPU in the world.

A word of caution, however, to anyone who is reading reviews or sees anything about the Opteron. Always always always find out as much info as you can about what software it was running! And if it's not running in 64-bit mode, then just think of how much faster it could be.

The hardware's there, just waiting to be used.

Comments, corrections, discussion welcome.
-pm

About the Author
I could be way off base on all this, but I don't think I am. Also, writing at 4am because of a bout of insomnia doesn't help clear my head; so apologies for any mistakes, misconceptions, or misnomers.

I'm a student of Computer Architecture, and I always will be, no matter how old I am or who I work for. I write about stuff that I enjoy. I am currently happily employed, but you can view my resume and my poorly maintained homepage. You can view my other ramblings about HPC and Linux utilities as well.

peace.