Wednesday’s launch of AMD’s new “Bulldozer” processor is widely seen as a make-or-break moment for the struggling chipmaker. Bulldozer’s architecture is highly unconventional, and it’s being pushed out into the market amid a high-profile leadership shakeup at AMD. It’s also tasked with taking on an Intel that is firing on all cylinders: Intel’s profits are at record levels, its “tick-tock” model for manufacturing process advancement is moving ahead flawlessly, and its upcoming tri-gate transistor introduction at the 22nm process node will give the company a massive one-off boost in efficiency and performance. So with this much riding on Bulldozer, and with this much of an uphill battle ahead of it, how does the launch version of the chip (codenamed Orochi) stack up against Intel’s Core family?
A look at the first three Bulldozer parts
Bulldozer comes to market in three flavors, the specs of which are listed in the chart below:
|Model||Threads||Base clock||Turbo clock||Max turbo||L2 cache||TDP||Price|
|FX-6100||6||3.3 GHz||3.6 GHz||3.9 GHz||6 MB||95 W||$165|
|FX-8120||8||3.1 GHz||3.4 GHz||4.0 GHz||8 MB||125 W||$205|
|FX-8150||8||3.6 GHz||3.9 GHz||4.2 GHz||8 MB||125 W||$245|
The first thing that will jump out to veteran CPU watchers about this chart is that the power consumption (the “TDP” column) and clockspeed numbers are quite high, significantly higher than comparable Intel parts. Bulldozer relies on higher clockspeeds to boost per-thread performance, and the chip pays for that in wattage, drawing more power than comparable Sandy Bridge chips. But Bulldozer also sports more threads per socket (six on the FX-6100, eight on the other two models) than all but the hyperthreaded Core i7, so the extra wattage is amortized over a higher number of threads, and the net per-thread efficiency should be similar … at least, depending on the workload, but more on that in a moment.
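To put rough numbers on that amortization argument, here is a minimal Python sketch that divides each part’s rated TDP and price by its thread count, using only the figures from the chart. Two caveats: TDP is a rated thermal envelope rather than measured power draw, and treating the third clock column as the maximum turbo frequency is an assumption on my part. The `PARTS` table and `per_thread` helper are illustrative names, not anything from AMD.

```python
# Specs copied from the chart above, one tuple per part:
# (threads, base GHz, assumed max turbo GHz, TDP in watts, price in USD)
PARTS = {
    "FX-6100": (6, 3.3, 3.9, 95, 165),
    "FX-8120": (8, 3.1, 4.0, 125, 205),
    "FX-8150": (8, 3.6, 4.2, 125, 245),
}


def per_thread(part):
    """Return rough per-thread figures for one part (illustrative only)."""
    threads, base_ghz, max_turbo_ghz, tdp_watts, price_usd = PARTS[part]
    return {
        # Rated TDP spread evenly over the hardware threads.
        "watts_per_thread": tdp_watts / threads,
        # List price spread evenly over the hardware threads.
        "dollars_per_thread": price_usd / threads,
        # How far the top turbo bin sits above the base clock, in percent.
        "max_turbo_uplift_pct": 100 * (max_turbo_ghz - base_ghz) / base_ghz,
    }


if __name__ == "__main__":
    for name in PARTS:
        stats = per_thread(name)
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

By this crude measure all three parts land at roughly 16 rated watts per thread, which is the sense in which the higher TDP is spread over more threads. It says nothing about how much work each thread actually gets done.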
You’ll also notice there’s no column giving the number of cores per processor — I’ve included only the number of threads. This gets at an interesting architectural wrinkle that is at the root of Bulldozer’s problems and its promise.
Each of the Orochi Bulldozer processors launched today has four “modules” per die. (The three-module FX-6100 still has all four modules, but one is disabled for yield and product binning reasons.) Each of these modules is sort of like a “core”, but not quite.
A typical CPU core consists of a front end, which takes in an instruction stream, plus an integer unit and a floating-point unit to which the front end sends instructions for execution. Each Bulldozer module, in contrast, has a front end, two integer units, and a floating-point unit; what’s more, each front end takes in two instruction streams simultaneously and feeds each one to its own integer unit.
This makes each Bulldozer module essentially a core-and-a-half, at least as far as integer code is concerned. But a better way to think of a Bulldozer module is as a single, dual-threaded CPU core where the integer units are replicated. A typical processor that supports simultaneous multithreading (SMT) replicates and/or enlarges storage structures like thread state, register files, and scheduling buffers and queues. A ‘Dozer module does all of this same replication, but it also replicates integer execution hardware. It’s this replication of integer execution hardware that is the main difference between a classic SMT design and Bulldozer.
So a four-module Bulldozer part supports up to eight threads of simultaneous execution, and sports a total of eight integer units and four floating-point units. Again, this makes it essentially a four-core SMT chip with double the integer resources.
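The resource arithmetic is simple enough to write down. The Python sketch below models a module exactly as described here (two threads, two integer units, one shared floating-point unit) and totals the resources for a given number of active modules. The `Module` class and `chip_resources` function are hypothetical names for illustration, and the model deliberately ignores everything else in the chip, such as caches and the front end’s actual throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """One Bulldozer module, simplified to the counts given in the text."""
    threads: int = 2         # two instruction streams per front end
    integer_units: int = 2   # one integer unit per thread
    fpus: int = 1            # a single FPU shared by both threads


def chip_resources(active_modules, module=Module()):
    """Total execution resources for a chip with this many active modules."""
    threads = active_modules * module.threads
    integer_units = active_modules * module.integer_units
    fpus = active_modules * module.fpus
    return {
        "threads": threads,
        "integer_units": integer_units,
        "fpus": fpus,
        # Share of a floating-point unit available to each thread
        # when every thread is busy.
        "fpus_per_thread": fpus / threads,
    }


if __name__ == "__main__":
    print("four modules: ", chip_resources(4))
    print("three modules:", chip_resources(3))
```

With every thread busy, each one gets a full integer unit but only half of a floating-point unit, whether three or four modules are active.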
As for off-chip I/O, Orochi sports four HyperTransport links and a dual-channel DDR3 controller.
Power gating keeps the chip’s idle power down, and a turbo feature lets it ramp its clock speed up in short bursts for extra horsepower.
The part is fabbed on GlobalFoundries’ 32nm high-K SOI process.
How it performs
Benchmark results show that Bulldozer scales well with clock speed increases, so it’s no wonder that AMD has pushed those frequency numbers up. But given the performance of this debut desktop part on the kinds of applications that normal users will want to run, AMD should’ve gotten the clock speed up even higher. For most desktop scenarios, Bulldozer in its current incarnation just doesn’t cut it in bang per buck versus Intel’s Core i5.
Probably the most obvious desktop application category, and one that has historically been near and dear to AMD’s heart, especially post-ATI merger, is gaming. Gaming is also where Bulldozer really falls down, performing at the middle of the pack and often below AMD’s older Phenom chips. It’s also the case that Bulldozer is no great shakes on anything related to image processing and encoding/decoding. But none of this is a surprise.
What image processing and gaming have in common is that both are floating-point intensive workloads that lend themselves to multithreading. Bulldozer has plenty of threads to go around, but as we saw above, every two threads share a single floating-point unit (FPU). So on a per-thread basis, Bulldozer is a bit starved for FPU bandwidth, and this is what keeps those gaming scores low.
The other place where Bulldozer is hurting is in the caching and memory subsystem. Anand’s and Scott’s benchmarks show that Bulldozer has cache latencies that are significantly higher than competing Intel chips, and this hurts performance on most types of code. Bulldozer also has only one dual-channel DDR3 controller to service eight threads of execution. Again, performance would no doubt greatly improve — especially on floating-point code, which is also typically bandwidth intensive — with another dual-channel controller.
So the double-whammy of scarce memory and FPU resources makes for a relatively weak showing versus much cheaper Core i5 parts in games and most media applications. This alone is going to be deadly to Bulldozer’s aspirations on the desktop.
Integer-intensive applications tell a different story, though, and this is what gives me some hope that Bulldozer will find a home in the datacenter. The chip did very well on Scott’s Zipfile compression benchmarks — this is a classic multithreaded integer workload, so it plays to the chip’s strengths. It also did well on encryption benchmarks, which are similarly integer-bound and amenable to parallel processing. This latter showing is especially important for ecommerce, where servers may have to handle a number of encrypted connections.
But these two bright spots are pretty small compared to the rest of the benchmarks where Bulldozer just didn’t measure up. In all, it’s a pretty thin reed on which to hang the case for AMD’s resurgence in the server market.
To make headway, AMD will have to get Bulldozer’s clockspeed and per-thread performance up, and will have to do something about the memory and caching bandwidth situation. I’m assuming that the Bulldozer server parts will have another memory controller, bringing the total number of DDR3 channels up to four; this will help.
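As a back-of-the-envelope check on how much a second controller would help, the Python sketch below compares DDR3 channels per thread for the desktop part with the four-channel configuration I’m assuming for servers. Holding the thread count at eight for the server case is a further assumption made purely for comparison, and `channels_per_thread` is an illustrative helper, not a real metric reported by any tool.

```python
def channels_per_thread(ddr3_channels, threads):
    """DDR3 channels available per hardware thread (a crude ratio)."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return ddr3_channels / threads


# Launch desktop part: one dual-channel controller, eight threads.
desktop = channels_per_thread(ddr3_channels=2, threads=8)

# Assumed server configuration: two dual-channel controllers.
# The eight-thread count here is an assumption for comparison only.
assumed_server = channels_per_thread(ddr3_channels=4, threads=8)

if __name__ == "__main__":
    print("desktop channels per thread:       ", desktop)
    print("assumed server channels per thread:", assumed_server)
    print("improvement factor:                ", assumed_server / desktop)
```

Doubling the channel count doubles this ratio, but it does nothing for the cache latency problem, which is a separate issue.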
The more fundamental problem is that Bulldozer’s novel architectural choices aren’t an obvious success. What AMD has done is to essentially double down on simultaneous multithreading, a technology whose performance was always very workload-dependent. This means that Bulldozer’s performance will be very sensitive to the workload type, no matter what AMD does. It may turn out, though, that some types of common cloud workloads will be a good fit for Bulldozer, and will give the part a performance/watt or performance/dollar advantage versus Intel. But right now, this is just speculation. What is certain is that AMD needed a home run, but Bulldozer, at least in this first incarnation, is a double at best.