Wednesday’s press event, at which ARM unveiled its Cortex A7 processor, was also a sort of second debut for the Cortex A15. The A15 will go into ARM tablets and some high-end smartphones during the second half of 2012, and it’s by far the best candidate for an ARM-based Macbook Air should Apple choose to take this route. Just as importantly, A15 will also go into the coming wave of ARM-based cloud server parts that have yet to be announced.
As part of the press materials for the A7 launch, ARM also released the first detailed block diagram—at least that I’ve been able to find—of the Cortex A15. The company also had the first working silicon of the A15 on display running Android. So let’s take a look at the A15 from top to bottom, because it is the medium-term future not only of the mobile gadgets that we all know and crave, but possibly of some of the servers that those devices will connect to.
Deeply pipelined, out-of-order
The A15 is a 15-stage out-of-order architecture, which makes it the same length as the venerable Intel Core 2 (Penryn) that has only recently been booted from all of Apple’s machines. Fifteen stages used to be a fairly long pipeline, but by today’s standards it’s modest. Nonetheless, this deep pipeline means that the chip will scale to higher clock frequencies (indeed, it will need to scale to higher frequencies to get more performance), and higher frequencies mean more power. So the A15’s pipeline is the first place where you can see the scales tipped slightly in favor of performance over power.
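The link between clock frequency and power can be made concrete with the standard first-order model for dynamic power in CMOS logic. This is a textbook approximation rather than a figure from ARM, and the symbols (α for switching activity, C for switched capacitance, V for supply voltage, f for clock frequency) are introduced here purely for illustration.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V^{2} \, f
```

At a fixed voltage, power grows linearly with f. In practice, reaching a higher f usually means raising V as well, so the real cost tends to be steeper than linear.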
The power impact of the A15’s pipeline depth pales in comparison, however, to the fact that the chip is out-of-order. With out-of-order processing, the sequential instruction stream that flows into the processor’s front-end is dynamically re-arranged before being executed; after execution it is put back in program order and the results are written back to memory. All of this rearranging significantly boosts performance, but it also dramatically boosts power consumption. In order to re-arrange the instruction stream so that it executes optimally, and then put that stream back in the original order, chip designers have to add a number of extra storage structures to the CPU: rename registers, issue queues, and some sort of bookkeeping apparatus for tracking instructions in-flight and then putting them back in the correct order.
These storage structures are often in operation even when most parts of the processor are unused and powered down, so they represent a single point of energy drain—or a “hot spot.” You can think of the extra hardware that goes along with out-of-order processing as a lone referee in a soccer game—some players on the field might be idle at any given moment because the ball isn’t anywhere near their position, but that ref is constantly on the run because he always has to be on the ball (so to speak). These out-of-order storage structures also have to be fast and have the appropriate number of read and write ports, which adds to their size, complexity, and power needs.
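To make the bookkeeping concrete, here is a minimal Python sketch of the idea: instructions finish executing in whatever order their operands and latencies allow, but a reorder-buffer-style structure only retires them in program order. The instruction tuples, latencies, and the simulate function are all hypothetical, invented for illustration; nothing here is taken from ARM’s design, and the model ignores issue width and queue capacity entirely.

```python
def simulate(program):
    """Toy out-of-order model.

    Each entry is (name, dest_register, source_registers, latency_in_cycles).
    Walking the program in order and always reading the newest producer of a
    register stands in for register renaming.
    """
    ready_at = {}   # register -> cycle at which its newest value is available
    finish = []     # cycle at which each instruction finishes executing
    for name, dest, srcs, latency in program:
        start = max((ready_at.get(r, 0) for r in srcs), default=0)
        finish.append(start + latency)
        ready_at[dest] = finish[-1]

    # Completion order: earliest finish first, ties broken by program order.
    order = sorted(range(len(program)), key=lambda i: (finish[i], i))
    completion = [program[i][0] for i in order]

    # Retirement: the head of the buffer cannot retire before it is done,
    # and nothing behind it can retire before the head does.
    retired = []
    head_cycle = 0
    for ins, done in zip(program, finish):
        head_cycle = max(head_cycle, done)
        retired.append((ins[0], head_cycle))
    return completion, retired


# Hypothetical four-instruction program: a slow load feeding an add,
# followed by an independent multiply feeding a subtract.
PROGRAM = [
    ("load", "r1", [], 4),
    ("add", "r2", ["r1"], 1),
    ("mul", "r3", [], 3),
    ("sub", "r4", ["r3"], 1),
]

if __name__ == "__main__":
    completion, retired = simulate(PROGRAM)
    print(completion)  # ['mul', 'load', 'sub', 'add']
    print(retired)     # [('load', 4), ('add', 5), ('mul', 5), ('sub', 5)]
```

The multiply finishes first, yet it cannot retire until the load and add ahead of it are done. Holding that state for every in-flight instruction is the work that keeps these structures busy.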
The block diagram
The block diagram below shows the A15’s pipeline in some detail. The pipeline starts with a five-stage fetch phase, where instructions are fetched from the L1 and predecoded. Instructions then move into the decode phase, where they’re decoded into micro-ops. (Yes, despite the fact that ARM is the most classic of classic RISC architectures, A15 still uses the very old “CISCy” trick of decoding ISA instructions into a smaller internal instruction format for reordering.) After the decode phase, any rename registers are assigned, and then the instructions are dispatched.
There is also a loop cache in the decode phase, a feature that is becoming common in processor front ends. The loop cache is a place to store the instructions that make up a loop kernel in decoded, micro-op form, so that the front end doesn’t have to decode them again and again on each loop iteration. This feature saves power and effectively boosts decode bandwidth.
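The saving is easy to see in a toy model. The sketch below counts how many decode operations a front end performs on a stream of fetched addresses, with and without a small cache of already-decoded instructions. The capacity, the FIFO eviction policy, and the count_decodes function are assumptions made for illustration; ARM’s materials as described here do not specify how the A15’s loop cache is organized.

```python
def count_decodes(trace, loop_cache_entries):
    """Count decode operations for a sequence of fetched instruction addresses.

    loop_cache_entries is the (hypothetical) number of decoded instructions
    the loop cache can hold; 0 disables it.
    """
    cache = {}    # address -> decoded micro-ops, kept in insertion order
    decodes = 0
    for addr in trace:
        if addr in cache:
            continue                      # hit: reuse the stored micro-ops
        decodes += 1                      # miss: decode the instruction
        if loop_cache_entries > 0:
            if len(cache) >= loop_cache_entries:
                cache.pop(next(iter(cache)))   # evict the oldest entry (FIFO)
            cache[addr] = f"uops@{addr:#x}"
    return decodes


# A four-instruction loop kernel executed ten times.
LOOP_TRACE = [0x100 + 4 * i for i in range(4)] * 10

if __name__ == "__main__":
    print(count_decodes(LOOP_TRACE, 0))  # 40: every fetch is decoded
    print(count_decodes(LOOP_TRACE, 8))  # 4: only the first iteration is decoded
```

Once the kernel fits in the cache, every iteration after the first skips the decoders entirely, which is where both the power saving and the extra effective decode bandwidth come from.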
A15 can dispatch up to three instructions per cycle to one of eight issue queues. Like the A15’s pipeline depth, a dispatch width of three instructions per cycle is relatively modest by today’s standards. Some architectures do more, but you quickly reach a point of diminishing returns after three per cycle because instruction-level parallelism for most code just doesn’t go that high. So a four-wide dispatch would be aggressive on a part designed for mobile, which is why they went with three-wide.
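The diminishing-returns argument can be illustrated with a small scheduling model. The sketch below executes a synthetic instruction stream on machines of different widths and reports how many cycles each needs. The stream is built from three interleaved dependency chains so that its available parallelism is exactly three; that figure is an assumption chosen to mirror the argument above, not a measurement of real code, and cycles_needed is a hypothetical helper rather than a model of the A15.

```python
def cycles_needed(deps, width):
    """Cycles to execute a stream on a machine that starts `width` ops per cycle.

    deps[i] lists the earlier instructions that instruction i depends on.
    Every instruction takes one cycle, and a result is usable the cycle
    after its producer executes.
    """
    done = [False] * len(deps)
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:               # oldest instructions get priority
            if len(issued) == width:
                break
            if all(done[d] for d in deps[i]):
                issued.append(i)
        for i in issued:                  # results become visible next cycle
            done[i] = True
        remaining = [i for i in remaining if not done[i]]
    return cycle


# 30 instructions; instruction i depends on instruction i - 3, which gives
# three independent chains interleaved in program order.
STREAM = [[i - 3] if i >= 3 else [] for i in range(30)]

if __name__ == "__main__":
    for width in (1, 2, 3, 4):
        print(width, cycles_needed(STREAM, width))
    # 1 30
    # 2 15
    # 3 10
    # 4 10
```

Going from two-wide to three-wide cuts the cycle count by a third, while the fourth slot buys nothing because the code never has a fourth independent instruction ready.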
Instructions issue to one of eight pipelines: five arithmetic-logic pipes, one branch pipe, and two memory pipes. Three of the five arithmetic logic units (ALUs) are scalar integer units. Two of these appear to be single-cycle units used only for simple instructions, and the third is a four-cycle multiply unit for more complex instructions.
As is common nowadays, the floating-point and vector units are combined into a two-pipeline floating-point/vector unit; the A15’s is 10 stages deep. This gives the A15 a modest but still respectable amount of floating-point horsepower, but it probably won’t need much more than this in its target applications. In smartphone and tablet SoCs, the A15 will likely be paired with some sort of helper cores, like the two ARM M4 cores in TI’s OMAP 5430 SoC. These smaller cores consist solely of ARM vector units (i.e., they implement the NEON ISA extension), so some vector processing can be offloaded to them. Other vector workloads will no doubt be sent to the GPU on SoCs where that’s possible. And in its cloud server incarnation, A15 will mostly be doing integer workloads, so there’s not much need for real FPU horsepower.
To round out our discussion of the A15’s back-end, it has the typical two memory pipelines (load and store), which are basically just limited integer units that are used solely for address generation. Then there’s a branch unit for executing branches. Note that separating branch execution out into its own pipeline with a dedicated issue queue is also a classic feature of the PowerPC family. With Intel products, the hardware cluster that does the branch calculations is lumped in with the integer ALUs, and it certainly doesn’t get a separate issue queue.
The A15 in context
The Cortex A15 is in many ways a fairly straightforward iteration of a basic block diagram that has been around in one form or another since commodity processors made the jump to out-of-order processing. It falls well within a general lineage that goes back to the Pentium Pro and the PowerPC 604. This isn’t by any means a slam on A15, because the same thing can be said of current designs from Intel. (AMD’s Bulldozer is sort of an exception, and the results of its deviations from the norm are mixed so far.)
In terms of current-generation processors, the A15’s closest peer is probably AMD’s Bobcat, which is essentially an out-of-order take on Intel’s Atom, right down to the fact that it dispatches only two instructions per cycle (compare the A15’s three) to the execution core. Intel’s in-order Atom will also be a direct competitor, but I think it’s likely that Intel will go out-of-order with Atom when the time is right, producing something that looks very much like Bobcat and A15. My best estimate would be that they’ll do this at 22nm.
All three of these chips (ARM’s Cortex A15, Intel’s Atom, and AMD’s Bobcat) will target both mobile devices and the cloud server space. With regard to the latter, all of them are well suited to lightweight server tasks, like running a web server. In a micro-server configuration like those from SeaMicro or Smoothstone, all three will offer more power efficiency for many types of cloud server tasks than larger, more complex parts like Intel’s Xeon.
The Macbook Air question
On the client side, it’s a given that a future version of the iPad will be based on the A15, either alone or in a big.LITTLE configuration with the A7 (my money’s on the latter). It will also be the first ARM part that’s truly a fit for the laptop space, a fact that’s important not only for the likes of the Chromebook and possible Android laptops, but also potentially for Apple. There have been persistent rumors that Apple will move the Macbook Air to ARM, rumors that I imagine find their origin in the fact that Apple certainly has an up-to-date internal ARM port of OS X running on test hardware. In this regard, there are a few important things to keep in mind.
Years ago, I heard the back-story on Apple’s switch to Intel first-hand from some folks on the IBM side of things, and what I learned was that Steve Jobs agonized over this decision and waited until the morning of the keynote before pulling the trigger on this move. He actually went into that day with two keynote presentations prepared: one for a PowerPC-based product line, and one for The Switch. When he pulled out The Switch presentation, the IBM team was absolutely as stunned as the rest of the world, as was the P.A. Semi team who had been separately assured by Jobs that their dual-core PowerPC part would find its way into Apple portables.
This little anecdote confirms three things about Jobs and Apple that long-time Apple CPU watchers know very well:
1) As he said many times, Steve Jobs liked to have options. From the very beginning of OS X, Apple kept an internal, top-secret x86 port of the entire OS up-to-date, just in case Jobs might one day decide to make the change that he eventually in fact did make.
2) For whatever reasons of ego and romance, Steve Jobs also liked to have his very own, non-commodity CPU hardware. This attachment to proprietary hardware is why Apple was one of the original ARM partners at the latter’s launch, it’s why Apple stuck with PowerPC through the lean years, and it’s why Jobs agonized over the decision to finally give up on the dream and go with Intel.
3) Jobs was notoriously capricious and mercurial when it came to making big hardware decisions, much to the chagrin of Apple’s processor partners. The story of Mac CPUs is a story of surprise and heartbreak—not so much heartbreak on the consumer’s part, though there was that, but heartbreak on the part of Motorola, IBM, and P.A. Semi.
Number 1 above is my reason for saying that it’s obvious that Apple has a working, internal prototype of an ARM-based Macbook Air running an ARM port of Lion. Numbers 2 and 3 are why I would guess that if Steve Jobs were still with us, we’d eventually see an A15-based Macbook Air. But the fact that he’s gone makes this significantly less likely, though not necessarily impossible.
Intel’s Ivy Bridge mobile parts will be unbelievably competitive from a performance/watt standpoint, and anyone sane is going to prefer them in the Macbook Air over the A15. Staying with Intel through at least 2013 makes the most technical and business sense to any rational observer. But when Jobs was alive, there was always the chance that he would make the jump to A15 out of a combination of his lifelong obsession with the idea of having his very own, boutique CPU hardware and his notorious mercurialness. Now that he’s gone, however, Apple will be free to do the sensible thing and unplug the ARM-based Macbook Air prototype, so that they can spend their personnel resources on something productive. Whether they’ll actually do that is anyone’s guess, but I think they should.