It's interesting to see a very x86_64-like attempt to shake off the weirdness of the ancestral architecture here. The PC is no longer an addressable register. Thumb has been dropped. The predication bits are no more. The weird register aliasing thing done by NEON is gone too. The register banking (and it seems most of the interrupt architecture) is entirely different.
And, just like in the Intel world, market pressures have introduced all new CISC quirks: AES and SHA256 instructions, for example.
But of course an architecture document does not a circuit make. All the weirdness (old and new) needs to be supported for compatibility (OK, maybe they can drop Jazelle), so the fact that they no longer talk about some things doesn't really save them any transistors in practice.
Honestly, this is sounding more like an Intel or AMD part, not less.
Is having hardware acceleration for AES and SHA256 really a "CISC quirk", or just a really specialized set of arithmetic instructions? The classic RISC idea of making the core simple and fast doesn't really apply here; internally, it's all simple micro-operations driving special-purpose hardware. It seems similar to having a fused multiply-accumulate operation: they've figured out how to accelerate the core of a common task, and this is the API they've decided to give it.
That's actually an almost reasonable definition of CISC for laypeople. Take a look at the definition on Wikipedia.
> A complex instruction set computer (CISC, /ˈsɪsk/)
> is a computer where single instructions can execute several
> low-level operations (such as a load from memory, an arithmetic
> operation, and a memory store) and/or are capable of multi-step
> operations or addressing modes within single instructions.
Now the reality is that on the whole things are not quite as cut and dried. In this case they're doing it to give access to dedicated hardware, most likely for power gains, which is why something that's typically close to RISC would add something like that. As time has gone on, both CISC and RISC systems have moved toward a blend of both in order to get the best of both worlds. From what I've heard, internally most x86 chips actually work like a RISC chip; they just translate between the two in the instruction decoder.
> A complex instruction set computer (CISC, /ˈsɪsk/)
> is a computer where single instructions ... and/or are
> capable of multi-step operations ... within single
> instructions.
What's a "multi-step operation"?
I ask because I worked on the microarchitecture (read "implementation") of a microprocessor that had what was generally regarded as a very RISC instruction set.
Yet, almost every instruction had multiple steps. Yes, including integer add.
Were we doing something wrong?
And no, "one cycle fundamental operations" doesn't change things. Dividing things into cycles is a design choice. For example, one might reasonably do integer adds in two steps.
If those ops are register-register, how are they necessarily not-RISC?
Yes, division is inherently more complex than bitwise NAND, but it's not obvious to me where the line is that you find so clear.
FWIW, I've seen a very serious architecture proposal that used two instructions for memory-reads. (It had one instruction for memory writes.) Along those lines, register-value fetch can be moved into a separate instruction....
The SSE1 instructions provide the option of register-register, but also support register-memory. I didn't realize they supported register-register mode, so now I see why it would be less obvious to you.
Why is copying a value from register to memory (or memory to register) "RISC" while performing some logical operation on the value as it moves "not RISC"?
I'd agree that memory-to-memory is "not RISC", but given the amount of work necessary to do a register access, it's unclear why doing work on a value is "not RISC".
Datapaths are NOT the complex part of a microprocessor.
Mmm, I think RISC has to be a relative term. (It does, after all, have "reduced" in its name, which implies a comparison with a less-reduced alternative.) So every time processor A has an instruction that can only be done with a sequence of several instructions on processor B, that is evidence that B is more RISCy than A.
One definition of RISC is that every instruction should take one cycle and thus any instruction that takes longer is CISC. This led to MIPS not having multiply, for example.
A different definition is that RISC should not have any instructions that could be just as efficiently broken into multiple simpler general-purpose instructions. For example, a memory-register architecture can do a load-and-add in one instruction but RISC prefers separate load and add instructions that take the same time. In this view AES instructions are justified as RISC because implementing an AES round with multiple simple instructions is much slower (6x in Intel's case).
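One way to see that gap for yourself is openssl's built-in benchmark, which can time its generic C AES implementation against the EVP path that uses the hardware AES instructions when the CPU has them. (This is a sketch: it assumes an openssl binary is installed and recent enough to support the -seconds flag.)

```shell
# Benchmark AES-128-CBC twice. The plain algorithm name uses openssl's
# generic software implementation; the -evp form goes through the EVP
# layer, which picks up AES-NI (or the ARMv8 AES instructions) if present.
# -seconds 1 just keeps each run short.
openssl speed -seconds 1 aes-128-cbc
openssl speed -seconds 1 -evp aes-128-cbc
```

On a CPU with hardware AES, the second run typically reports several times the throughput of the first; without it, the two numbers come out comparable.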
> One definition of RISC is that every instruction should take one cycle and thus any instruction that takes longer is CISC. This led to MIPS not having multiply, for example.
Err, if that was ever really a "RISC" thing, it got dropped quickly. I'm not even sure it's possible to create a sane architecture that runs one cycle per instruction: you need two clock edges just to load and store data from registers, let alone operating on the data. However, optimizing the pipeline so instructions are effectively one cycle makes sense; only one memory cycle per instruction makes sense.
Making AES and SHA instructions doesn't really cohere to any definition of RISC I've ever seen: mostly, define as few instructions as possible, because you don't have many opcodes to work with in a fixed-size instruction word. However, I'm also not opposed to these instructions through some dogmatic belief: I think encryption is important enough these days to be optimized to the greatest possible extent without sacrificing general-purpose functionality.
The RISC vs CISC debate has been dead for years. Doubly so ever since we found the limits of scaling clock frequencies ever higher. After all, the RISC movement started as a reaction to the difficulties of scaling the architectures of the day to faster clock frequencies. For years now (decades, really) CPU designers have been concentrating on doing more work per clock cycle, which is rather anti-RISC. So the only questions that matter are "can we implement this feature efficiently?" and "does this feature provide enough performance or power gain for the implementation cost?"
I don't think it's quite dead yet; the performance/power hit for decoding x86-64 instructions is significant, just to decode to a RISC-like microcode anyway. However, that may be more of a statement about x86-64 than it is about CISC in general. Certainly, the days when CISC made any sense at all, mainly to ease assembly programming, are long gone; remember the 8086's string instructions? Yeah, neither does anyone else.
However - I think that x86 is so deeply entrenched, and x86 processors are so refined these days, that the value of the architecture is in the software and the investment in the chip design, not in the architecture itself. I think if the PC industry were to start over again, it would go with some kind of POWER variant.
Regardless of CISC vs RISC, I do agree - SIMD and many-core/stream multiprocessing will make far more difference than the instruction and register flavor used on each core.
Well, the fact that x86 encoding is suboptimal is also a dead debate. If AMD had had the resources of Intel, or if Intel hadn't botched IA-64 so badly and actually licensed it to AMD, x86-64 would have better instruction encoding, no question. (seriously, like all of the unused/slow instructions have 1 byte opcodes)
Anyway, my point is that pure CISC designs (as much as that means anything) obviously lost ages ago. Pure RISC also lost as frequencies plateaued, or perhaps more accurately never really won; CPU designers care about what makes CPUs more performant, not abstract ideology. So we get stuff that runs counter to RISC ideals: SIMD, VLIW, out-of-order execution, and highly specialized instructions like AES and conditionals.
Yes, I agree wholeheartedly. I still think RISC and CISC have value as terms, however vague, because they succinctly summarize trade-offs. I fully realize that today's processors are hybrids of many techniques, and that's a good thing.
>the performance/power hit for decoding x86-64 instructions is significant, just to decode to a RISC-like microcode anyway. However, that may be more of a statement about x86-64 than it is about CISC in general. Certainly, the days when CISC made any sense at all, mainly to ease assembly programming, is long gone
CISC still has an advantage in that it effectively compresses your instruction stream, meaning you can fit more in cache
Yes, it's too bad cleaning up the architecture doesn't necessarily clean up the physical design. As GPUs have been more recent entrants to the general-purpose space, it is clear they are trying to avoid the same mistakes. The only place you will find a true GPU binary is buried deep in the memory of the runtime stack (for NVIDIA at least; not sure about AMD).
Right. My understanding is that NVIDIA has mucked around with their low level instructions at every iteration. I remember reading somewhere that with Kepler the hardware doesn't even have dependency interlocks -- the compiler is responsible for scheduling instructions such that they don't use results that aren't ready yet.
But at the same time, the lack of a clear specification and backwards compatibility means that the software stack needs to deal with all-new bugs (both hardware and software) at every iteration. That puts, IMHO, a pretty firm cap on the "asymptotic quality" of the stack: you're constantly chasing bugs until the new version comes out. So you'll never see a GPU toolchain of the quality we expect from gcc (or LLVM, though that isn't quite as mature).
> The one surprise in ARMv8, is the omission of any explicit support for multi-threading. Nearly every other major architecture, x86, MIPS, SPARC, and Power has support for multi-threading and at least one or two multi-threaded implementations.
What does this even mean? Are they talking about atomic operations? Hyperthreading?
I don't see how it could be about hyperthreading, since that's a CPU implementation detail and mostly unrelated to the instruction set. Maybe it's referring to specifying memory consistency behavior and support.
How much does that actually help? In my extremely fuzzy memory, it only worked out to around a 30% increase in ideal situations. I'd rather see them work on features that can be exploited with less voodoo.... like hardware 64-bit support, or SIMD support, or HTM, or hell, clock rate.
Intel HT [1] originally was like that (if your code runs in 1.0s single-threaded, ideally it will run in ~0.77s multi-threaded).
The main problem with hyperthreading is that each CPU generation has been so different and software's only decision is in binding to unique cores and hoping the performance is better. AMD's Bulldozer hasn't helped either.
On the other hand, most of Intel's big markets tend to use pretty inefficient code (very low IPC), and that's where HT makes a lot of sense. ARM cores are typically running a pretty tight ship. So it makes me laugh when I see Atom includes HT.
I figured Atom had hyperthreading because it was Intel's first in-order x86 core in over a decade, so compilers had forgotten how to schedule x86 code, so there were lots of stalls in the ALUs that a second thread could make good use of. Plus scheduling for Atom is pretty hard in part due to the lack of registers in x86.
Additionally, Ars argues [1] that from a performance per watt perspective, hyperthreading makes more sense with x86 and two cores makes more sense with ARM
You want to read Agner Fog's article, How good is hyperthreading?, http://www.agner.org/optimize/blog/read.php?i=6. As an aside, Agner is one of those rare people who only writes when he has extremely valuable things to say. His entire website is worth a read.
Does he actually answer that question? I read the main post, which seemed to conclude if it's good, it's good, and if it's bad, it's bad. Then there's some replies, and finally a single (negative) number presented for the Rybka chess engine. What about programs that aren't chess engines?
But it's 30% you get for basically free. I kind of thought HT was mostly a gimmick (look, now with 256 virtual CPUs), but changed my mind since it doesn't cost anything (in terms of die space) to add it to a chip. 30% more performance for 1% more cost is a better deal than 100% more performance for 100% more cost, assuming you can live with only 30% more performance.
I should add I think what AMD is doing with Bulldozer (claiming two virtual cores are actually full cores) is bullshit.
> I should add I think what AMD is doing with Bulldozer (claiming two virtual cores are actually full cores) is bullshit.
I think AMD is doing whatever it can to get people to buy its CPUs. If it weren't for their ATI purchase, I think they'd be basically dead by now. It still amazes me how far they've fallen: I built my first computer with an AMD X2 when I was 15 (6 years ago now) - they looked like they were going to upset Intel as deciding the future of x86 chips. They did for a while - we got a sane 64-bit architecture out of it. I'm not sure where they went wrong: was it marketing, was it manufacturing tech, was it profit margins, was it Apple? I don't even know if their current processors are competitive or not in the performance market - things like "Bulldozer" make me think not.
Anyway, could SMT be implemented on top of ARM v8? My knowledge of hardware doesn't include multithreading. However, from my limited understanding of it, I don't see SMT making much difference in tight RISC code, which is designed to have a high instruction throughput per cycle, leaving little for instruction reordering to optimize.
One way to think of SMT is context switches for free, and lots of them. What happens when you run two processes on one core? Every 10ms the kernel copies out all the registers from one process to memory, copies in the regs for the other, and switches. What happens when you use SMT? Every "2" instructions the CPU switches from one process to the other, transparently, without hitting memory. After 20ms, the same amount of work is done, possibly a little more, and if process two only had 1ms of work to do, it doesn't have to wait the full 10ms timeslice of process one.
SMT is not about instruction reordering at all (within one process). Just like the OS switches between processes whenever you wait for disk, now the CPU switches processes whenever you wait for memory. It just happens that virtual cores are the way the OS programs the CPU scheduler.
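You can see how the OS views those virtual cores directly on Linux; this assumes the standard sysfs CPU topology layout that modern kernels expose:

```shell
# Each logical CPU lists the hardware threads it shares a physical core
# with. A lone "0" means cpu0 has no SMT sibling; something like "0,4"
# or "0-1" means two hardware threads share one core, and the scheduler
# still treats each as a separate (virtual) CPU.
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```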
It's true that AMD is being disingenuous with Bulldozer, but on the other hand their SMT threads share fewer resources than other implementations (they have separate integer execution units, for example, which makes them much closer to "full cores").
A question about how HN works. I'd submitted the same article to HN at a much less opportune time, so it fell off the "new" page before it got its first upvotes. [1]
Normally when somebody then resubmits the same article at a better time I thought they had to add a '~' at the end of the URL or something, but I don't see anything like that in this case. So how'd they do it?
(And I should say I'm glad that you all get to see this article, so thank you enos_feedler).
Honest question since I'm confused about terminology: Why is ARMv8 described in places as "backwards compatibility for existing 32-bit software" when some existing instructions will be removed in AArch64?
It looks like an interesting article, so it is a shame that it was split into 5 pages with no way to view everything on one page. I have no recourse but to not read the article at all.
> I have no recourse but to not read the article at all
You could click the next button 4 times and read the full article. There's lots of content on each page. It would've taken 100% less typing than this complaint, and you would've spent that time learning instead of grumbling.
It's a real shame you can't read books either. Whole libraries of documents split into pages with no "view all" button.
There's a big difference between turning a page the size of your hand when you are already holding the book and trying to click a micro button the size of a word after using the arrow keys to scroll the browser window.
I'll just wait until the exact same information appears on a single page. I was expressing sincere regret because I liked the first page, but I absolutely will not read paginated articles.
Also: you're obviously irritated by my grumbling, but grumbling about it is just a massive load of hypocrisy, so please realize I'm not going to be taking any of your comments all that seriously.
~ $ curl -s 'http://www.realworldtech.com/arm64/'{1..5}'/' --compressed > a.html; open a.html
("open" is OS X-specific; "xdg-open" does the same job on most Linux desktops.) Interestingly, that website seems to deliver gzip-compressed output no matter what you request.
Relevant terms are "brace expansion" and "range". And, um, at least for me, the command I wrote works verbatim in zsh. (I think zsh is supposed to be bash-compatible like that.) Brace expansion works like this (in zsh and bash):
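For example (in bash or zsh; POSIX sh doesn't do brace expansion):

```shell
# The shell rewrites the brace range before the command runs, so echo
# receives five separate arguments:
echo page{1..5}.html
# → page1.html page2.html page3.html page4.html page5.html
```

The curl command above works the same way: the quoted URL pieces get glued onto each expanded number, producing the five page URLs.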
In Chrome: typing Ctrl+F <space>2<space> <Esc> <Enter> will get you to the next page, no need to use the mouse if you're concerned about moving your hands.
It scrapes down and combines multi-page articles like this with a click for on or offline reading. Great interface and mobile apps too, I use it all the time.