Modern processors do not execute one instruction at a time, which is one of the reasons we rename registers (to recover the original, name-independent data flow). However, a stack machine isn't a significant hindrance to renaming as long as the ISA is otherwise amenable to superscalar fetch, decode, and rename.
The biggest reason we do not use stack machines today is that it's actually harder for optimizing compilers to generate good code for them, and it's awkward when joining control flow from multiple blocks (the stacks have to be in the same state).
However, amusingly, almost all modern processors do have a small-ish control stack: the RAS, or Return Address Stack predictor, which allows the frontend to fetch through calls and returns well before the actual instructions are executed. There's definitely some redundancy here, as everything is done again at execution time (and we check that the frontend got it right).
ADD: Knuth's MMIX and SPARC's register windows effectively implement a coarse stack, but both completely miss out on the instruction density that a true stack computer has.
The basic stack engine is also pretty neat: it tracks the value of the stack pointer separately and has its own adder for it. If all you use is PUSH and POP, you almost never have to wait for the value of the stack pointer to be available for the next instruction.
It seems like it would be great for implementing a FORTH if you used RSP to hold the data stack, but you'd constantly have to PUSH and POP the return addresses out of the way, somewhat defeating the RAS. I wonder which is more of a win?
The RAS assumption is that you don't overwrite the return address, which we verify at execution time. As long as you don't, your code will run fast. If you do, you'll take some very expensive pipeline restarts.
But you are right that x86 processors can (do they all?) track stack pointer movement in the frontend. For RISC you can track affine transformations of registers, or even just the ABI-designated stack pointer, but it's usually not worth the high cost.
They still make good targets for anyone getting their feet wet in compiler design and bytecode formats.
With a good macro assembler it is super easy to translate those stack opcodes into good-enough assembly.
It won't win any performance prizes, but it will give a sense of accomplishment, and even provide a path for bootstrapping, if wanted.
Then the whole SSA generation, register colouring and whatnot can come later, if the author is really interested in diving deep into compilers.
In the '70s-'90s I worked on Burroughs / Unisys B6700 - A Series computers. They are all stack based, it was very interesting how they worked, and there were some very cool things they could do. Languages like Algol, Cobol and Pascal ran really well. Multi-dimensional arrays took a big hit from needing to go through array descriptors, so big Fortran wasn't the best fit.
Currently it runs as an emulation on Intel hardware, so as a VM it would be the best example out there.
Chapter 3 of this book is pretty useful for that, as each of the instructions has a pseudocode description. Worth noting they only pop once for the ADD instruction, not twice!
Hilariously, the system has no logical OR, only AND and XOR and NOT, because "Inclusive-or is rarely needed."
This system was designed by Chuck Moore, father of Forth. Here is an entertaining video of him explaining the F18A stack machine and programming system:
This is such a simple machine. I am planning to make a tiny emulator for my site. One could probably write an emulator in 80 lines of Go (one goroutine for each of the 144 cores).
Aside, Koopman (author of this book) was the expert witness that had to sit in a room and read Toyota's ̶s̶o̶u̶r̶c̶e̶ ̶c̶o̶d̶e̶ for the "Unintended Acceleration" lawsuit.
Code density is the biggest one. If I want to add two numbers with a stack design, I have an add opcode; if I have 128 instructions, that's 7 bits of opcode, and that's all I need, because the add opcode pops the two numbers off the top of the stack and pushes the result back on. If I want to add two numbers with a register design, I need to specify which registers, which means with 16 registers I need at least 4 bits per operand: 12 bits just for the two sources and the destination. If you're writing for an embedded system, you can't get higher code density than with a stack machine.
The instructions can also execute faster, because I don't have to do things like decode register fields and look them up in the register rename table, but modern CPUs pipeline this shit so well you never have to worry about instruction latencies.
For register-based machines, one big advantage is that it's way easier for compilers to optimize. If I have a common value that I'm using constantly, on a stack-based arch I'd have to keep pushing that value onto the stack. With a register machine I just slap the value in a register and use it as an operand as many times as I need.
Quite. In practice, a high-performance processor today, either stack or register, will internally have little resemblance to the presented instruction set. The stack will become a huge array of registers that are the inputs and outputs to various computation units, and they will be tracked and renamed as necessary to align with the various slots in the stack, rather than actually moving things around on the stack. Just as the integer register file on a register machine becomes a huge array of registers that are renamed as necessary. Any difference between them, in terms of hardware implementation, has largely disappeared over the last couple of decades.
It's funny, because these are the same kind of properties that appealed to me in FP. You compose stuff without having to "encode" parameters. Less (unnecessary) information for my brain.
And to a point: point-free style in Haskell is an extreme of that, although at some point you start having to remember more information if you stack too many operators.
Performance. It takes longer to access performance-critical data on a stack machine than on a register machine. You spend a lot of cycles rolling the stack to retrieve the value you need.
On the other hand, the circuitry is simpler and implementing a compiler for such an architecture is close to trivial.
Early stack computers, especially those from Burroughs, were the first to be programmed mainly in a high-level language; assembly was barely used, if at all. This was revolutionary at the time.
Only if you're reading directly from memory. Most stack machines will have a scratchpad of 16 values that are on chip, much like registers. In the case of something like x87 you can even arbitrarily exchange them.