Memory Is Holding Up the Moore's Law Progression of Processing Power (2014)

nkurz · on Oct 19, 2015

I try to refrain from purely negative commentary on articles, but yuck! How can one hope to say anything useful about "memory performance" without once using the terms "latency" or "bandwidth"? Memory performance is getting higher in the same way that processor performance is going up:

http://www.corsair.com/en-us/blog/2015/september/ddr3_vs_ddr...

Latency is to frequency as bandwidth is to parallelism. Single core CPU frequencies are relatively stable, but parallelism offers the opportunity to get more done each cycle. Latency for random access is holding quite steady, but caches are getting bigger and faster, and bandwidth is going up.

The key is figuring out how to write software that takes advantage of spatial locality and available bandwidth rather than getting choked by the latency. This is hard in the same way that taking advantage of multiple cores is hard: it requires a different approach, but is not a fundamental limitation.

Lots of memory access isn't the problem. Long latency isn't the problem. The problem is designing your program so that it generates lots of long latency memory accesses and then grinds to a halt in the presence of this latency. I think the summary from this recent paper is spot on:

  Our conclusion is contrary to our expectations and to   
  previous findings and goes against conventional wisdom, 
  which states that accesses to RAM are slow, and should be 
  minimized. A more accurate statement, that accounts for our 
  findings, is that accesses to RAM have high latency and 
  this latency needs to be mitigated.

http://arxiv.org/pdf/1509.05053v1.pdf

vardump · on Oct 19, 2015

> I try to refrain from purely negative commentary on articles, but yuck! How can one hope to say anything useful about "memory performance" without once using the terms "latency" or "bandwidth"?

Yeah, it is pretty clear the author didn't really understand much of the topic. Even if you have lightning fast memory, there's still the issue of connecting it. Long traces and multiple levels of multiplexing are going to add latency no matter how amazing memory tech is being employed. Before long, to increase bandwidth, memory needs to be integrated in the same package as the CPU if not on the same die. At least large eDRAM-style 'caches'. You can't economically run 1024 traces just for memory on the mainboard PCB!

> The key is figuring out how to write software that takes advantage of spatial locality and available bandwidth rather than getting choked by the latency. This is hard in the same way that taking advantage of multiple cores is hard: it requires a different approach, but is not a fundamental limitation.

Indeed, of course this requires skilled labor for now, until compilers catch up one day. And unfortunately code that truly requires random access is just not going to perform well on modern hardware. Up until early nineties, memory was faster than processors. Random access was just fine. Not so anymore.

It's also easy to get bandwidth limited with SSE and AVX. Although line fill etc. buffers seem to often bottleneck first per CPU core. For scalar code, being bandwidth limited is not going to happen.

The issue nowadays is that machines are so unique snowflakes performance and configuration wise. It's not hard to max out a single configuration, but it is hard to write something that performs decently across different system configurations. There are like 30 instruction set extensions for x86, variations in reorder buffer depth, variations in cache latency, size and associativity. And of course memory bus configurations, number and interleaving of memory channels, NUMA, DRAM memory page size (1, 2, 4 kB), etc.

For example unlike 8-way associative L2 cache on Sandy Bridge, Haswell, etc., on Skylake it is now 4-way instead. Code that is optimised for 8-way L2 cache might be pathologically invalidating L2 cache lines on Skylake.

Those aspects matter, because high performance is often a balancing act between available features, bandwidth and CPU power.

Filligree · on Oct 19, 2015

And while you're at it, it's hard to figure out what's going on in your system. I've spent some time looking, and have not yet found anything resembling a bandwidth monitor for main memory.

nkurz · on Oct 19, 2015

It's possible, but depending on your OS and processor it might be quite difficult. There are "Uncore" performance counters for this, but the means of access has been changing from generation to generation and the software sometimes lags behind.

For Linux and Haswell EP, I've had luck with likwid: https://github.com/RRZE-HPC/likwid/wiki/Haswell-EP#memory-co...

You also might have luck with Andi Kleen's pmu-tools: https://github.com/andikleen/pmu-tools/blob/master/ucevent/R...

I don't know for sure if Intel's VTune supports these counters, but I'd presume it would: https://software.intel.com/en-us/intel-vtune-amplifier-xe

Filligree · on Oct 24, 2015

Doesn't seem like any of them work with my E3 Xeon.

ghshephard · on Oct 19, 2015

Just as a side note - after working at an RF Networking company for 10+ years, the phrase "bandwidth" (which in RF, is literally the "width of the band", e.g. 200 kHz ) is endlessly confusing, particularly when you hear engineers use phrases like, "we've got more bandwidth resulting in greater range but lower data rate with this signaling scheme."

The RF PHY/MAC engineers use data rate, and bandwidth 100% consistently, and everyone else I've ever met, uses "bandwidth" as a synonym for "data rate" - I'm surprised there isn't more confusion between the two groups of people.

At least all groups measures data rate and "bandwidth" with megabits/sec == 1x10e6 bits/second. Thankfully that confusion never entered the data comms world.

h1290799 · on Oct 19, 2015

There is nothing hard about this. You either have a "problem" that can take advantage of multiple cores or sequential reading from memory or you don't. The only way to avoid this is to change the "problem"

MichaelGG · on Oct 19, 2015

Many times you'll have code that doesn't take advantage of multiple chores and sequential memory, but could. Redesigning it is what's hard.

e5f34f89 · on Oct 19, 2015

A much bigger problem "holding up Moore's Law Progression" is the failure of Dennard scaling and the fact that voltage scaling is hitting the threshold voltage limit (where sub-threshold leakage current increases significantly) as we move to smaller technology nodes. This means we can build bigger chips but we don't necessarily have the power budget to power up all parts of it at the same time (these could be cores, pipeline structures, etc). The architecture community has written a lot on this "Dark Silicon" problem if anyone wants to read further.

oldmanjay · on Oct 19, 2015

I'm not certain who this article is targeting, but it reminds of an acquaintance who would ask me detailed technical questions, misunderstand all my answers, and then two days later misinform me of the things I told him as if he were teaching me.

advancedprivacy · on Oct 19, 2015

Moore's Law is about transistor count. Not about frequency, bandwidth or "speed"... Is everybody ignoring this fact?

Symmetry · on Oct 19, 2015

Moore's 1965 paper was about transistor count doubling every 12 months rather than frequency but "Moore's Law" didn't come into use as a term until 1975, by which time Moore was giving shrinking feature sizes equal billing[1]. And Moore himself wrote a memo endorsing a more general use of the term "Moore's Law" for any chip performance metric that doubles regularly. I can't find a copy online but I have it in my old Computer Architecture lecture notes. And performance and scaling were identical as long as Dennard scaling[2] lasted.

[1]http://www.eng.auburn.edu/~agrawvd/COURSE/E7770_Spr07/READ/G...

[2]https://en.wikipedia.org/wiki/Dennard_scaling

bluetomcat · on Oct 19, 2015

In the past, the increase of transistor count and density meant that a CPU could be designed to run at a higher frequency and have cleverer circuitry which allowed it to do more each cycle. This is the source of the misconception that Moore's Law is directly tied to performance.

Looking at a modern Intel chip, the CPU cores take about 30% of the area, with the rest of the transistors spent elsewhere (L3 cache, integrated graphics, system and memory controllers).

Symmetry · on Oct 19, 2015

When you're talking about memory performance you always have to include the latency, the bandwidth, and the size of the memory pool. 64kb memory pools have scaled in latency and bandwidth at the same rate as processing power - now they're sitting deep in the heart of the chip as the L1 cache.

anonmeow · on Oct 19, 2015

Server CPUs have >2x memory channels when compared to consumer CPUs. IBM Power CPUs show that it's possible to get even more memory bandwidth than in mainstream Xeons. Looks like low RAM bandwidth in consumer CPUs is a mostly artificial differentiator to discourage use of these parts in servers.

On the other hand there are HMC and HBM technologies that offer order of magnitude more bandwidth and several times less latency. They are already used in AMD gpus as well as in prototypes of Nvidia's pascal gpu and Intel's Knight's corner 60-core cpu http://www.theplatform.net/2015/03/25/more-knights-landing-x...

I hope HBM comes to consumer CPUs too, but with current lack of competition in the market it can take a long time.

ck2 · on Oct 19, 2015

No it's not.

Since skylake comes in both DDR4 and DDR3 motherboard versions, various benchmarks have come out to test the difference for the state-of-the-art 14nm cpu with the "improved" memory vs the "old" memory.

And the difference is often only 1-2%

Unless maybe the goal should be to put 32gb of memory right on the cpu die

vardump · on Oct 19, 2015

Different software needs different resources.

Unsurprisingly, if the software in question is not bandwidth limited, providing more bandwidth is not going to speed it up. Most software is like that.

It also truly depends on how it was optimized. The software being tested was likely optimized for previous gen configurations. It might very well favor a bit more computation over higher memory bandwidth usage.

Wait until developers optimize against DDR4 Skylake systems, you might start to see 5-10% difference at that point. Truly bandwidth limited code can run up to about 40% faster on a DDR4 system, assuming typical 1600 MHz DDR3 and 2400 MHz DDR4.

graycat · on Oct 19, 2015

Okay, do the usual: Add microcode to the processor cores to support more capable instructions so that can trigger a streams of all the data with nearly no time for addressing.

E.g., implement heap sift, heap sort, heap priority queue, substring search, of course, inner product accumulation, and whatever else looks promising, e.g., standard multi-dimensional array addressing, chasing down chains of pointers common in OO programming.

That is, have the machine instructions do larger chunks of work.

nkurz · on Oct 19, 2015

I'm guessing this is downvoted because it's no longer a viable solution. Microcode generates multiple µops with a single instruction, but the decoded µop cache is large enough (and efficient enough) that the decoding is almost never the bottleneck.

Worse, for anything in a loop it often actually slows things down by preventing the usual caching mechanisms from working. The instructions/µops are already where they need to be, but data dependencies prevent them from being executed in a timely manner.

What's needed instead are changes to the algorithms that allow for more instruction level parallelism. We need to overcome latency by creating assembly lines within the core rather than having each core do piece work.

graycat · on Oct 20, 2015

I thought that what I wrote was clear, simple, and drew heavily from some quite solid, old ideas that, however, don't seem to be popular now but do address the same old problem of main memory being too slow. I don't know where I was unclear.

> I'm guessing this is downvoted because it's no longer a viable solution. Microcode generates multiple µops with a single instruction, but the decoded µop cache is large enough (and efficient enough) that the decoding is almost never the bottleneck.

Sure. But the OP was talking about memory speed, not internal processor speed, from microcode or anything else.

By saying microcode, I was just trying to make the needed logic obviously doable. Now transistors are so cheap could do it in hardware.

The main point I was trying to get at was just the one in the OP -- memory too slow.

Well, memory can be darned fast, if talking just the memory. The way I see it, it's not that the memory itself is or currently has to be too slow electronically and, instead, it's that the darned addressing is too slow or there's too much of it.

E.g., to access a Fortran array with three subscripts, have to do the darned array calculation -- what is it, two multiplies and two adds starting with five numbers -- for each element of the array. Can spend more time calculating the address of the array component than spend on the data when get it. Yes, a decent Fortran compiler will not do that arithmetic just from the beginning for each component of the array, especially in a loop. Since C can't do such arrays without the programmer writing a macro, I have to wonder if C compilers are smart enough to save on the array addressing arithmetic like a Fortran compiler does.

Still, commonly, spend more time calculating addresses than doing the work. And to the processor core, the address calculations look just like just more instructions that might be part of something really complicated instead of something that has some simple patterns that can be exploited -- the hardware designer would see the patterns and exploit them in the hardware. So, the poor processor has to be absurdly myopic and just do what the heck it is being told to do.

Instead, for cases, say, heap sort, the fast Fourier transform, and more, just have just one instruction for heap sort and, then, have all that addressing logic in hardware and have it fast enough to keep memory fully busy. If electronically memory is still too slow, then have interleaved memory -- since the addressing is so simple and regular, the hardware implementation will know how to look ahead, much as in speculative execution now except there will be less or no speculating.

Some of this is now very old stuff and for just the reasons I suggested: So, super computing has long had an inner product instruction -- one instruction and get the whole inner product calculation done. That is, an inner product is the sum on i of x(i)y(i) and is just ubiquitous in scientific-engineering computing.

That is, generally the idea is to move some relatively simple ordinary instruction streams into hardware. Again, the idea is old, e.g., was used for the instruction extensions for handling images -- one instruction and, slam, bam, thank you ma'am, got some image processing code, that was maybe before 100 instructions, in a loop, done. So, get to save on fetching and decoding all those instructions and much of the addressing arithmetic they would do, and the addressing is so regular that the hardware gets to look ahead, e.g., which would exploit interleaved memory. And, in addition, might design the sending of read commands to main memory not just one at a time but as a list, boom, and with no more attention, waiting, synchronizing, hand shaking, the memory delivers all the data at all the addresses in the list. E.g., something like DMA for I/O. E.g., to find a sum of the numbers in an array, have a single instruction and have memory just send the data ASAP, much like in DMA for fast I/O -- on a machine with interleaved memory, say, 16 ways, that would just fly and scream at the same time.

For instruction level parallelism, some old work showed that with 24 way very long instruction word (VLIW) and just some compiler tweaks on ordinary code, could get 9:1 speedup. IIRC, Itanium was supposed to be a VLIW machine.

nkurz · on Oct 20, 2015

  > I don't know where I was unclear.

I don't think you were unclear. As respectively as possible (and I really do like your perspective and many of your other posts) I think the problem is that you were clear and wrong[1].

  > it's that the darned addressing is too slow or 
  > there's too much of it

Generally, no. Current processors have two dedicated address calculation ports that each calculate (ptr + index*size + const) in the same cycle that the request is issued. Separately, there are almost always unused arithmetic ports such that one could easily double the amount of other arithmetic without adding any additional latency. Address calculation is not a significant performance factor.

  > So, get to save on fetching and decoding all those 
  > instructions and much of the addressing arithmetic 
  > they would do

My argument is that these are almost never a bottleneck, and that removing them altogether will not produce a significant speed up.

  > If electronically memory is still too slow, then have 
  > interleaved memory -- since the addressing is so simple
  > and regular, the hardware implementation will know how
  > to look ahead, much as in speculative execution now 
  > except there will be less or no speculating.

Recent generations have been 2-, 3-, or 4-way interleaved. 6-way is promised for a (I think) 2018. Hardware prefetchers are excellent at getting out ahead of just about any regular pattern. The issue is that most software is designed to require unpredictable access patterns, and thus is latency sensitive.

  > And, in addition, might design the sending of read
  > commands to main memory not just one at a time but
  > as a list, boom, and with no more attention, waiting, 
  > synchronizing, hand shaking, the memory delivers all
  > the data at all the addresses in the list.

This is essentially the 'gather' instruction that has been supported in the last 3 generations of Intel processors. It's a single instruction that sends out parallel requests for 8 addresses and returns 8 32-bit values with about the same latency as a single request would take. To an first approximation, it's never used[2] to advantage, because it's no faster than issuing 8 requests serially while being less flexible.

Mostly, I'm suggesting that hardware currently supports far more performance than we are getting, because modern software is still designed as it it was running on in-order processors with a flat memory hierarchy. I agree with you that C (and other) compilers could do a better job, but mostly I think the issue is with the mindset of the programmers who are making it.

[1] I find that clear and wrong however is almost always preferable to unclear and wrong. And I reiterate: I don't mean this as an attack.

[2] I love it in concept, and hope to show that it actually can speed things up on the latest Skylake generation, but until now it's been mostly a bust.

graycat · on Oct 25, 2015

Clearly most of the processor design features I outlined and proposed were speculative.

E.g., at IBM's Watson lab, in the room next to mine a guy was looking at traces from deep in the processor hardware to estimate the speedups possible from various proposed and speculative features. He did this work because it is not easy to estimate, just from intuition, what features will give what speedups.

The VLIW data I reported was from some careful work by a guy across the hall from me.

Net, what I was proposing was not just wild guessing.

So, you are explaining that some of what I proposed has already been implemented.

Okay.