I try to refrain from purely negative commentary on articles, but yuck! How can one hope to say anything useful about "memory performance" without once using the terms "latency" or "bandwidth"? Memory performance is getting higher in the same way that processor performance is going up:
Latency is to frequency as bandwidth is to parallelism. Single core CPU frequencies are relatively stable, but parallelism offers the opportunity to get more done each cycle. Latency for random access is holding quite steady, but caches are getting bigger and faster, and bandwidth is going up.
The key is figuring out how to write software that takes advantage of spatial locality and available bandwidth rather than getting choked by the latency. This is hard in the same way that taking advantage of multiple cores is hard: it requires a different approach, but is not a fundamental limitation.
Lots of memory access isn't the problem. Long latency isn't the problem. The problem is designing your program so that it generates lots of long latency memory accesses and then grinds to a halt in the presence of this latency. I think the summary from this recent paper is spot on:
Our conclusion is contrary to our expectations and to previous findings and goes against conventional wisdom, which states that accesses to RAM are slow, and should be minimized. A more accurate statement, that accounts for our findings, is that accesses to RAM have high latency and this latency needs to be mitigated.
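To make the distinction concrete, here is a small illustrative sketch (my own, not from the paper): the same array summed two ways. Sequential traversal streams whole cache lines and lets the hardware prefetcher hide latency; chasing a chain of indices makes every load's address depend on the previous load, so the core eats the full memory latency on each step.

```c
#include <stddef.h>

/* Bandwidth-friendly: independent, sequential loads the hardware
   prefetcher can run ahead of. */
long sum_sequential(const int *data, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    return s;
}

/* Latency-bound: each iteration's address comes from the previous
   load.  'next' holds a permutation of 0..n-1; with a randomly
   shuffled permutation this runs many times slower than the loop
   above, despite touching exactly the same data. */
long sum_chase(const int *data, const size_t *next, size_t n) {
    long s = 0;
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        s += data[j];
        j = next[j];   /* the next address depends on this load */
    }
    return s;
}
```

Both functions compute the same sum; only the access pattern differs, which is exactly the point: the data and the total work are identical, but one version exposes the full DRAM latency on every step.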
> I try to refrain from purely negative commentary on articles, but yuck! How can one hope to say anything useful about "memory performance" without once using the terms "latency" or "bandwidth"?
Yeah, it is pretty clear the author didn't really understand much of the topic. Even if you have lightning-fast memory, there's still the issue of connecting it. Long traces and multiple levels of multiplexing are going to add latency no matter how amazing the memory tech employed. Before long, to increase bandwidth, memory will need to be integrated into the same package as the CPU, if not onto the same die -- at least as large eDRAM-style 'caches'. You can't economically run 1024 traces just for memory on the mainboard PCB!
> The key is figuring out how to write software that takes advantage of spatial locality and available bandwidth rather than getting choked by the latency. This is hard in the same way that taking advantage of multiple cores is hard: it requires a different approach, but is not a fundamental limitation.
Indeed, of course this requires skilled labor for now, until compilers catch up one day. And unfortunately, code that truly requires random access is just not going to perform well on modern hardware. Up until the early nineties, memory was faster than processors, and random access was just fine. Not so anymore.
It's also easy to get bandwidth limited with SSE and AVX, although line fill buffers and the like often seem to bottleneck first on a per-core basis. For scalar code, becoming bandwidth limited is just not going to happen.
The issue nowadays is that machines are such unique snowflakes, performance- and configuration-wise. It's not hard to max out a single configuration, but it is hard to write something that performs decently across different system configurations. There are something like 30 instruction set extensions for x86, variations in reorder buffer depth, variations in cache latency, size, and associativity -- and of course memory bus configurations, the number and interleaving of memory channels, NUMA, DRAM page size (1, 2, 4 kB), etc.
For example, unlike the 8-way associative L2 cache on Sandy Bridge, Haswell, etc., Skylake's L2 is now 4-way. Code that was tuned for an 8-way L2 might be pathologically evicting L2 cache lines on Skylake.
Those aspects matter, because high performance is often a balancing act between available features, bandwidth and CPU power.
And while you're at it, it's hard to figure out what's going on in your system. I've spent some time looking, and have not yet found anything resembling a bandwidth monitor for main memory.
It's possible, but depending on your OS and processor it might be quite difficult. There are "Uncore" performance counters for this, but the means of access has been changing from generation to generation and the software sometimes lags behind.
Just as a side note -- after working at an RF networking company for 10+ years, I find the phrase "bandwidth" (which in RF is literally the "width of the band", e.g., 200 kHz) endlessly confusing, particularly when you hear engineers use phrases like, "we've got more bandwidth resulting in greater range but lower data rate with this signaling scheme."
The RF PHY/MAC engineers use "data rate" and "bandwidth" 100% consistently, while everyone else I've ever met uses "bandwidth" as a synonym for "data rate". I'm surprised there isn't more confusion between the two groups of people.
At least all groups measure data rate and "bandwidth" in megabits/sec == 10^6 bits/second. Thankfully that confusion never entered the data comms world.
There is nothing hard about this. You either have a "problem" that can take advantage of multiple cores or of sequential reads from memory, or you don't. The only way around this is to change the "problem".
A much bigger problem "holding up Moore's Law Progression" is the failure of Dennard scaling and the fact that voltage scaling is hitting the threshold voltage limit (where sub-threshold leakage current increases significantly) as we move to smaller technology nodes. This means we can build bigger chips but we don't necessarily have the power budget to power up all parts of it at the same time (these could be cores, pipeline structures, etc). The architecture community has written a lot on this "Dark Silicon" problem if anyone wants to read further.
I'm not certain who this article is targeting, but it reminds me of an acquaintance who would ask me detailed technical questions, misunderstand all my answers, and then two days later misinform me of the things I told him as if he were teaching me.
Moore's 1965 paper was about transistor count doubling every 12 months rather than frequency but "Moore's Law" didn't come into use as a term until 1975, by which time Moore was giving shrinking feature sizes equal billing[1]. And Moore himself wrote a memo endorsing a more general use of the term "Moore's Law" for any chip performance metric that doubles regularly. I can't find a copy online but I have it in my old Computer Architecture lecture notes. And performance and scaling were identical as long as Dennard scaling[2] lasted.
In the past, the increase of transistor count and density meant that a CPU could be designed to run at a higher frequency and have cleverer circuitry which allowed it to do more each cycle. This is the source of the misconception that Moore's Law is directly tied to performance.
Looking at a modern Intel chip, the CPU cores take about 30% of the area, with the rest of the transistors spent elsewhere (L3 cache, integrated graphics, system and memory controllers).
When you're talking about memory performance you always have to include the latency, the bandwidth, and the size of the memory pool. 64 KB memory pools have scaled in latency and bandwidth at the same rate as processing power -- now they're sitting deep in the heart of the chip as the L1 cache.
Server CPUs have more than twice the memory channels of consumer CPUs, and IBM's Power CPUs show that it's possible to get even more memory bandwidth than mainstream Xeons offer.
It looks like low RAM bandwidth in consumer CPUs is mostly an artificial differentiator to discourage use of those parts in servers.
On the other hand, there are the HMC and HBM technologies, which offer an order of magnitude more bandwidth and several times lower latency. They are already used in AMD GPUs as well as in prototypes of Nvidia's Pascal GPU and Intel's Knights Landing many-core CPU http://www.theplatform.net/2015/03/25/more-knights-landing-x...
I hope HBM comes to consumer CPUs too, but with the current lack of competition in the market it could take a long time.
Since Skylake comes in both DDR4 and DDR3 motherboard versions, various benchmarks have come out testing the difference for the state-of-the-art 14 nm CPU with the "improved" memory vs the "old" memory.
And the difference is often only 1-2%.
Maybe the goal should instead be to put 32 GB of memory right on the CPU die.
Unsurprisingly, if the software in question is not bandwidth limited, providing more bandwidth is not going to speed it up. Most software is like that.
It also truly depends on how the software was optimized. The software being tested was likely optimized for previous-generation configurations, and it might very well favor a bit more computation over higher memory bandwidth usage.
Wait until developers optimize against DDR4 Skylake systems; you might start to see a 5-10% difference at that point. Truly bandwidth-limited code can run up to about 40% faster on a DDR4 system, assuming typical 1600 MHz DDR3 and 2400 MHz DDR4.
Okay, do the usual: add microcode to the processor cores to support more capable instructions, so that one instruction can trigger streams of all the data with nearly no time spent on addressing. E.g., implement heap sift, heap sort, heap priority queue, substring search, and of course inner product accumulation, and whatever else looks promising, e.g., standard multi-dimensional array addressing and chasing down the chains of pointers common in OO programming.
That is, have the machine instructions do larger chunks of work.
I'm guessing this is downvoted because it's no longer a viable solution. Microcode generates multiple µops with a single instruction, but the decoded µop cache is large enough (and efficient enough) that the decoding is almost never the bottleneck.
Worse, for anything in a loop it often actually slows things down by preventing the usual caching mechanisms from working.
The instructions/µops are already where they need to be, but data dependencies prevent them from being executed in a timely manner.
What's needed instead are changes to the algorithms that allow for more instruction level parallelism. We need to overcome latency by creating assembly lines within the core rather than having each core do piece work.
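One way to picture the "assembly line" point is this illustrative sketch (mine, not the commenter's): a single accumulator forms one long dependency chain, so each add must wait for the previous one; splitting the work across four accumulators gives the out-of-order core four independent chains it can keep in flight at once.

```c
#include <stddef.h>

/* One dependency chain: every add waits on the result of the last. */
double sum_one_chain(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: the out-of-order core can overlap them,
   hiding each add's latency behind the other three chains. */
double sum_four_chains(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

The same transformation applies to overlapping memory requests: several independent load chains in flight use the available bandwidth, while one serial chain pays full latency per element.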
I thought that what I wrote was clear, simple, and drew heavily on some quite solid, old ideas that don't seem to be popular now but do address the same old problem of main memory being too slow. I don't know where I was unclear.
> I'm guessing this is downvoted because it's no longer a viable solution. Microcode generates multiple µops with a single instruction, but the decoded µop cache is large enough (and efficient enough) that the decoding is almost never the bottleneck.
Sure. But the OP was talking about memory speed, not internal processor speed, from microcode or anything else. By saying microcode, I was just trying to make the needed logic obviously doable. Now that transistors are so cheap, we could do it in hardware. The main point I was trying to get at was just the one in the OP -- memory is too slow.
Well, memory can be darned fast, if we're talking about just the memory. The way I see it, it's not that the memory itself is, or currently has to be, electronically too slow; instead, it's that the darned addressing is too slow, or there's too much of it.
E.g., to access a Fortran array with three subscripts, you have to do the darned array index calculation -- what is it, two multiplies and two adds starting with five numbers -- for each element of the array. You can spend more time calculating the address of the array component than you spend on the data once you get it.
Yes, a decent Fortran compiler will not redo that arithmetic from scratch for each component of the array, especially in a loop. Since C can't do such arrays without the programmer writing a macro, I have to wonder if C compilers are smart enough to save on the array addressing arithmetic like a Fortran compiler does.
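The index arithmetic in question looks like this (an illustrative sketch with made-up dimensions): for a row-major array a[N1][N2][N3], element (i,j,k) lives at offset (i*N2 + j)*N3 + k -- the two multiplies and two adds just described. The second function shows what strength reduction effectively turns the loop into.

```c
enum { N1 = 3, N2 = 4, N3 = 5 };

/* Naive: the full offset recomputed for every element. */
long sum_full_index(const int *a) {
    long s = 0;
    for (int i = 0; i < N1; i++)
        for (int j = 0; j < N2; j++)
            for (int k = 0; k < N3; k++)
                s += a[(i * N2 + j) * N3 + k];  /* 2 muls, 2 adds each */
    return s;
}

/* After strength reduction: one pointer that just increments --
   no per-element multiplies at all. */
long sum_reduced(const int *a) {
    long s = 0;
    for (int n = 0; n < N1 * N2 * N3; n++)
        s += *a++;
    return s;
}
```

Modern C and Fortran compilers do perform this hoisting in loops like the one above; the cost the commenter describes bites mainly when the access pattern is irregular and the offset genuinely must be recomputed per element.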
Still, commonly you spend more time calculating addresses than doing the work. And to the processor core, the address calculations look just like more instructions that might be part of something really complicated, instead of something with simple patterns that could be exploited -- a hardware designer would see the patterns and exploit them in hardware. So the poor processor has to be absurdly myopic and just do what the heck it is being told to do.
Instead, for cases like, say, heap sort, the fast Fourier transform, and more, just have one instruction for heap sort and then have all that addressing logic in hardware, fast enough to keep memory fully busy. If electronically the memory is still too slow, then have interleaved memory -- since the addressing is so simple and regular, the hardware implementation will know how to look ahead, much as in speculative execution now, except there will be less or no speculating.
Some of this is now very old stuff, and for just the reasons I suggested: supercomputing has long had an inner product instruction -- one instruction and the whole inner product calculation gets done. That is, an inner product is the sum over i of x(i)y(i) and is just ubiquitous in scientific-engineering computing.
That is, generally the idea is to move some relatively simple, ordinary instruction streams into hardware. Again, the idea is old, e.g., it was used for the instruction set extensions for handling images -- one instruction and, slam, bam, thank you ma'am, some image processing code that was maybe 100 instructions in a loop gets done. So you get to save on fetching and decoding all those instructions and on much of the addressing arithmetic they would do, and the addressing is so regular that the hardware gets to look ahead, e.g., to exploit interleaved memory.
And, in addition, one might design the sending of read commands to main memory not just one at a time but as a list -- boom, and with no more attention, waiting, synchronizing, or handshaking, the memory delivers all the data at all the addresses in the list, something like DMA for I/O. E.g., to find the sum of the numbers in an array, have a single instruction and have the memory just send the data ASAP, much like DMA for fast I/O -- on a machine with interleaved memory, say, 16 ways, that would just fly and scream at the same time.
For instruction level parallelism, some old work showed that with a 24-way very long instruction word (VLIW) machine and just some compiler tweaks on ordinary code, one could get a 9:1 speedup. IIRC, Itanium was supposed to be a VLIW machine.
I don't think you were unclear. As respectfully as possible (and I really do like your perspective and many of your other posts), I think the problem is that you were clear and wrong[1].
> it's that the darned addressing is too slow or
> there's too much of it
Generally, no. Current processors have two dedicated address calculation ports that each calculate (ptr + index*size + const) in the same cycle that the request is issued. Separately, there are almost always unused arithmetic ports such that one could easily double the amount of other arithmetic without adding any additional latency. Address calculation is not a significant performance factor.
> So, get to save on fetching and decoding all those
> instructions and much of the addressing arithmetic
> they would do
My argument is that these are almost never a bottleneck, and that removing them altogether will not produce a significant speed up.
> If electronically memory is still too slow, then have
> interleaved memory -- since the addressing is so simple
> and regular, the hardware implementation will know how
> to look ahead, much as in speculative execution now
> except there will be less or no speculating.
Recent generations have been 2-, 3-, or 4-way interleaved, and 6-way is promised for (I think) 2018. Hardware prefetchers are excellent at getting out ahead of just about any regular pattern. The issue is that most software is designed to require unpredictable access patterns, and thus is latency sensitive.
> And, in addition, might design the sending of read
> commands to main memory not just one at a time but
> as a list, boom, and with no more attention, waiting,
> synchronizing, hand shaking, the memory delivers all
> the data at all the addresses in the list.
This is essentially the 'gather' instruction that has been supported in the last 3 generations of Intel processors. It's a single instruction that sends out parallel requests for 8 addresses and returns 8 32-bit values with about the same latency as a single request would take. To a first approximation, it's never used[2] to advantage, because it's no faster than issuing 8 requests serially while being less flexible.
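For readers who haven't met it, here is a scalar sketch of what the AVX2 gather (vpgatherdd) does semantically -- plain C emulation only, not the intrinsic itself, and it says nothing about how the hardware issues the loads:

```c
#include <stdint.h>

/* Emulates the semantics of an 8-lane 32-bit gather: one conceptual
   operation that performs eight index-based loads from 'base'.  The
   real instruction is a single opcode that issues these itself. */
void gather8(const int32_t *base, const int32_t idx[8], int32_t out[8]) {
    for (int lane = 0; lane < 8; lane++)
        out[lane] = base[idx[lane]];
}
```

The point of the surrounding comment is that on the generations in question the real instruction completes its eight loads in roughly the time of one, yet in practice rarely beats eight ordinary serial loads.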
Mostly, I'm suggesting that hardware currently supports far more performance than we are getting, because modern software is still designed as if it were running on in-order processors with a flat memory hierarchy. I agree with you that C (and other) compilers could do a better job, but mostly I think the issue is with the mindset of the programmers who are using them.
[1] I find that clear and wrong however is almost always preferable to unclear and wrong. And I reiterate: I don't mean this as an attack.
[2] I love it in concept, and hope to show that it actually can speed things up on the latest Skylake generation, but until now it's been mostly a bust.
Clearly most of the processor design features I outlined and proposed were speculative.
E.g., at IBM's Watson lab, in the room next to mine, a guy was looking at traces from deep in the processor hardware to estimate the speedups possible from various proposed and speculative features. He did this work because it is not easy to estimate, just from intuition, what features will give what speedups. The VLIW data I reported was from some careful work by a guy across the hall from me.
Net, what I was proposing was not just wild guessing. So, you are explaining that some of what I proposed has already been implemented.
http://www.corsair.com/en-us/blog/2015/september/ddr3_vs_ddr...
http://arxiv.org/pdf/1509.05053v1.pdf