You mean you're not surprised that a machine with 8 GPUs, apparently costing $129k USD (from comment below), can outperform a single CPU? :)
(Of course, a better metric is that it's getting ~56x the performance at probably ~10x the TDP, but that's not surprising for a GPU with the current state of deep learning code.)
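The perf/watt point above is simple arithmetic; here's a quick sketch using the commenter's own estimates (both the ~56x and ~10x figures are guesses from the thread, not measured values):

```python
# Rough performance-per-watt comparison from the estimates above.
speedup = 56      # ~56x throughput vs. the CPU baseline (estimate)
tdp_ratio = 10    # ~10x the TDP of the CPU baseline (estimate)

perf_per_watt_advantage = speedup / tdp_ratio
print(f"~{perf_per_watt_advantage:.1f}x better performance per watt")  # ~5.6x
```

So even granting the unflattering TDP, the GPU box still comes out several times ahead per watt.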
To their credit, the thermal and power engineering needed to get that dense a compute deployment is challenging. (bt, dt, have the corpses of power supplies to show for it.) But at that price it's going to be limited to hyper-dense HPC deployments by companies that lack the resources to engineer their own for substantially less money, the way Facebook did with its Big Sur design: https://code.facebook.com/posts/1687861518126048/facebook-to... . And, of course, academics and hobbyists will continue to use consumer GPUs, which give much better performance/$ but aren't nearly as HPC-friendly.
To be fair, they are comparing it to a dual-socket CPU, which is twice as fair as comparing it to a single socket!!
What I was really getting at: I want to know the relative performance compared to another 8-Tesla box. I know comparing apples to apples isn't good marketing, but c'mon.
What kind of server pricing are you getting? Base servers are cheap, but add high-end Xeons and memory, not to mention interconnect, and I get something like seven decently configured 1U servers for $129K (dual 20-core with lots of RAM, 10GbE NICs, and mirrored boot/swap). No interconnect switching. That's for 20-core Haswell, because I don't yet have discount pricing for Broadwell Xeons. I'm sure one could do better at hyperscaler discount, but this is startup low-ish quantity.
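For comparison's sake, a back-of-the-envelope version of that pricing claim (the per-server figure is just the $129K budget divided by the seven servers quoted above; actual quotes will vary with discounts):

```python
# Back-of-the-envelope check: what does $129K buy in 1U Xeon servers,
# per the configuration described above (dual 20-core, lots of RAM,
# 10GbE NICs, mirrored boot/swap, no interconnect switching)?
budget = 129_000           # DGX-1 list price, USD
num_servers = 7            # commenter's estimate at startup quantities
per_server = budget / num_servers
print(f"~${per_server:,.0f} per 1U server")  # ~$18,429
```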
It looks like it uses a separate daughterboard that houses the GPUs + NVLink, connected to the main motherboard using quad InfiniBand EDR (400 Gbps) + RDMA. http://images.anandtech.com/doci/10225/SSP_85.JPG
The diagram is confusing, but the GPUs are connected to the NVLink matrix, which is connected to the motherboard via the PLX PCIe switches. The quad IB/dual 10GbE are separate I/O attached to the motherboard.
Though I'm most curious about what motherboard is in there to support NVLink and NVHS.
Good overview of Pascal here: https://devblogs.nvidia.com/parallelforall/inside-pascal/
1 question: will we see NVLink become an open standard for use in/with other coprocessors?
1 gripe: they give relative performance data as compared to a CPU -- of course it's faster than a CPU