NVLink is a GPU-to-GPU or GPU-to-CPU link. If you are using it as a GPU-to-CPU link, you are talking to the CPU at native NVLink speeds, either over a native NVLink bus if you are using a POWER CPU which supports it, or via an InfiniBand/OmniPath interconnect on Intel/AMD CPUs.
16GB/s is a bit tight, but that's where prefetching comes in: pre-defined usage hints, plus the branch prediction in the GPU and the CUDA compiler/runtime, feed the GPU with as little stalling as possible.
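Just to make the "feed it without stalling" part concrete, here's a minimal double-buffering sketch (the process_chunk kernel and the chunk sizes are made up for illustration) where copies on one stream overlap with compute on the other:

    // feed.cu -- minimal double-buffering sketch; process_chunk stands in
    // for whatever per-chunk work the DB kernel would actually do.
    #include <cuda_runtime.h>
    #include <cstdio>

    #define CHUNK   (1 << 20)   // 1M floats per chunk (arbitrary)
    #define NCHUNKS 64

    __global__ void process_chunk(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;   // trivial placeholder computation
    }

    int main() {
        float* h_data;                       // pinned so copies can be async
        cudaMallocHost(&h_data, (size_t)NCHUNKS * CHUNK * sizeof(float));

        float *d_in[2], *d_out[2];
        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) {
            cudaMalloc(&d_in[i],  CHUNK * sizeof(float));
            cudaMalloc(&d_out[i], CHUNK * sizeof(float));
            cudaStreamCreate(&s[i]);
        }

        // While one stream computes on chunk c, the other is already copying
        // chunk c+1 over the bus, so the link and the SMs stay busy together.
        for (int c = 0; c < NCHUNKS; ++c) {
            int b = c & 1;
            cudaMemcpyAsync(d_in[b], h_data + (size_t)c * CHUNK,
                            CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<(CHUNK + 255) / 256, 256, 0, s[b]>>>(d_in[b], d_out[b], CHUNK);
        }
        cudaDeviceSynchronize();
        printf("processed %d chunks\n", NCHUNKS);
        return 0;
    }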
Of course you can make a GPU I/O bound; that's actually pretty easy these days even in normal (gaming) workloads (e.g. the draw-call limit).
Let's say you are building a 4-GPU DB. If you are using NVLink, each GPU can see the entire memory space of the other GPUs connected on the same NVLink fabric (there is also RDMA, which can be done over the network, but that is a more complicated scenario). In this case you have 96GB of high-speed VRAM plus whatever amount of memory you can put on your CPU, which can be up to 1.5TB or so per socket these days. You also have 64GB/s of bandwidth to feed that 96GB of memory, which means it takes about a second and a half to refill it completely (and I'm still not sure that the same lossless compression schemes GPUs use these days won't work for your data, which would give you much higher effective bandwidth since the compression/decompression is free on the GPU side).
With these figures I can't see a reason why you can't optimize your memory residency to get the best of both worlds: fast key-value/hash or index lookups for I/O-bound tasks, and programmatic prefetching of the workload (or, if the branch prediction is good enough, just letting it roll) for processing-bound tasks.
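The "each GPU sees the other GPUs' memory" part is plain CUDA peer access; a toy sketch (the device numbering and the little reduction kernel are just for illustration) looks roughly like this:

    // peer.cu -- sketch: a kernel on GPU 0 directly reads a buffer that lives
    // in GPU 1's VRAM (over NVLink where available, otherwise PCIe).
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void sum_remote(const float* remote, float* result, int n) {
        // 'remote' physically lives on the other GPU; reads go over the link.
        float acc = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x) acc += remote[i];
        atomicAdd(result, acc);
    }

    int main() {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, 0, 1);
        if (!can) { printf("no peer access between GPUs 0 and 1\n"); return 1; }

        const int n = 1 << 20;

        cudaSetDevice(1);                    // buffer allocated in GPU 1's VRAM
        float* d_data1;
        cudaMalloc(&d_data1, n * sizeof(float));
        cudaMemset(d_data1, 0, n * sizeof(float));

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // GPU 0 may now map GPU 1's memory
        float* d_sum0;
        cudaMalloc(&d_sum0, sizeof(float));
        cudaMemset(d_sum0, 0, sizeof(float));

        sum_remote<<<1, 256>>>(d_data1, d_sum0, n);   // runs on GPU 0
        cudaDeviceSynchronize();

        float sum = 0.0f;
        cudaMemcpy(&sum, d_sum0, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %f\n", sum);
        return 0;
    }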
So yes, it's not "perfect", but nothing ever will be. The question is whether it is much faster, or even orders of magnitude faster, for certain tasks/workloads/implementations than a CPU-only DB, and clearly it is.
How well it would compete against other contenders like Intel's Xeon Phis, programmable FPGAs, and other emerging technologies that aim to solve this problem from another direction, I don't know.
Even if you have a super-fast interconnect, you can only fill it as fast as something else can send the data, right?
I mean, if it's in CPU memory then you are limited to whatever the CPU memory can deliver, which is 40 GB/s per socket from quad-channel DDR4, more or less. You can probably double that with compression, but it's still vastly less than you'd get from VRAM.
And you get the same speedups from storing the data compressed in VRAM assuming you have some fast method for indexing into the stream, or you're searching the whole stream. But yes, assuming your data can be dehydrated and then stored in CPU memory you can squeeze a little more bandwidth out of it this way.
If it's GPU memory (in another GPU across NVLink/GPUDirect/etc), then you should be doing the searching on that GPU instead.
The stipulation here has always been "tasks that don't fit into GPU memory aren't going to work" so if you are assuming that the data lives on another GPU... then you are violating the pre-condition of all of this discussion, because it fits into GPU memory.
Yes, if your data fits into GPU memory it's very fast, because VRAM bandwidth is enormous. The catch is always getting it to fit into GPU memory. That's why you need a "tiered" system - indexes or important columns live on the GPU, then you return them to the CPU where you do additional processing with them. It's like L1/L2/L3/memory hierarchy on a CPU - sure, your program that fits into cache is super fast, but many real-world tasks don't fit into cache.
If your entire dataset is small enough that it fits in VRAM on one box's worth of GPUs (4-8 per box), or within a few racks or whatever... that's great, go hog wild. But it's not very cost effective versus a tiered approach, this is like asking for a CPU with 16 GB of cache (or enough CPUs to add up to 16 GB of cache). It's outright impossible for very large datasets (96-192 GB is not a very large dataset in this context, but it's a reasonable amount of index space for a 1.5 TB or 3 TB dataset on the CPU socket).
Once you start having to transfer stuff into and out of VRAM, GPU performance typically starts to rapidly degrade. It's "only a second or two" to you... but that's a terabyte's worth of VRAM bandwidth that sat idle for that time. "Rolling" approaches like this do not work very well on GPUs. You want to avoid transferring stuff on or off as much as possible because you just can't do it fast enough.
Computational intensity is one way to avoid doing that, both in VRAM and to host memory. If you can transfer a byte (per core) every P cycles, and you perform at least P cycles' worth of computation on it... you're compute bound. And at that point you can probably make a "rolling" algorithm work. But the problem is doing that (without just being obviously wasteful). In practice, P is something like 72 cycles per byte (working from memory here). It's really hard to find that much work to do on a byte in many cases. So despite their massive compute power... GPUs are often I/O bound on most tasks. Crypto tasks (hashing) are a notable exception.
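Back-of-envelope version of that, with assumed peak numbers (they'll vary by card), just to show how steep the requirement gets once the data has to cross NVLink or PCIe instead of already sitting in VRAM:

    // intensity.cu (host-side math only) -- rough ops-per-byte needed before
    // the ALUs, rather than a given link, become the bottleneck. All peak
    // numbers here are assumptions for illustration, not measurements.
    #include <cstdio>

    int main() {
        const double flops     = 10e12;   // ~10 TFLOPS FP32 (assumed)
        const double vram_bw   = 700e9;   // ~700 GB/s VRAM (assumed)
        const double nvlink_bw = 64e9;    // ~64 GB/s aggregate NVLink (assumed)
        const double pcie_bw   = 16e9;    // ~16 GB/s PCIe 3.0 x16 (assumed)

        // Fewer operations per byte than this and that tier's bandwidth,
        // not the compute, sets your throughput.
        printf("ops/byte to hide VRAM:   ~%.0f\n", flops / vram_bw);    // ~14
        printf("ops/byte to hide NVLink: ~%.0f\n", flops / nvlink_bw);  // ~156
        printf("ops/byte to hide PCIe:   ~%.0f\n", flops / pcie_bw);    // ~625
        return 0;
    }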
Note that I/O bound here can happen in different areas too. You can be I/O bound over the GPU bus, or on the VRAM. It depends on where the piece of data you need lives.
Also, in gaming it's typically the CPU which is bound in draw-calls. The CPU can't assemble the command lists fast enough to saturate the GPU (at least not on a single thread). I'm not sure I'd say it's necessarily I/O bound in this case, either, but it's possible.
Normally GPUs will only use a fraction of their VRAM bandwidth while gaming - however, like anything else, throwing superfluous resources at it will still produce a speedup even if it's not really "the KEY bottleneck". Having memory calls return in fewer cycles will let your GPU get its shaders back to work quicker, even though it means the memory may be idle for a greater percent of time (i.e. utilization continues to fall).
Assuming you were going to do a DB where you store literally everything on the GPU - I'm not sure whether it would be better to go columnar or by row. In a columnar approach you would have one (or more) GPUs per column, and they would either sequentially broadcast their resulting rowIDs and all perform set-intersection on their own datasets, or dump their set of output rowIDs to a single "master" GPU which would find the set intersection. In a row approach, you would have each GPU find "candidate" row-IDs and then dump them to the CPU, and the CPU handles broadcasting and intersecting (since hopefully it's not a large set-intersection).
Columnar might be faster for supported queries, but row would be simpler to implement and more flexible (since the critical data processing would more or less be taking place on the CPU).
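For the set-intersection step itself I'd probably reach for something like Thrust first; a toy sketch with made-up row IDs (standing in for what two column scans, or two GPUs, would emit) would look like this:

    // intersect.cu -- toy sketch of the intersection step: two sorted lists
    // of candidate row IDs are intersected on the device with Thrust.
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/set_operations.h>
    #include <thrust/sort.h>
    #include <algorithm>
    #include <cstdio>

    int main() {
        int h_a[] = {1, 4, 7, 9, 42, 100};        // made-up candidate rowIDs
        int h_b[] = {4, 9, 50, 100, 512};
        thrust::device_vector<int> rows_a(h_a, h_a + 6);
        thrust::device_vector<int> rows_b(h_b, h_b + 5);

        // set_intersection needs sorted inputs; predicate scans may not
        // emit them in order, so sort first.
        thrust::sort(rows_a.begin(), rows_a.end());
        thrust::sort(rows_b.begin(), rows_b.end());

        thrust::device_vector<int> out(std::min(rows_a.size(), rows_b.size()));
        auto end = thrust::set_intersection(rows_a.begin(), rows_a.end(),
                                            rows_b.begin(), rows_b.end(),
                                            out.begin());
        out.resize(end - out.begin());

        thrust::host_vector<int> result = out;    // copy back for printing
        for (size_t i = 0; i < result.size(); ++i)
            printf("%d ", result[i]);             // prints: 4 9 100
        printf("\n");
        return 0;
    }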
I would have to play with it to be sure, and I don't have access to a cluster of GPUs anymore (and never had access to one with NVLink or GPUDirect).
But the AST is effectively a "tiered system". Let's say you have a 20TB DB: 96GB fits in VRAM, 2TB fits in your RAM, and 17TB and change are on your disks.
Today the GPU can see this as a single contiguous memory space: it does the AST and handles page faults when needed, and tries to optimize the execution of each task by prefetching the needed data into VRAM, either by itself through branch prediction or via the usage hints you program yourself.
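In CUDA terms, that hints-plus-prefetch path looks roughly like this (a sketch with an arbitrary 1 GiB managed allocation and a trivial kernel, not anything tuned):

    // um_prefetch.cu -- sketch of the hints + prefetch path: a managed
    // allocation is homed on the GPU and explicitly prefetched before the
    // kernel runs, instead of paging in on demand.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void touch(float* data, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;          // trivial placeholder work
    }

    int main() {
        const size_t n = (size_t)1 << 28;    // 2^28 floats = 1 GiB (arbitrary)
        const size_t bytes = n * sizeof(float);
        int dev = 0;
        cudaSetDevice(dev);

        float* data;
        cudaMallocManaged(&data, bytes);     // one address space, CPU + GPU
        cudaMemset(data, 0, bytes);

        // Usage hints: home the pages on the GPU, but keep a CPU mapping so
        // host reads don't force everything to migrate back.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, dev);
        cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

        // Explicit prefetch so the kernel doesn't start out taking page faults.
        cudaMemPrefetchAsync(data, bytes, dev, 0);

        touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
        cudaDeviceSynchronize();

        // Prefetch back before the CPU walks the result.
        cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);
        cudaFree(data);
        return 0;
    }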
Here is a short paper/write-up on unified memory in Pascal. Look at the difference between relying on page faults alone and optimizing/profiling your memory usage with prefetch hints: with a data set 4 times the maximum GPU memory, performance drops by about 3.8-4x when you rely on page faults alone, versus losing less than 50% of your performance with the hints while still going over 4 times your maximum available VRAM.
The same thing can be done on cards with 24GB, and the magic is that the performance drop actually levels off at about 2-3 times your maximum available VRAM, which is why the difference between 2x and 4x over-allocation isn't that big in terms of performance hit relative to the over-allocation ratio; so 8 times your amount of memory is not that far behind the penalty you already pay for using 4 times more.
And yes, there is no scenario in which you do not lose performance, but the unified memory solution in NVIDIA GPUs is a very good way to reduce the penalty, and if you really need to go balls to the wall you'll actually benefit from the scaling.