NVLink is a GPU-to-GPU or GPU-to-CPU link. If you are using it as a GPU-to-CPU link, you are talking to the CPU at native NVLink speeds, either over a native NVLink bus if you are using a POWER CPU which supports it, or via an InfiniBand/OmniPath interconnect on Intel/AMD CPUs.
16GB/s is a bit tight, but that's where prefetching comes in: pre-defined usage hints, plus the branch prediction in the GPU and the CUDA compiler/runtime, feed the GPU with as little stalling as possible.
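Just to make the "feed it without stalling" part concrete, here's a minimal double-buffering sketch (the process_chunk kernel and the chunk sizes are made up for illustration) where copies on one stream overlap with compute on the other:

    // feed.cu -- minimal double-buffering sketch; process_chunk stands in
    // for whatever per-chunk work the DB kernel would actually do.
    #include <cuda_runtime.h>
    #include <cstdio>

    #define CHUNK   (1 << 20)   // 1M floats per chunk (arbitrary)
    #define NCHUNKS 64

    __global__ void process_chunk(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;   // trivial placeholder computation
    }

    int main() {
        float* h_data;                       // pinned so copies can be async
        cudaMallocHost(&h_data, (size_t)NCHUNKS * CHUNK * sizeof(float));

        float *d_in[2], *d_out[2];
        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) {
            cudaMalloc(&d_in[i],  CHUNK * sizeof(float));
            cudaMalloc(&d_out[i], CHUNK * sizeof(float));
            cudaStreamCreate(&s[i]);
        }

        // While one stream computes on chunk c, the other is already copying
        // chunk c+1 over the bus, so the link and the SMs stay busy together.
        for (int c = 0; c < NCHUNKS; ++c) {
            int b = c & 1;
            cudaMemcpyAsync(d_in[b], h_data + (size_t)c * CHUNK,
                            CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<(CHUNK + 255) / 256, 256, 0, s[b]>>>(d_in[b], d_out[b], CHUNK);
        }
        cudaDeviceSynchronize();
        printf("processed %d chunks\n", NCHUNKS);
        return 0;
    }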
Of course you can make a GPU I/O bound; that's actually pretty easy these days even in normal (gaming) workloads (e.g. the draw-call limit).
Let's say you are building a 4-GPU DB. If you are using NVLink, each GPU can see the entire memory space of the other GPUs connected on the same NVLink fabric (there is also RDMA, which can be done over the network, but that is a more complicated scenario). In this case you have 96GB of high-speed VRAM plus whatever amount of memory you can put on your CPU, which can be up to 1.5TB or so per socket these days. You also have 64GB/s of bandwidth to feed that 96GB of memory, which means it takes about a second and a half to refill it completely (and I'm still not sure that the same lossless compression schemes GPUs use these days won't work for your data, which would give you much higher effective bandwidth since the compression/decompression is free on the GPU side).
With these figures I can't see a reason why you can't optimize your memory residency to get the best of both worlds: fast key-value/hash or index lookups for I/O-bound tasks, and programmatic prefetching of the workload (or, if the branch prediction is good enough, just letting it roll) for processing-bound tasks.
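The "each GPU sees the other GPUs' memory" part is plain CUDA peer access; a toy sketch (the device numbering and the little reduction kernel are just for illustration) looks roughly like this:

    // peer.cu -- sketch: a kernel on GPU 0 directly reads a buffer that lives
    // in GPU 1's VRAM (over NVLink where available, otherwise PCIe).
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void sum_remote(const float* remote, float* result, int n) {
        // 'remote' physically lives on the other GPU; reads go over the link.
        float acc = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x) acc += remote[i];
        atomicAdd(result, acc);
    }

    int main() {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, 0, 1);
        if (!can) { printf("no peer access between GPUs 0 and 1\n"); return 1; }

        const int n = 1 << 20;

        cudaSetDevice(1);                    // buffer allocated in GPU 1's VRAM
        float* d_data1;
        cudaMalloc(&d_data1, n * sizeof(float));
        cudaMemset(d_data1, 0, n * sizeof(float));

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // GPU 0 may now map GPU 1's memory
        float* d_sum0;
        cudaMalloc(&d_sum0, sizeof(float));
        cudaMemset(d_sum0, 0, sizeof(float));

        sum_remote<<<1, 256>>>(d_data1, d_sum0, n);   // runs on GPU 0
        cudaDeviceSynchronize();

        float sum = 0.0f;
        cudaMemcpy(&sum, d_sum0, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %f\n", sum);
        return 0;
    }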
So yes, it's not "perfect", but nothing ever will be. The question is whether it is much faster, or even orders of magnitude faster, for certain tasks/workloads/implementations than a CPU-only DB, and clearly it is.
How well it would compete against other contenders like Intel's Xeon Phis, programmable FPGAs, and other emerging technologies that aim to solve this problem from another direction, I don't know.
Even if you have a super-fast interconnect, you can only fill it as fast as something else can send the data, right?
I mean, if it's in CPU memory then you are limited to whatever the CPU memory can deliver, which is 40 GB/s per socket from quad-channel DDR4, more or less. You can probably double that with compression, but it's still vastly less than you'd get from VRAM.
And you get the same speedups from storing the data compressed in VRAM assuming you have some fast method for indexing into the stream, or you're searching the whole stream. But yes, assuming your data can be dehydrated and then stored in CPU memory you can squeeze a little more bandwidth out of it this way.
If it's GPU memory (in another GPU across NVLink/GPUDirect/etc), then you should be doing the searching on that GPU instead.
The stipulation here has always been "tasks that don't fit into GPU memory aren't going to work" so if you are assuming that the data lives on another GPU... then you are violating the pre-condition of all of this discussion, because it fits into GPU memory.
Yes, if your data fits into GPU memory it's very fast, because VRAM bandwidth is enormous. The catch is always getting it to fit into GPU memory. That's why you need a "tiered" system - indexes or important columns live on the GPU, then you return them to the CPU where you do additional processing with them. It's like L1/L2/L3/memory hierarchy on a CPU - sure, your program that fits into cache is super fast, but many real-world tasks don't fit into cache.
If your entire dataset is small enough that it fits in VRAM on one box's worth of GPUs (4-8 per box), or within a few racks or whatever... that's great, go hog wild. But it's not very cost effective versus a tiered approach, this is like asking for a CPU with 16 GB of cache (or enough CPUs to add up to 16 GB of cache). It's outright impossible for very large datasets (96-192 GB is not a very large dataset in this context, but it's a reasonable amount of index space for a 1.5 TB or 3 TB dataset on the CPU socket).
Once you start having to transfer stuff into and out of VRAM, GPU performance typically starts to rapidly degrade. It's "only a second or two" to you... but that's a terabyte's worth of VRAM bandwidth that sat idle for that time. "Rolling" approaches like this do not work very well on GPUs. You want to avoid transferring stuff on or off as much as possible because you just can't do it fast enough.
Computational intensity is one way to avoid doing that, both in VRAM and to host memory. If you can transfer a byte (per core) every P cycles, and you perform at least P cycles' worth of computation on it... you're compute bound. And at that point you can probably make a "rolling" algorithm work. But the problem is doing that (without just being obviously wasteful). In practice, P is something like 72 cycles per byte (working from memory here). It's really hard to find that much work to do on a byte in many cases. So despite their massive compute power... GPUs are often I/O bound on most tasks. Crypto tasks (hashing) are a notable exception.
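Back-of-envelope version of that, with assumed peak numbers (they'll vary by card), just to show how steep the requirement gets once the data has to cross NVLink or PCIe instead of already sitting in VRAM:

    // intensity.cu (host-side math only) -- rough ops-per-byte needed before
    // the ALUs, rather than a given link, become the bottleneck. All peak
    // numbers here are assumptions for illustration, not measurements.
    #include <cstdio>

    int main() {
        const double flops     = 10e12;   // ~10 TFLOPS FP32 (assumed)
        const double vram_bw   = 700e9;   // ~700 GB/s VRAM (assumed)
        const double nvlink_bw = 64e9;    // ~64 GB/s aggregate NVLink (assumed)
        const double pcie_bw   = 16e9;    // ~16 GB/s PCIe 3.0 x16 (assumed)

        // Fewer operations per byte than this and that tier's bandwidth,
        // not the compute, sets your throughput.
        printf("ops/byte to hide VRAM:   ~%.0f\n", flops / vram_bw);    // ~14
        printf("ops/byte to hide NVLink: ~%.0f\n", flops / nvlink_bw);  // ~156
        printf("ops/byte to hide PCIe:   ~%.0f\n", flops / pcie_bw);    // ~625
        return 0;
    }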
Note that I/O bound here can happen in different areas too. You can be I/O bound over the GPU bus, or on the VRAM. It depends on where the piece of data you need lives.
Also, in gaming it's typically the CPU which is bound in draw-calls. The CPU can't assemble the command lists fast enough to saturate the GPU (at least not on a single thread). I'm not sure I'd say it's necessarily I/O bound in this case, either, but it's possible.
Normally GPUs will only use a fraction of their VRAM bandwidth while gaming - however, like anything else, throwing superfluous resources at it will still produce a speedup even if it's not really "the KEY bottleneck". Having memory calls return in fewer cycles will let your GPU get its shaders back to work quicker, even though it means the memory may be idle for a greater percent of time (i.e. utilization continues to fall).
Assuming you were going to do a DB where you store literally everything on the GPU - I'm not sure whether it would be better to go columnar or by row. In a columnar approach you would have one (or more) GPUs per column, and they would either sequentially broadcast their resulting rowIDs and all perform set-intersection on their own datasets, or dump their set of output rowIDs to a single "master" GPU which would find the set intersection. In a row approach, you would have each GPU find "candidate" row-IDs and then dump them to the CPU, and the CPU handles broadcasting and intersecting (since hopefully it's not a large set-intersection).
Columnar might be faster for supported queries, but row would be simpler to implement and more flexible (since the critical data processing would more or less be taking place on the CPU).
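For the set-intersection step itself I'd probably reach for something like Thrust first; a toy sketch with made-up row IDs (standing in for what two column scans, or two GPUs, would emit) would look like this:

    // intersect.cu -- toy sketch of the intersection step: two sorted lists
    // of candidate row IDs are intersected on the device with Thrust.
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/set_operations.h>
    #include <thrust/sort.h>
    #include <algorithm>
    #include <cstdio>

    int main() {
        int h_a[] = {1, 4, 7, 9, 42, 100};        // made-up candidate rowIDs
        int h_b[] = {4, 9, 50, 100, 512};
        thrust::device_vector<int> rows_a(h_a, h_a + 6);
        thrust::device_vector<int> rows_b(h_b, h_b + 5);

        // set_intersection needs sorted inputs; predicate scans may not
        // emit them in order, so sort first.
        thrust::sort(rows_a.begin(), rows_a.end());
        thrust::sort(rows_b.begin(), rows_b.end());

        thrust::device_vector<int> out(std::min(rows_a.size(), rows_b.size()));
        auto end = thrust::set_intersection(rows_a.begin(), rows_a.end(),
                                            rows_b.begin(), rows_b.end(),
                                            out.begin());
        out.resize(end - out.begin());

        thrust::host_vector<int> result = out;    // copy back for printing
        for (size_t i = 0; i < result.size(); ++i)
            printf("%d ", result[i]);             // prints: 4 9 100
        printf("\n");
        return 0;
    }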
I would have to play with it to be sure, and I don't have access to a cluster of GPUs anymore (and never had access to one with NVLink or GPUDirect).
But the AST is effectively a "tiered system". Let's say you have a 20TB DB: 96GB fits in VRAM, 2TB fits in your RAM, and 17TB and change are on your disks.
Today the GPU can see this as a single contiguous memory space: it does the AST and handles page faults when needed, and tries to optimize the execution of each task by prefetching the needed data into VRAM, either by itself through branch prediction or via the usage hints you program yourself.
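In CUDA terms, that hints-plus-prefetch path looks roughly like this (a sketch with an arbitrary 1 GiB managed allocation and a trivial kernel, not anything tuned):

    // um_prefetch.cu -- sketch of the hints + prefetch path: a managed
    // allocation is homed on the GPU and explicitly prefetched before the
    // kernel runs, instead of paging in on demand.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void touch(float* data, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;          // trivial placeholder work
    }

    int main() {
        const size_t n = (size_t)1 << 28;    // 2^28 floats = 1 GiB (arbitrary)
        const size_t bytes = n * sizeof(float);
        int dev = 0;
        cudaSetDevice(dev);

        float* data;
        cudaMallocManaged(&data, bytes);     // one address space, CPU + GPU
        cudaMemset(data, 0, bytes);

        // Usage hints: home the pages on the GPU, but keep a CPU mapping so
        // host reads don't force everything to migrate back.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, dev);
        cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

        // Explicit prefetch so the kernel doesn't start out taking page faults.
        cudaMemPrefetchAsync(data, bytes, dev, 0);

        touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
        cudaDeviceSynchronize();

        // Prefetch back before the CPU walks the result.
        cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);
        cudaFree(data);
        return 0;
    }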
Here is a short paper/write-up on unified memory in Pascal. Look at the difference between relying on page faults alone and optimizing/profiling your memory usage with prefetch hints: with a data set 4 times the maximum GPU memory, performance drops by about 3.8-4x when you rely on page faults alone, versus losing less than 50% of your performance with the hints while still going over 4 times your maximum available VRAM.
The same thing can be done on cards with 24GB, and the magic is that the performance drop actually levels off at about 2-3 times your maximum available VRAM, which is why the difference between 2x and 4x over-allocation isn't that big in terms of performance hit relative to the over-allocation ratio; so 8 times your amount of memory is not that far behind the penalty you already pay for using 4 times more.
And yes, there is no scenario in which you do not lose performance, but the unified memory solution in NVIDIA GPUs is a very good way to reduce the penalty, and if you really need to go balls to the wall you'll actually benefit from the scaling.