Show HN: Software for Remote GPU-over-IP (github.com/juice-labs)
138 points by stevegolik on Dec 14, 2022 | 56 comments
We built installable software for Windows & Linux that makes any remote Nvidia GPU accessible to, and shareable across, any number of remote clients running local applications, all over standard networking.


> Basically, we aren't targeting support for graphical applications running on Linux because there is very little demand for this - but we cover everything else. You CAN run graphical applications on Windows against a Linux server.

Ah, disappointing; I was hoping to try this with a Steam Deck as an alternative to using Moonlight and streaming the entire game from the Windows machine.


I'm confused - where is the actual source code? This repo only has some Dockerfiles that, as far as I can tell, pull precompiled opaque binaries, plus some convenience scripts to set up the required runtime environment.


I don't think it's open source.

I assume it cost quite some $$$ to produce this, because you kinda have to cut Nvidia's binary drivers in half, which is going to require quite a lot of reverse engineering.


Having a software showcase or landing page on Github when the underlying software is not actually open source, while completely legitimate, always leaves a really bad taste in my mouth.


We started it on GitHub because we're planning to open source part of our code in the next month or two - namely the control plane business logic to determine which clients connect to which servers under what conditions, a k8s integration, etc.


It's not open source, though I expect our control plane to be made open source very soon.

We've got some of the best devs working on this and we hope to impress!


Serverless GPU has been all the rage this past month - I'd love to see a review of this from someone who knows how to benchmark a GPU workload.

In particular:

- Autoscaling Stable Diffusion Inference

- Traditional creative workflows (realtime GPU viewport in octane for example)

- Gaming from one GPU in your house to everywhere else

I get the training example for small models, but I can't imagine it scales that well with model size.

The big value seems to be... sharing one GPU across many computers, so you spend less on a cluster? Capacity fungibility is real value but hard to measure!

In any case, stuff like this is a good bet. GPU software will continue to increase in prevalence, and utilization will remain low. Solving for the compute market liquidity is important despite NVIDIA's best efforts.


We have all these running fantastically - please check out our Discord where we have clips and demonstrations of these sorts of workloads. https://discord.gg/2SWbpXx9


I'm at the server limit and can't join additional servers without leaving something else. Can you add this stuff to the github repo?


For anything involving inference you're much better off with one of the many inference model servers such as TensorFlow Serving, Triton Inference Server, etc.


That's the biggest problem with this model. With inference it's better to just use a dedicated model server. For training it's better to deploy on a massive dedicated machine. The only real use case left over is experimentation and debugging for devs or students.


I don't doubt it at this point in time but can you say more?

I have to imagine a lot of ML infra today is built for Big Dedicated Deployments and not necessarily friendly with more serverless architectures.

That is to say, I'd guess a robust version of this has its use cases - whether that value prop is in DX, autoscaling, architecture simplification... I'm not sure.


Inference servers essentially turn a model running on CPU and/or GPU hardware into a microservice.

Many of them implement the KServe API standard[0], which covers everything from model loading/unloading to (of course) inference requests across models, versions, frameworks, etc.

So in the case of Triton[1] you can have any number of different TensorFlow/torch/tensorrt/onnx/etc models, versions, and variants. You can have one or more Triton instances running on hardware with access to local GPUs (for this example). Then you can put standard REST and/or gRPC load balancers (or whatever you want) in front of them, hit them via another API, whatever.

Now all your applications need to do to perform inference is do an HTTP POST (or use a client[2]) to a Triton endpoint URL for model input, Triton runs it on a GPU (or CPU if you want), and you get back whatever the model output is. So now everything else in your architecture other than Triton doesn't even know what a GPU or ML is.
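
To make that concrete, here's a minimal sketch using the tritonclient Python package. The endpoint, model name ("resnet50"), and tensor names ("INPUT0"/"OUTPUT0") are placeholders - substitute whatever your model repository actually exposes:

    # Minimal Triton HTTP inference sketch; model/tensor names are placeholders.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Build one FP32 input tensor filled with dummy data.
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    out = httpclient.InferRequestedOutput("OUTPUT0")

    # The application never touches a GPU directly - Triton schedules the work.
    result = client.infer(model_name="resnet50", inputs=[inp], outputs=[out])
    print(result.as_numpy("OUTPUT0").shape)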

It also makes deploying new models, versions, whatever much simpler - you POST them to Triton (or it loads them from S3, local disk, whatever) and they're instantly available everywhere.

Not a sales pitch for Triton but it (like some others) can also do things like dynamic batching with QoS parameters, automated model profiling and performance optimization[3], really granular control over resources, response caching, python middleware for application/biz logic, accelerated media processing with Nvidia DALI, memory management and control, all kinds of stuff.

[0] - https://github.com/kserve/kserve

[1] - https://github.com/triton-inference-server/server

[2] - https://github.com/triton-inference-server/client

[3] - https://github.com/triton-inference-server/model_analyzer


It surprises me that this works well enough to be useful. I would have thought that network latency, being orders of magnitude higher than memory latency, would be a huge problem. Latency Numbers Everyone Should Know: https://static.googleusercontent.com/media/sre.google/en//st...


I'd be surprised if this works for anything latency sensitive over anything more than a LAN.

Even just the speed-of-light travel time between NY and LA (4×10^6 m / 3×10^8 m/s = 1/75 s ≈ 13 ms) is roughly how long a 60 fps frame is (1/60 s). Add the OS serializing the frame from the GPU onto the network card, plus the network switching of those packets, and you're starting to really feel that latency.
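
Putting those numbers side by side (a quick sanity check, distance rounded to ~4,000 km):

    # One-way light delay NY -> LA vs. a 60 fps frame budget.
    distance_m = 4.0e6                 # ~4,000 km
    c = 3.0e8                          # m/s; light in fiber is slower still
    one_way_ms = distance_m / c * 1e3  # ~13.3 ms
    frame_ms = 1000 / 60               # ~16.7 ms
    print(f"one-way propagation: {one_way_ms:.1f} ms, frame budget: {frame_ms:.1f} ms")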


There are people out there gaming at 30fps with their TV set to Super Duper Image Processing Mode 500ms Latency Edition. Though I suppose these are realistically already served by the cloud gaming offerings.


Yeah, I've tried with shadow.tech and you can feel the latency. There's enough throughput to get a quality video stream through but there's enough latency to feel annoyed. I only play sandbox games though so I imagine it'd be worse with something competitive.


The datacenter is probably not thousands but hundreds of kilometers away, so there is room to deliver 60 fps. I was surprised how well GeForce Now works.


With 5G, clouds now offer to deploy services to small datacenters near 5G customers - in some cases less than a mile away.


Excuse me but does my 5G cloud have GPUs to spare right now, I could really use some shade.


I don’t understand what 5G has to do with this.


For gaming, this is obviously a no-go. But for bunch of AI/ML related workloads, it might make perfect sense.


GPUs usually run on big command buffers that are generated in a streaming fashion and then submitted at specific points, so it's theoretically possible that a game could hit 60fps this way. You'd just be eating extra latency between command buffer submission and actual rendering.
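
A toy illustration of that pipelining idea (purely a sketch - the delay and frame-time constants are made up, and this has nothing to do with Juice's actual protocol): the client keeps recording and submitting command buffers without waiting, so steady-state throughput can stay at 60 fps while each frame simply arrives one link-delay later.

    import queue, threading, time

    LINK_DELAY = 0.013   # pretend one-way network delay (~13 ms)
    FRAME_TIME = 1 / 60  # ~16.7 ms of "recording" per command buffer

    submitted = queue.Queue()
    t0 = time.perf_counter()

    def remote_gpu():
        # Drains command buffers as they arrive; each one "crosses the wire" first.
        while True:
            frame = submitted.get()
            if frame is None:
                return
            time.sleep(LINK_DELAY)
            print(f"frame {frame} rendered at t={time.perf_counter() - t0:.3f}s")

    worker = threading.Thread(target=remote_gpu)
    worker.start()

    for frame in range(5):
        time.sleep(FRAME_TIME)  # client records the next command buffer
        submitted.put(frame)    # submit without waiting for the result

    submitted.put(None)
    worker.join()

Frames still come out on a ~16.7 ms cadence; they're just offset by the link delay.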


Here is a video of DOOM 2016 running at 60 fps. https://discord.com/channels/755570806397993111/755570806397...


Not so sure about no-go. The amount of GPU latency in modern AAA titles already approaches 20+ms in the most egregious cases.

Unless there is a need to evict all gpu memory on every frame, I think it is feasible to game on GPUs that live across a very fast LAN.


High-speed Ethernet is getting cheaper than ever - you can easily get 10Gb on consumer gear, or even 20Gb, and used hardware at 40Gb or maybe even 100Gb is getting pretty affordable.


In my experience 40G gear can often be had for cheaper than 10G. I have a pair of Mellanox 40G InfiniBand cards that cost me about $20 each on eBay and could be turned into Ethernet cards with a few commands.


The output screencast could be encoded right in the GPU pipeline.


About 10 years ago I found that set operations in Ruby were slower than set operations in Redis. So I shipped all my data over the network, let Redis sort it into a sorted set, crunched my data in Redis, and then retrieved it again over the network in its reduced form… I think it makes sense that for vector operations a remote GPU could be pretty cool. Now if we can get this working from MacBooks to Linux GPUs I'd be pretty stoked.
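
For the curious, the Redis part of that trick looks roughly like this with redis-py (the key name and sample data are made up):

    # Offload sorting to Redis: push scores into a sorted set, read them back ordered.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.delete("scores")
    r.zadd("scores", {"alice": 42, "bob": 7, "carol": 99})

    # Redis keeps the set ordered by score, so retrieval is already sorted.
    for member, score in r.zrange("scores", 0, -1, withscores=True):
        print(member.decode(), score)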


PCI Express 4.0 x16 does 31.5 GB/s. The fastest fiber Ethernet (400GbE) does 50 GB/s. So it "could" be useful if you have datacenter-grade equipment ;)
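
The arithmetic behind those two numbers:

    # PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 16 lanes.
    pcie4_x16 = 16e9 * (128 / 130) / 8 * 16 / 1e9  # ~31.5 GB/s
    # 400 Gbit/s Ethernet, converted to bytes.
    eth_400g = 400e9 / 8 / 1e9                     # 50.0 GB/s
    print(f"PCIe 4.0 x16: {pcie4_x16:.1f} GB/s, 400GbE: {eth_400g:.1f} GB/s")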


Those aren't latency numbers though, they're throughput.


Since we have full visibility into the pipeline, our (Juice Labs) Weissman scores are off the charts!


Let me help you.

1) Take off the glass

2) Use a drill, make the connect wider

3) now bundle 9 glass and put in hole

4) now bandwidth is more wide

Fasterness!


Video and tensors could be compressed before transmission.


Neato, sounds like Bitfusion in their early days!

Definitely of interest to us, even w/ latency limits, both for AI dev & investigations and occasional full runs.

I do have to wonder about the non-OSS licensing, as that's part of why we didn't spend much time on Bitfusion...


Didn't we have these things already? VirtualGL and Co. say hi.

Also, for most real GPU applications you need to get the data in and out. I don't think splitting compute across a (insert any non-InfiniBand link) solves this.


100GbE is pretty similar to InfiniBand, no? Or does InfiniBand still kill it on latency?


InfiniBand bypasses the kernel network stack; it has ~2µs latency these days over a LAN.


compress it


I see lots of comments in various ML repositories about trouble running on multiple GPUs. This seems like a great way to run across multiple low-VRAM GPUs instead of buying a huge expensive single card. It feels reminiscent of how Google built their clusters on commodity hardware, where they would just throw away a failed device rather than trying to fix it. This is really cool.


I doubt this does multi-server. All the GPUs probably have to be on the same machine.


Glad to see a https://virtaitech.com/en/index competitor. As far as I know, VirtAI doesn't provide freeware, but they do provide RDMA networking and GPU pooling features. For anyone interested in how this is done, I suggest having a look at https://github.com/ut-osa/gpunet and https://github.com/tkestack/vcuda-controller


That's really awesome. I'm not sure what I'd use it for but just being able to makes me want to find an excuse! What's impressive is this seems to have more capabilities than most "local" software vGPU solutions for e.g. VMs.


Do you have any numbers on the viability of using this for ML/AI workloads? It seems like once a model is loaded into GPU VRAM, the transactional new inputs/outputs would theoretically be trivial.
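
Back-of-the-envelope for that intuition, assuming something ResNet-50-shaped in FP32 (the sizes are illustrative, not measurements of Juice):

    weights_mb = 25.6e6 * 4 / 1e6       # ~25.6M params * 4 bytes ≈ 102 MB, transferred once
    input_mb = 3 * 224 * 224 * 4 / 1e6  # one 224x224 RGB tensor ≈ 0.6 MB per request
    output_kb = 1000 * 4 / 1e3          # 1000 class scores ≈ 4 KB per request
    print(f"~{weights_mb:.0f} MB once, then ~{input_mb:.2f} MB in / ~{output_kb:.0f} KB out per inference")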


For some use cases we're already at parity - YMMV.


Can this be used to accelerate video decode in a Linux machine/virtual machine? It sounds like it's not for graphics on Linux, but it's unclear to me where decode falls.


Wait, is the code actually FOSS or is this just freeware? I only see Dockerfiles.


Would this allow a VMware Workstation Linux VM to use the GPU from a Windows host with an Nvidia video card for ML usage?


Does it really feel like the GPU I use is one on my machine? Or do I have a lot of boilerplate to make it work client-side?


I haven't tried it yet, but based on their docs it seems like after setting the host in juice.cfg, you basically just need to run `juicify [application path]`: https://github.com/Juice-Labs/Juice-Labs/wiki/Juice-for-Wind...


Cool! "But can it run Crysis?"


CUDA Driver API or Runtime API remoting?


Driver.


Damn, this is cool. Nice work.


No point without RDMA-enabled GPUs…


Very nice.



