Show HN: Software for Remote GPU-over-IP (github.com/juice-labs)
138 points by stevegolik on Dec 14, 2022 | 56 comments
We built installable software for Windows & Linux that makes any remote Nvidia GPU accessible to, and shareable across, any number of remote clients running local applications, all over standard networking.


> Basically, we aren't targeting support for graphical applications running on Linux because there is very little demand for this - but we cover everything else. You CAN run graphical applications on Windows against a Linux server.

Ah, disappointing; I was hoping to try this with a Steam Deck as an alternative to using Moonlight and streaming the entire game from the Windows machine.


I'm confused - where is the actual source code? This repo only has some Dockerfiles that, as far as I can tell, pull precompiled opaque binaries, plus some convenience scripts to set up the required runtime environment.


I don't think it's open source.

I assume it cost quite some $$$ to produce this, because you kinda have to cut Nvidia's binary drivers in half, which is going to require quite a lot of reverse engineering.


Having a software showcase or landing page on Github when the underlying software is not actually open source, while completely legitimate, always leaves a really bad taste in my mouth.


We started it on GitHub because we're planning to open source part of our code in the next month or two - namely the control plane business logic to determine which clients connect to which servers under what conditions, a k8s integration, etc.


It's not open source, though I expect our control plane to be made open source very soon.

We've got some of the best devs working on this and we hope to impress!


Serverless GPU has been all the rage this past month - I'd love to see a review of this from someone who knows how to benchmark a GPU workload.

In particular:

- Autoscaling Stable Diffusion Inference

- Traditional creative workflows (realtime GPU viewport in octane for example)

- Gaming from one GPU in your house to everywhere else

I get the training example for small models, but I can't imagine it scales that well with model size.

The big value seems to be... sharing one GPU across many computers, so you spend less on a cluster? Capacity fungibility is real value but hard to measure!

In any case, stuff like this is a good bet. GPU software will continue to increase in prevalence, and utilization will remain low. Solving for the compute market liquidity is important despite NVIDIA's best efforts.


We have all these running fantastically - please check out our Discord where we have clips and demonstrations of these sorts of workloads. https://discord.gg/2SWbpXx9


I'm at the server limit and can't join additional servers without leaving something else. Can you add this stuff to the github repo?


For anything involving inference you're much better off with one of the many inference model servers such as TensorFlow Serving, Triton Inference Server, etc.


That's the biggest problem with this model. With inference it's better to just use a dedicated model server. For training it's better to deploy on a massive dedicated machine. The only real use case left over is experimentation and debugging for devs or students.


I don't doubt it at this point in time but can you say more?

I have to imagine a lot of ML infra today is built for Big Dedicated Deployments and not necessarily friendly with more serverless architectures.

That is to say, I'd guess a robust version of this has its use cases - whether that value prop is in DX, autoscaling, architecture simplification... I'm not sure.


Inference servers essentially turn a model running on CPU and/or GPU hardware into a microservice.

Many of them implement the KServe API standard[0], which covers everything from model loading/unloading to (of course) inference requests across models, versions, frameworks, etc.

So in the case of Triton[1] you can have any number of different TensorFlow/torch/tensorrt/onnx/etc models, versions, and variants. You can have one or more Triton instances running on hardware with access to local GPUs (for this example). Then you can put standard REST and/or gRPC load balancers (or whatever you want) in front of them, hit them via another API, whatever.

Now all your applications need to do to perform inference is do an HTTP POST (or use a client[2]) to a Triton endpoint URL for model input, Triton runs it on a GPU (or CPU if you want), and you get back whatever the model output is. So now everything else in your architecture other than Triton doesn't even know what a GPU or ML is.
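
To make that concrete, here's a minimal sketch using the tritonclient Python package. The endpoint, model name ("resnet50"), and tensor names ("INPUT0"/"OUTPUT0") are placeholders - substitute whatever your model repository actually exposes:

    # Minimal Triton HTTP inference sketch; model/tensor names are placeholders.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Build one FP32 input tensor filled with dummy data.
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    out = httpclient.InferRequestedOutput("OUTPUT0")

    # The application never touches a GPU directly - Triton schedules the work.
    result = client.infer(model_name="resnet50", inputs=[inp], outputs=[out])
    print(result.as_numpy("OUTPUT0").shape)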

It also makes deploying new models, versions, whatever much simpler - you POST them to Triton (or it loads them from S3, local disk, whatever) and they're instantly available everywhere.

Not a sales pitch for Triton but it (like some others) can also do things like dynamic batching with QoS parameters, automated model profiling and performance optimization[3], really granular control over resources, response caching, python middleware for application/biz logic, accelerated media processing with Nvidia DALI, memory management and control, all kinds of stuff.

[0] - https://github.com/kserve/kserve

[1] - https://github.com/triton-inference-server/server

[2] - https://github.com/triton-inference-server/client

[3] - https://github.com/triton-inference-server/model_analyzer


It surprises me that this works well enough to be useful. I would have thought that network latency, being orders of magnitude higher than memory latency, would be a huge problem. Latency Numbers Everyone Should Know: https://static.googleusercontent.com/media/sre.google/en//st...


I'd be surprised if this works for anything latency sensitive over anything more than a LAN.

Even just the speed-of-light travel time between NY and LA (4×10^6 m / 3×10^8 m/s = 1/75 s ≈ 13 ms) is roughly how long a 60 fps frame is (1/60 s). Add the OS serializing the frame from the GPU onto the network card, plus the network switching of those packets, and you're starting to really feel that latency.
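
Putting those numbers side by side (a quick sanity check, distance rounded to ~4,000 km):

    # One-way light delay NY -> LA vs. a 60 fps frame budget.
    distance_m = 4.0e6                 # ~4,000 km
    c = 3.0e8                          # m/s; light in fiber is slower still
    one_way_ms = distance_m / c * 1e3  # ~13.3 ms
    frame_ms = 1000 / 60               # ~16.7 ms
    print(f"one-way propagation: {one_way_ms:.1f} ms, frame budget: {frame_ms:.1f} ms")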


There are people out there gaming at 30fps with their TV set to Super Duper Image Processing Mode 500ms Latency Edition. Though I suppose these are realistically already served by the cloud gaming offerings.


Yeah, I've tried with shadow.tech and you can feel the latency. There's enough throughput to get a quality video stream through but there's enough latency to feel annoyed. I only play sandbox games though so I imagine it'd be worse with something competitive.


The datacenter is probably not thousands but hundreds of kilometers away, so there is room to deliver 60 fps. I was surprised how well GeForce Now works.


With 5G, clouds now offer to deploy services to small datacenters near 5G customers - in some cases less than a mile away.


Excuse me but does my 5G cloud have GPUs to spare right now, I could really use some shade.


I don’t understand what 5G has to do with this.


For gaming, this is obviously a no-go. But for bunch of AI/ML related workloads, it might make perfect sense.


GPUs usually run on big command buffers that are generated in a streaming fashion and then submitted at specific points, so it's theoretically possible that a game could hit 60fps this way. You'd just be eating extra latency between command buffer submission and actual rendering.
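
A toy illustration of that pipelining idea (purely a sketch - the delay and frame-time constants are made up, and this has nothing to do with Juice's actual protocol): the client keeps recording and submitting command buffers without waiting, so steady-state throughput can stay at 60 fps while each frame simply arrives one link-delay later.

    import queue, threading, time

    LINK_DELAY = 0.013   # pretend one-way network delay (~13 ms)
    FRAME_TIME = 1 / 60  # ~16.7 ms of "recording" per command buffer

    submitted = queue.Queue()
    t0 = time.perf_counter()

    def remote_gpu():
        # Drains command buffers as they arrive; each one "crosses the wire" first.
        while True:
            frame = submitted.get()
            if frame is None:
                return
            time.sleep(LINK_DELAY)
            print(f"frame {frame} rendered at t={time.perf_counter() - t0:.3f}s")

    worker = threading.Thread(target=remote_gpu)
    worker.start()

    for frame in range(5):
        time.sleep(FRAME_TIME)  # client records the next command buffer
        submitted.put(frame)    # submit without waiting for the result

    submitted.put(None)
    worker.join()

Frames still come out on a ~16.7 ms cadence; they're just offset by the link delay.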


Here is a video of DOOM 2016 running at 60 fps. https://discord.com/channels/755570806397993111/755570806397...


Not so sure about no-go. The amount of GPU latency in modern AAA titles already approaches 20+ms in the most egregious cases.

Unless there is a need to evict all gpu memory on every frame, I think it is feasible to game on GPUs that live across a very fast LAN.


High-speed Ethernet is getting cheaper than ever - you can easily get 10Gb on consumer gear, or even 20Gb, and used hardware at 40Gb or maybe even 100Gb is getting pretty affordable.


In my experience 40G gear can often be had for cheaper than 10G. I have a pair of Mellanox 40G InfiniBand cards that cost me about $20 each on eBay and could be turned into Ethernet cards with a few commands.


The output screencast could be encoded right in the GPU pipeline.


About 10 years ago I found that set operations in Ruby were slower than set operations in Redis. So I shipped all my data over the network, let Redis sort it into a sorted set, crunched my data in Redis, and then retrieved it again over the network in its reduced form… I think it makes sense that for vector operations a remote GPU could be pretty cool. Now if we can get this working from MacBooks to Linux GPUs I'd be pretty stoked.
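
For the curious, the Redis part of that trick looks roughly like this with redis-py (the key name and sample data are made up):

    # Offload sorting to Redis: push scores into a sorted set, read them back ordered.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.delete("scores")
    r.zadd("scores", {"alice": 42, "bob": 7, "carol": 99})

    # Redis keeps the set ordered by score, so retrieval is already sorted.
    for member, score in r.zrange("scores", 0, -1, withscores=True):
        print(member.decode(), score)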


PCI Express 4.0 x16 does 31.5 GB/s. The fastest fiber Ethernet (400GbE) does 50 GB/s. So it "could" be useful if you have datacenter-grade equipment ;)
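
The arithmetic behind those two numbers:

    # PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 16 lanes.
    pcie4_x16 = 16e9 * (128 / 130) / 8 * 16 / 1e9  # ~31.5 GB/s
    # 400 Gbit/s Ethernet, converted to bytes.
    eth_400g = 400e9 / 8 / 1e9                     # 50.0 GB/s
    print(f"PCIe 4.0 x16: {pcie4_x16:.1f} GB/s, 400GbE: {eth_400g:.1f} GB/s")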


Those aren't latency numbers though, they're throughput.


Since we have full visibility into the pipeline, our (Juice Labs) Weissman scores are off the charts!


Let me help you.

1) Take off the glass

2) Use a drill, make the connect wider

3) now bundle 9 glass and put in hole

4) now bandwidth is more wide

Fasterness!


Video and tensors could be compressed before transmission.


Neato, sounds like Bitfusion in their early days!

Definitely of interest to us, even w/ latency limits, both for AI dev & investigations and occasional full runs.

I do have to wonder about the non-OSS licensing, as that's part of why we didn't spend much time on Bitfusion...


Didn't we have these things already? VirtualGL and Co. say hi.

Also, for most real GPU applications you need to get the data in and out. I don't think splitting compute across a (insert any non-InfiniBand link) solves this.


100GbE is pretty similar to InfiniBand, no? Or does InfiniBand still kill it on latency?


InfiniBand bypasses the kernel network stack; it has ~2µs latency these days over a LAN.


compress it


I see lots of comments in various ML repositories about trouble running on multiple GPUs. This seems like a great way to run across multiple low-VRAM GPUs instead of buying a huge expensive single card. It feels reminiscent of how Google built their clusters on commodity hardware, where they would just throw away a failed device rather than trying to fix it. This is really cool.


I doubt this does multi-server. All the GPUs probably have to be on the same machine.


Glad to see a https://virtaitech.com/en/index competitor. As far as I know, VirtAI doesn't provide freeware, but they do provide RDMA networking and GPU pooling features. For anyone interested in how this is done, I suggest having a look at https://github.com/ut-osa/gpunet and https://github.com/tkestack/vcuda-controller


That's really awesome. I'm not sure what I'd use it for but just being able to makes me want to find an excuse! What's impressive is this seems to have more capabilities than most "local" software vGPU solutions for e.g. VMs.


Do you have any numbers on the viability of using this for ML/AI workloads? It seems like once a model is loaded into GPU VRAM, the transactional new inputs/outputs would theoretically be trivial.
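
Back-of-the-envelope for that intuition, assuming something ResNet-50-shaped in FP32 (the sizes are illustrative, not measurements of Juice):

    weights_mb = 25.6e6 * 4 / 1e6       # ~25.6M params * 4 bytes ≈ 102 MB, transferred once
    input_mb = 3 * 224 * 224 * 4 / 1e6  # one 224x224 RGB tensor ≈ 0.6 MB per request
    output_kb = 1000 * 4 / 1e3          # 1000 class scores ≈ 4 KB per request
    print(f"~{weights_mb:.0f} MB once, then ~{input_mb:.2f} MB in / ~{output_kb:.0f} KB out per inference")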


For some use cases we're already at parity - YMMV.


Can this be used to accelerate video decode in a Linux machine/virtual machine? It sounds like it's not for graphics on Linux, but it's unclear to me where decode falls.


Wait, is the code actually FOSS or is this just freeware? I only see Dockerfiles.


Would this allow a VMware Workstation Linux VM to use the GPU from a Windows host with an Nvidia video card for ML usage?


Does it really feel like the GPU I use is one on my machine? Or do I have a lot of boilerplate to make it work client-side?


I haven't tried it yet, but based on their docs it seems like after setting the host in juice.cfg, you basically just need to run `juicify [application path]`: https://github.com/Juice-Labs/Juice-Labs/wiki/Juice-for-Wind...


Cool! "But can it run Crysis?"


CUDA Driver API or Runtime API remoting?


Driver.


Damn, this is cool. Nice work.


No point without RDMA-enabled GPUs…


Very nice.



