
Per user throughput might be lower at the moment yes. We're working on GPU kernel level optimizations now to fix that.

But across all users on our system, the throughput is better because doing more prefills or a large number of grouped decodes has better utilization of the GPU.

The idea is that this works for someone who wants to build a product with consistent time-to-first-response across users, and who can trade off some end-to-end (E2E) latency for it. It ensures that no one is waiting a long time before getting the first response.



I don’t really get it. Prefill saturates compute and decode saturates memory bandwidth. Why are you not doing mixed batch?


You're totally right and we are doing a mixed batch. What we changed was the priority of performing prefills over decodes.

When looking at a variety of workloads, we realized that prioritizing finishing a query (prioritizing decodes) led to underutilization of the GPU. Because prefill wasn't prioritized, there tended not to be enough concurrently running requests to meaningfully utilize the memory bandwidth with the available decodes. This led to a system that was unfortunately neither compute nor memory bound.

By running mixed batches that prioritize prefills we still compute some decode tokens in our spare capacity, but ensure compute is as saturated as possible. This additionally leads to a buildup of decodes, so that when we are primarily computing decode we're pushing our memory bandwidth as much as we can.
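As a rough illustration of that scheduling policy (this is my own minimal sketch, not their actual scheduler; the per-step token budget, `Request` class, and chunking logic are all made-up simplifications), each engine step spends a fixed compute budget on prefill chunks first, then fills any spare capacity with decode tokens:

```python
from collections import deque

TOKEN_BUDGET = 8  # toy value: tokens of compute per engine step

class Request:
    def __init__(self, rid, prompt_len):
        self.rid = rid
        self.prefill_remaining = prompt_len  # prompt tokens not yet prefilled
        self.decoding = False                # True once prefill is complete

def schedule_step(waiting: deque, running: list) -> dict:
    """Build one mixed batch: prefills get priority, decodes fill the rest."""
    budget = TOKEN_BUDGET
    batch = {"prefill": [], "decode": []}

    # 1) Prefills first: chunk them so they fit the remaining budget.
    while waiting and budget > 0:
        req = waiting[0]
        chunk = min(req.prefill_remaining, budget)
        batch["prefill"].append((req.rid, chunk))
        req.prefill_remaining -= chunk
        budget -= chunk
        if req.prefill_remaining == 0:
            waiting.popleft()
            req.decoding = True
            running.append(req)   # ready to decode from the next opportunity
        else:
            break  # budget exhausted mid-prefill; resume next step

    # 2) Spare capacity goes to decodes, one token per running request.
    for req in running:
        if budget == 0:
            break
        if req.decoding:
            batch["decode"].append(req.rid)
            budget -= 1
    return batch
```

With a budget of 8, a step that admits a 5-token prompt spends its remaining 3 tokens chunk-prefilling the next prompt rather than decoding; once prefills drain, the accumulated running requests form large grouped decode batches, which is the buildup effect described above.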

Of course there are still plenty of improvements to be made on this front. Finding a dynamic balance between prefill and decode that pushes both memory bandwidth and compute to their limits is the goal from a scheduling perspective. A whole host of factors such as the model architecture, input-token:output-token ratio, underlying hardware, and KV-cache allocation (and many more) all play into the pressure placed on memory and compute, so there's definitely still exploration to be done!



