
Per user throughput might be lower at the moment yes. We're working on GPU kernel level optimizations now to fix that.

But across all users on our system, the throughput is better because doing more prefills or a large number of grouped decodes has better utilization of the GPU.

The idea is that this works for someone who wants to build a product with consistent time-to-first-response across users, and who can trade off some end-to-end (E2E) latency for it. It ensures that no one is waiting a long time before getting the first response.



I don’t really get it. Prefill saturates compute and decode saturates memory bandwidth. Why are you not doing mixed batch?


You're totally right and we are doing a mixed batch. What we changed was the priority of performing prefills over decodes.

When looking at a variety of workloads, we realized that prioritizing finishing a query (prioritizing decodes) led to underutilization of the GPU. Because prefill wasn't prioritized, there tended not to be enough concurrently running requests to meaningfully utilize the memory bandwidth with the available decodes. This led to a system that was unfortunately neither compute nor memory bound.

By running mixed batches that prioritize prefills we still compute some decode tokens in our spare capacity, but ensure compute is as saturated as possible. This additionally leads to a buildup of decodes, so that when we are primarily computing decode we're pushing our memory bandwidth as much as we can.
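As a rough illustration of that scheduling policy (this is my own minimal sketch, not their actual scheduler; the per-step token budget, `Request` class, and chunking logic are all made-up simplifications), each engine step spends a fixed compute budget on prefill chunks first, then fills any spare capacity with decode tokens:

```python
from collections import deque

TOKEN_BUDGET = 8  # toy value: tokens of compute per engine step

class Request:
    def __init__(self, rid, prompt_len):
        self.rid = rid
        self.prefill_remaining = prompt_len  # prompt tokens not yet prefilled
        self.decoding = False                # True once prefill is complete

def schedule_step(waiting: deque, running: list) -> dict:
    """Build one mixed batch: prefills get priority, decodes fill the rest."""
    budget = TOKEN_BUDGET
    batch = {"prefill": [], "decode": []}

    # 1) Prefills first: chunk them so they fit the remaining budget.
    while waiting and budget > 0:
        req = waiting[0]
        chunk = min(req.prefill_remaining, budget)
        batch["prefill"].append((req.rid, chunk))
        req.prefill_remaining -= chunk
        budget -= chunk
        if req.prefill_remaining == 0:
            waiting.popleft()
            req.decoding = True
            running.append(req)   # ready to decode from the next opportunity
        else:
            break  # budget exhausted mid-prefill; resume next step

    # 2) Spare capacity goes to decodes, one token per running request.
    for req in running:
        if budget == 0:
            break
        if req.decoding:
            batch["decode"].append(req.rid)
            budget -= 1
    return batch
```

With a budget of 8, a step that admits a 5-token prompt spends its remaining 3 tokens chunk-prefilling the next prompt rather than decoding; once prefills drain, the accumulated running requests form large grouped decode batches, which is the buildup effect described above.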

Of course there are still plenty of improvements to be made on this front. Finding a dynamic balance between prefill and decode that pushes both memory bandwidth and compute to their limits is the goal from a scheduling perspective. A whole host of factors such as the model architecture, input-token:output-token ratio, underlying hardware, and KV-cache allocation (and many more) all play into the pressure placed on memory and compute, so there's definitely still exploration to be done!



