
> We can achieve up to 100 tokens per second single-stream while GPT-4 runs around 20 tokens per second at best.

Is that with batching? If so, that's quite impressive.

> certain challenging questions where it is capable of getting the right answer, the Phind Model might take more generations to get to the right answer than GPT-4.

Some of this is sampler tuning. Y'all should look at grammar based sampling (https://github.com/ggerganov/llama.cpp/pull/1773) if you aren't using it already, as well as some of the "dynamic" sampling like mirostat and dynatemp: https://github.com/LostRuins/koboldcpp/pull/464

I think these should work with nvidia's implementation if you just swap the sampling out with the HF version.
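
To make that concrete, here's a minimal sketch of what "swapping the sampling out" can look like on the HF side: an entropy-scaled temperature written as a LogitsProcessor. The scaling rule is purely illustrative, not the exact dynatemp or mirostat math.

    # A minimal sketch, assuming an HF-style sampling loop.
    # The entropy-based temperature rule is illustrative only.
    import math
    import torch
    from transformers import LogitsProcessor, LogitsProcessorList

    class EntropyTemperatureProcessor(LogitsProcessor):
        """Scale temperature with the entropy of the next-token distribution:
        confident steps get sharpened, uncertain steps get smoothed."""
        def __init__(self, min_temp: float = 0.5, max_temp: float = 1.5):
            self.min_temp = min_temp
            self.max_temp = max_temp

        def __call__(self, input_ids, scores):
            probs = torch.softmax(scores, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-10).log()).sum(dim=-1, keepdim=True)
            frac = entropy / math.log(scores.shape[-1])   # normalized entropy in [0, 1]
            temp = self.min_temp + (self.max_temp - self.min_temp) * frac
            return scores / temp

    # Usage with any HF model:
    # out = model.generate(input_ids, do_sample=True,
    #                      logits_processor=LogitsProcessorList([EntropyTemperatureProcessor()]))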

BTW, all this is a great advantage of pulling away from OpenAI. You can dig in and implement experimental features that you just can't necessarily do through their API.



We leverage Flash Decoding (https://crfm.stanford.edu/2023/10/12/flashdecoding.html) in TensorRT-LLM to achieve 100 tokens per second on H100s.
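
For anyone curious, the core idea in the linked post is to split the KV cache into chunks, attend to each chunk in parallel, and then merge the partial results with a log-sum-exp reduction. A toy NumPy illustration of that merge (obviously not the fused GPU kernel that actually runs):

    # Toy sketch of the split-KV + log-sum-exp merge behind Flash Decoding.
    import numpy as np

    def flash_decode_attention(q, K, V, num_chunks=4):
        """q: (d,), K: (seq, d), V: (seq, d) -> attention output (d,)."""
        partial_out, chunk_max, chunk_lse = [], [], []
        for K_c, V_c in zip(np.array_split(K, num_chunks), np.array_split(V, num_chunks)):
            s = K_c @ q / np.sqrt(q.shape[-1])      # scores for this KV chunk
            m = s.max()
            w = np.exp(s - m)
            partial_out.append(w @ V_c)             # unnormalized chunk output
            chunk_max.append(m)
            chunk_lse.append(m + np.log(w.sum()))   # chunk log-sum-exp
        total_lse = np.logaddexp.reduce(np.array(chunk_lse))
        # Rescale each chunk's contribution by its share of the global softmax mass.
        return sum(np.exp(m - total_lse) * o for m, o in zip(chunk_max, partial_out))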


Is that impressive? I was thinking 100 tok/s on an H100 is really slow, considering LMDeploy claims 2000+ on an A100 with a large batch size.


We get 100 tokens a second with batch size 1. Those 2000+ figures are for large batches.


Ah, that's fair, and faster than any of the LMDeploy stats for batch size 1; nice work!

Using an H100 for inference, especially without batching, sounds awfully expensive. Is cost much of a concern for you right now?


I don't think they're saying they run batch size 1 in production, just setting expectations for user-facing performance.


Yeah, and this is basically what I was asking.

100 tokens/s on the user's end, on a host that is batching requests, is very impressive.


I think they _are_ saying batch size 1, given that rushingcreek is OP.


Yes they are saying batch size 1 for the benchmarks, but they aren't doing batch size 1 in prod (obviously).


I don't think that's obvious. If your use case demands the lowest latency at any cost, you might run batch size 1. I believe Replit's new code model (announced about a month ago) runs at batch size 1 in prod, for example, because code completions have to feel really fast to be useful.

With TensorRT-LLM + in-flight batching you can oversubscribe that one batch slot by beginning to process request N+1 while finishing request N, which can help a lot at scale.


I'm not sure about TensorRT, but in llama.cpp there are separate kernels optimized for batched and single-stream inference. It makes a substantial difference.

I suppose one could get decent utilization by processing one user's prompt while generating tokens for another.
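
A toy sketch of that kind of in-flight/continuous batching loop. The engine calls here (prefill, decode_step, release) are hypothetical stand-ins, not TensorRT-LLM's or llama.cpp's actual API; the point is just that a finished slot gets refilled mid-flight instead of waiting for the whole batch to drain.

    # Hedged sketch: a toy continuous-batching scheduler, not a real engine.
    from collections import deque

    def serve(engine, incoming_requests, max_slots=4):
        pending = deque(incoming_requests)
        active = {}  # request id -> engine slot

        while pending or active:
            # Fill any free slots with new requests and run their prompt (prefill) phase.
            while pending and len(active) < max_slots:
                req = pending.popleft()
                active[req.id] = engine.prefill(req.prompt)   # hypothetical call

            # One decode step advances every active request by a single token.
            finished = []
            for req_id, slot in active.items():
                token = engine.decode_step(slot)              # hypothetical call
                if token == engine.eos_token:
                    finished.append(req_id)

            # Free finished slots immediately so the next request can start its
            # prefill while the remaining requests keep decoding.
            for req_id in finished:
                engine.release(active.pop(req_id))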


Without batching, I was actually thinking that's kind of modest.

ExllamaV2 gets 48 tokens/s on a 4090, which is a much slower and cheaper card than an H100:

https://github.com/turboderp/exllamav2#performance

I didn't test codellama, but the 3090 Ti figures for other sizes are in the ballpark of my generation speed on a 3090.

100 tokens/s batched throughput (for each individual user) is much harder.



