> We can achieve up to 100 tokens per second single-stream while GPT-4 runs around 20 tokens per second at best.
Is that with batching? If so, that's quite impressive.
I don't think that is obvious. If your use case demands the lowest latency at any cost, you might run batch size 1. I believe Replit's new code model (announced about a month ago) runs at batch size 1 in prod, for example, because code completions have to feel really fast to be useful.
With TensorRT-LLM + in-flight batching you can oversubscribe that one batch slot, by beginning to process request N+1 while finishing request N, which can help a lot at scale.
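Roughly, the scheduling trick looks like this. A toy sketch of the in-flight (continuous) batching idea, not TensorRT-LLM's actual API; all names here are made up for illustration:

```python
# Free batch slots are refilled with waiting requests at every decode step,
# so request N+1 starts while request N is still generating.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list                               # prompt token ids
    max_new: int                               # tokens still to generate
    output: list = field(default_factory=list)

def fake_decode_step(req: Request) -> int:
    return 0  # stand-in for a real forward pass + sampling

def serve(waiting: deque, batch_slots: int = 1) -> None:
    running: list[Request] = []
    while waiting or running:
        # Admit new work into any free slot instead of waiting for the
        # whole batch to drain -- the "oversubscription" mentioned above.
        while waiting and len(running) < batch_slots:
            running.append(waiting.popleft())

        # One decode step for every in-flight request.
        for req in running:
            req.output.append(fake_decode_step(req))
            req.max_new -= 1

        # Retire finished requests; their slots free up immediately.
        running = [r for r in running if r.max_new > 0]
```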
I'm not sure about TensorRT, but in llama.cpp there are separate kernels optimized for batched and single-use inference. It makes a substantial difference.
I suppose one could get decent utilization by prompt processing one user while generating tokens for another.
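For intuition, a toy numpy illustration of why the two regimes want different kernels (this is not llama.cpp code, just the shape of the problem):

```python
# Batch-1 decode is essentially a matrix-vector product, while prefill or
# batched decode is a matrix-matrix product that reuses each weight read
# across many rows.
import numpy as np

d_model, d_ff = 4096, 11008
W = np.random.randn(d_model, d_ff).astype(np.float32)

single_token = np.random.randn(1, d_model).astype(np.float32)    # decode, batch 1
prompt_chunk = np.random.randn(512, d_model).astype(np.float32)  # prefill / batched

y_decode = single_token @ W   # memory-bound: the weight read produces 1 output row
y_prefill = prompt_chunk @ W  # compute-bound: the same weight read produces 512 rows
```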
> certain challenging questions where it is capable of getting the right answer, the Phind Model might take more generations to get to the right answer than GPT-4.
Some of this is sampler tuning. Y'all should look at grammar-based sampling (https://github.com/ggerganov/llama.cpp/pull/1773) if you aren't using it already, as well as some of the "dynamic" sampling approaches like mirostat and dynatemp: https://github.com/LostRuins/koboldcpp/pull/464
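For anyone who hasn't seen it, the core idea of grammar-based sampling is just masking the logits down to whatever tokens the grammar allows next before sampling. A rough sketch (the allowed_ids set stands in for a real grammar/parser state; this is not llama.cpp's implementation):

```python
# Tokens the grammar would not accept next get their logits set to -inf,
# then we sample from the rest.
import numpy as np

def sample_with_grammar(logits: np.ndarray, allowed_ids: set[int],
                        temperature: float = 0.8) -> int:
    masked = np.full_like(logits, -np.inf)
    idx = np.fromiter(allowed_ids, dtype=np.int64)
    masked[idx] = logits[idx] / temperature

    # Softmax over the surviving tokens only; disallowed tokens get probability 0.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))
```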
I think these should work with Nvidia's implementation if you just swap the sampling out for the HF version.
BTW, all this is a great advantage of pulling away from OpenAI. You can dig in and implement experimental features that you just can't necessarily get through their API.
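On the HF side, the hook for this kind of experimentation is a LogitsProcessor passed to generate(). A minimal sketch; the entropy-scaled temperature rule here is just a placeholder for illustration, not mirostat or dynatemp as actually implemented:

```python
# Plugging custom sampling logic into Hugging Face generate() via a
# LogitsProcessor. The warping rule below (a naive entropy-scaled
# temperature) is only a placeholder.
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class EntropyScaledTemperature(LogitsProcessor):
    def __init__(self, base_temp: float = 0.7):
        self.base_temp = base_temp

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        probs = torch.softmax(scores, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1, keepdim=True)
        # Sample hotter when the model is uncertain, colder when it is confident.
        temp = self.base_temp * (1.0 + 0.1 * entropy)
        return scores / temp

# Usage (model/tokenizer loading omitted):
# out = model.generate(input_ids, do_sample=True,
#                      logits_processor=LogitsProcessorList([EntropyScaledTemperature()]))
```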