When hardware is this expensive and this difficult to obtain, performance starts to trump the cost of a single-model implementation pretty quickly.
It can be cheaper to deploy llama.cpp today and foobar.cpp six months from now than it is to run inference that is 2x slower.
Interestingly, in the LLM space all these model servers seem to be converging on the same API as OpenAI, making it easy to swap containers and get a different model+inference server with zero code changes.
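To make that concrete, here is a minimal sketch of what that swap looks like in practice: the same OpenAI Python client talks to a local llama.cpp server just by changing the base URL. The port, API key, and model name below are illustrative assumptions, not anything mandated by a specific deployment.

```python
# A minimal sketch: the standard OpenAI client pointed at a local
# llama.cpp server via its OpenAI-compatible /v1 endpoint.
# Assumes llama-server is listening on localhost:8080 (the default);
# the model name is whatever the server has loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # swap this to target a different server
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # illustrative; use the served model's name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Swapping to a different inference server (vLLM, foobar.cpp, or a hosted endpoint) means changing only `base_url` and possibly `model`; the application code stays the same.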