
Ah, that's fair, and faster than any of the LMDeploy stats for batch size 1; nice work!

Using an H100 for inference, especially without batching, sounds awfully expensive. Is cost much of a concern for you right now?



I don't think they're saying they're doing batch size 1, just setting expectations for user-facing performance.


Yeah, and this is basically what I was asking.

100 tokens/s on the user's end, on a host that is batching requests, is very impressive.


I think they _are_ saying batch size 1, given that rushingcreek is OP.


Yes they are saying batch size 1 for the benchmarks, but they aren't doing batch size 1 in prod (obviously).


I don't think that is obvious. If your use case demands lowest latency at any cost, you might run batch size 1. I believe replit's new code model (announced about a month ago) runs at batch 1 in prod, for example, because code completions have to feel really fast to be useful.

With TensorRT-LLM + in-flight batching you can oversubscribe that one batch slot by beginning to process request N+1 while finishing request N, which can help a lot at scale.
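
For anyone curious what that looks like, here's a minimal Python sketch of the scheduling idea only, not TensorRT-LLM's actual API. The `forward` callable, the request dict shape, and CHUNK are made up for illustration; the point is that a chunk of request N+1's prompt can ride in the same engine step as request N's next decode token.

    # Sketch of per-step (in-flight) scheduling -- hypothetical names throughout.
    from collections import deque

    CHUNK = 256  # hypothetical prefill chunk size per step

    def serve(requests, forward):            # `forward` runs one engine step on a mixed batch
        waiting = deque(requests)            # each request: {"prompt": [...], "done": False}
        active = []
        while waiting or active:
            step = [("decode", r) for r in active]            # one new token per running request
            if waiting:                                       # fold in the next request's prefill
                nxt = waiting[0]
                step.append(("prefill", nxt, nxt["prompt"][:CHUNK]))
                nxt["prompt"] = nxt["prompt"][CHUNK:]
                if not nxt["prompt"]:                         # prompt fully ingested -> start decoding
                    active.append(waiting.popleft())
            forward(step)                                     # single batched forward pass
            active = [r for r in active if not r["done"]]     # drop finished requests

So even with "batch size 1" generation, request N+1's prefill never sits idle waiting for N to drain.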


I'm not sure about TensorRT, but in llama.cpp there are separate kernels optimized for batched versus single-request inference. It makes a substantial difference.

I suppose one could get decent utilization by processing one user's prompt while generating tokens for another.
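
Rough numpy illustration of why the separate kernels matter (a stand-in, not llama.cpp's actual kernels): batch-1 decode is essentially a matrix-vector product, so every weight is read from memory to produce a single token, while batched decode becomes a matrix-matrix product that reuses each weight across the whole batch.

    import numpy as np, time

    d = 4096
    W = np.random.randn(d, d).astype(np.float32)   # one weight matrix of a layer

    def bench(batch):
        x = np.random.randn(d, batch).astype(np.float32)
        t0 = time.perf_counter()
        for _ in range(50):
            W @ x                                   # one "decode step" for `batch` users
        dt = time.perf_counter() - t0
        print(f"batch={batch:3d}: {50 * batch / dt:8.0f} token-equivalents/sec")

    for b in (1, 8, 32):
        bench(b)

Throughput per token rises sharply with batch size because the weight reads get amortized, which is also why mixing one user's prefill with another's decode in the same pass helps utilization.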



