I don't think that is obvious. If your use case demands the lowest latency at any cost, you might run batch size 1. I believe replit's new code model (announced about a month ago) runs at batch size 1 in prod, for example, because code completions have to feel really fast to be useful.
With TensorRT-LLM + in-flight batching you can oversubscribe that one batch slot by starting to process request N+1 while request N is still finishing, which can help a lot at scale.
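The core idea is just that requests join and leave the running batch at every decode step instead of waiting for the whole batch to drain. A toy Python sketch of that scheduling loop (the Engine/Request interface here is made up for illustration, not the actual TensorRT-LLM API):

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        prompt: list[int]                  # prompt token ids
        max_new_tokens: int
        generated: list[int] = field(default_factory=list)

    class Engine:
        """Stand-in for the real inference engine."""
        def step(self, active: list[Request]) -> None:
            # One forward pass: append one new token to every active request.
            for r in active:
                r.generated.append(0)      # dummy token id

    def serve(engine: Engine, queue: deque, max_slots: int = 1) -> None:
        active: list[Request] = []
        while queue or active:
            # Admit new requests the moment a slot frees up, rather than
            # waiting for the whole batch to finish (static batching waits).
            while queue and len(active) < max_slots:
                active.append(queue.popleft())
            engine.step(active)
            # Retire finished requests immediately so the next one can start.
            active = [r for r in active if len(r.generated) < r.max_new_tokens]

Even with max_slots = 1 this hides the gap between requests, since the next prompt starts the instant the previous generation ends.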
I'm not sure about TensorRT, but llama.cpp has separate kernels optimized for batched inference and for single-sequence (batch size 1) inference. It makes a substantial difference.
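The rough intuition for why the two paths differ: at batch size 1, each decode matmul is a matrix-vector product, so you read the entire weight matrix to produce one output row and you're bound by memory bandwidth; with a batch, the same weights are reused across rows and it becomes a much denser matrix-matrix product. A small numpy illustration (numpy standing in for the hand-written GPU kernels):

    import numpy as np

    d_model, batch = 4096, 32
    W = np.random.randn(d_model, d_model).astype(np.float32)      # one weight matrix

    x_single = np.random.randn(d_model).astype(np.float32)        # batch size 1 decode
    x_batch  = np.random.randn(batch, d_model).astype(np.float32) # batched decode

    y_single = x_single @ W   # mat-vec: every byte of W is read for one output row
                              # -> memory-bandwidth bound, wants a mat-vec kernel
    y_batch  = x_batch @ W    # mat-mat: W is reused across 32 rows
                              # -> much higher arithmetic intensity, wants a GEMM kernel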
I suppose one could get decent utilization by processing the prompt for one user while generating tokens for another.
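I think that's roughly what mixed prefill/decode scheduling does: fill each forward pass up to a token budget with the single decode tokens first, then top it up with prompt chunks from whoever is still prefilling. A hand-wavy sketch, with all names made up:

    from dataclasses import dataclass

    @dataclass
    class Prefill:
        request_id: int
        prompt_tokens_left: int       # prompt tokens not yet processed

    @dataclass
    class Decode:
        request_id: int               # contributes exactly one token per step

    def plan_step(prefills, decodes, token_budget=512):
        """Decide what goes into one forward pass: decode tokens first (they
        are latency-sensitive), then as many prompt tokens as still fit."""
        work = [("decode", d.request_id, 1) for d in decodes]
        budget = token_budget - len(decodes)
        for p in prefills:
            if budget <= 0:
                break
            chunk = min(budget, p.prompt_tokens_left)
            work.append(("prefill", p.request_id, chunk))
            budget -= chunk
        return work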
Using an H100 for inference, especially without batching, sounds awfully expensive. Is cost much of a concern for you right now?