Perf degradation between vLLM 0.14.0 and 0.15.1 (and nightly)
#25 opened by Meital
Hi, I love the model!
For both models (M2.1 and M2.5), there is a performance (throughput) degradation between vLLM 0.14.0 and newer versions, which is visible under high load with large requests. Is this something you are aware of?
I see this on both 4xH100 and 8xH100.
I launch the model like this (this configuration gave me the best performance):
docker run -d --gpus all -p 0.0.0.0:8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.15.1 \
  --model MiniMaxAI/MiniMax-M2.5 --port 8000 \
  --max-model-len 128000 --max-num-seqs 64 \
  --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 --swap-space 16 \
  --enable-expert-parallel --trust_remote_code
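During the runs I also track in-flight requests by polling the server's Prometheus /metrics endpoint (that is how I get peak concurrency). A rough sketch of what I do; the metric names are assumed from the vLLM exporter and the URL/port match the command above:

```python
# Rough sketch: poll vLLM's /metrics endpoint and track peak concurrency.
# Metric names (vllm:num_requests_running / vllm:num_requests_waiting) are
# assumed from the vLLM Prometheus exporter; adjust if your version differs.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def read_gauge(text: str, name: str) -> float:
    # Prometheus exposition format: 'vllm:num_requests_running{...} 23.0'
    m = re.search(rf"^{re.escape(name)}\S*\s+([0-9.eE+-]+)$", text, re.MULTILINE)
    return float(m.group(1)) if m else 0.0

peak = 0.0
for _ in range(600):  # poll once per second for ~10 minutes
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    running = read_gauge(body, "vllm:num_requests_running")
    waiting = read_gauge(body, "vllm:num_requests_waiting")
    peak = max(peak, running)
    print(f"running={running:.0f} waiting={waiting:.0f} peak={peak:.0f}")
    time.sleep(1)
```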
vLLM Performance Regression: v0.14.0 vs v0.15.1
Setup: MiniMax-M2.5, 4xGPU, TP=4, --max-num-seqs 64, --gpu-memory-utilization 0.9
Workload: 150 requests, 6s interval, ~37.5K input tokens, 4K output tokens, ~90% prefix cache hit
┌────────────────────────┬───────────┬───────────┬──────────────────┐
│ Metric │ v0.14.0 │ v0.15.1 │ Diff │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Latency mean │ 97.0s │ 113.3s │ +16.8% worse │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Latency median │ 95.9s │ 118.9s │ +24.0% worse │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Latency p95 │ 107.4s │ 129.6s │ +20.7% │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Latency p99 │ 130.0s │ 152.3s │ +17.2% │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ TTFT mean │ 3.08s │ 3.56s │ +15.6% │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ TTFT median │ 1.63s │ 1.57s │ ~same │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ TTFT p95 │ 8.74s │ 15.12s │ +73.0% worse │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Out tok/s mean │ 41.6 │ 36.1 │ -13.2% slower │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Out tok/s median │ 41.7 │ 33.6 │ -19.4% slower │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Out tok/s (gen) median │ 42.6 │ 34.3 │ -19.5% │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ System throughput │ 617 tok/s │ 617 tok/s │ same │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Peak concurrency │ 23 │ 26 │ +13% more pileup │
└────────────────────────┴───────────┴───────────┴──────────────────┘
Key findings:
- ~19% per-request throughput regression at steady-state (median: 41.7 vs 33.6 tok/s)
- 24% higher median latency (95.9s vs 118.9s)
- TTFT p95 up 73% (8.7s vs 15.1s) - worst-case scheduling delays are substantially worse
- System-wide throughput is the same (617 tok/s), meaning v0.15.1 compensates with more concurrent requests (26 vs 23 peak), but individual user experience degrades significantly
I can share the scripts for reproduction if needed. In the meantime, a minimal sketch of the request pattern is below (hypothetical prompt/helper names, not the exact script): 150 requests started at 6s intervals against the OpenAI-compatible endpoint, streamed so that TTFT and per-request output tok/s can be measured.
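```python
# Minimal sketch of the load pattern (not the exact benchmark script).
# Assumptions: vLLM's OpenAI-compatible server on localhost:8000, the
# openai>=1.x Python client, and a long shared-prefix prompt to approximate
# the ~37.5K-token inputs with a high prefix-cache hit rate.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

BASE_URL = "http://localhost:8000/v1"
MODEL = "MiniMaxAI/MiniMax-M2.5"
NUM_REQUESTS = 150
INTERVAL_S = 6        # one new request every 6 seconds
MAX_TOKENS = 4096     # ~4K output tokens per request

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")

async def one_request(prompt: str) -> dict:
    t0 = time.perf_counter()
    ttft = None
    n_chunks = 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=MAX_TOKENS,
        stream=True,
    )
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - t0   # time to first token
        n_chunks += 1                          # rough proxy: ~1 token per chunk
    latency = time.perf_counter() - t0
    gen_time = max(latency - (ttft or 0.0), 1e-9)
    return {"latency": latency, "ttft": ttft, "out_tok_s": n_chunks / gen_time}

async def main():
    # Repeated shared text as a stand-in for the real long-context prompt.
    prompt = "shared context ... " * 5000
    tasks = []
    for _ in range(NUM_REQUESTS):
        tasks.append(asyncio.create_task(one_request(prompt)))
        await asyncio.sleep(INTERVAL_S)
    results = await asyncio.gather(*tasks)
    lat = sorted(r["latency"] for r in results)
    print("median latency:", lat[len(lat) // 2])

asyncio.run(main())
```

Running the same sketch against the v0.14.0 and v0.15.1 containers (only the image tag changes) is enough to reproduce the per-request numbers in the table above on my setup.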