Perf degradation between vLLM 0.14.0 and 0.15.1 (and nightly)
#25 opened by Meital
Hi, I love the model!
For both models (M2.1 and M2.5), there is a performance (throughput) degradation between vLLM 0.14.0 and newer versions, which is visible under high load with large requests. Is this something you are aware of?
I see this on both 4xH100 and 8xH100.
I launch the model like this (this configuration gave me the best performance):
docker run -d --gpus all -p 0.0.0.0:8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.15.1 \
  --model MiniMaxAI/MiniMax-M2.5 --port 8000 \
  --max-model-len 128000 --max-num-seqs 64 \
  --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 --swap-space 16 \
  --enable-expert-parallel --trust_remote_code
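During the runs I also track in-flight requests by polling the server's Prometheus /metrics endpoint (that is how I get peak concurrency). A rough sketch of what I do; the metric names are assumed from the vLLM exporter and the URL/port match the command above:

```python
# Rough sketch: poll vLLM's /metrics endpoint and track peak concurrency.
# Metric names (vllm:num_requests_running / vllm:num_requests_waiting) are
# assumed from the vLLM Prometheus exporter; adjust if your version differs.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def read_gauge(text: str, name: str) -> float:
    # Prometheus exposition format: 'vllm:num_requests_running{...} 23.0'
    m = re.search(rf"^{re.escape(name)}\S*\s+([0-9.eE+-]+)$", text, re.MULTILINE)
    return float(m.group(1)) if m else 0.0

peak = 0.0
for _ in range(600):  # poll once per second for ~10 minutes
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    running = read_gauge(body, "vllm:num_requests_running")
    waiting = read_gauge(body, "vllm:num_requests_waiting")
    peak = max(peak, running)
    print(f"running={running:.0f} waiting={waiting:.0f} peak={peak:.0f}")
    time.sleep(1)
```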
vLLM Performance Regression: v0.14.0 vs v0.15.1
Setup: MiniMax-M2.5, 4xGPU, TP=4, --max-num-seqs 64, --gpu-memory-utilization 0.9
Workload: 150 requests, 6s interval, ~37.5K input tokens, 4K output tokens, ~90% prefix cache hit
┌────────────────────────┬───────────┬───────────┬──────────────────┐
│ Metric │ v0.14.0 │ v0.15.1 │ Diff │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Latency mean │ 97.0s │ 113.3s │ +16.8% worse │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Latency median │ 95.9s │ 118.9s │ +24.0% worse │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Latency p95 │ 107.4s │ 129.6s │ +20.7% │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Latency p99 │ 130.0s │ 152.3s │ +17.2% │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ TTFT mean │ 3.08s │ 3.56s │ +15.6% │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ TTFT median │ 1.63s │ 1.57s │ ~same │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ TTFT p95 │ 8.74s │ 15.12s │ +73.0% worse │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Out tok/s mean │ 41.6 │ 36.1 │ -13.2% slower │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Out tok/s median │ 41.7 │ 33.6 │ -19.4% slower │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Out tok/s (gen) median │ 42.6 │ 34.3 │ -19.5% │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ System throughput │ 617 tok/s │ 617 tok/s │ same │
├────────────────────────┼───────────┼───────────┼──────────────────┤
│ Peak concurrency │ 23 │ 26 │ +13% more pileup │
└────────────────────────┴───────────┴───────────┴──────────────────┘
Key findings:
- ~19% per-request throughput regression at steady-state (median: 41.7 vs 33.6 tok/s)
- 24% higher median latency (95.9s vs 118.9s)
- TTFT p95 up 73% (8.7s vs 15.1s) - worst-case scheduling delays are substantially worse
- System-wide throughput is the same (617 tok/s), meaning v0.15.1 compensates with more concurrent requests (26 vs 23 peak), but individual user experience degrades significantly
I can share the scripts for reproduction if needed. In the meantime, a minimal sketch of the request pattern is below (hypothetical prompt/helper names, not the exact script): 150 requests started at 6s intervals against the OpenAI-compatible endpoint, streamed so that TTFT and per-request output tok/s can be measured.
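```python
# Minimal sketch of the load pattern (not the exact benchmark script).
# Assumptions: vLLM's OpenAI-compatible server on localhost:8000, the
# openai>=1.x Python client, and a long shared-prefix prompt to approximate
# the ~37.5K-token inputs with a high prefix-cache hit rate.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

BASE_URL = "http://localhost:8000/v1"
MODEL = "MiniMaxAI/MiniMax-M2.5"
NUM_REQUESTS = 150
INTERVAL_S = 6        # one new request every 6 seconds
MAX_TOKENS = 4096     # ~4K output tokens per request

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")

async def one_request(prompt: str) -> dict:
    t0 = time.perf_counter()
    ttft = None
    n_chunks = 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=MAX_TOKENS,
        stream=True,
    )
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - t0   # time to first token
        n_chunks += 1                          # rough proxy: ~1 token per chunk
    latency = time.perf_counter() - t0
    gen_time = max(latency - (ttft or 0.0), 1e-9)
    return {"latency": latency, "ttft": ttft, "out_tok_s": n_chunks / gen_time}

async def main():
    # Repeated shared text as a stand-in for the real long-context prompt.
    prompt = "shared context ... " * 5000
    tasks = []
    for _ in range(NUM_REQUESTS):
        tasks.append(asyncio.create_task(one_request(prompt)))
        await asyncio.sleep(INTERVAL_S)
    results = await asyncio.gather(*tasks)
    lat = sorted(r["latency"] for r in results)
    print("median latency:", lat[len(lat) // 2])

asyncio.run(main())
```

Running the same sketch against the v0.14.0 and v0.15.1 containers (only the image tag changes) is enough to reproduce the per-request numbers in the table above on my setup.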