vLLM is a high-throughput inference engine for LLMs. It’s optimized for production serving with features like continuous batching, PagedAttention, and tensor parallelism.

Quick Start

pip install vllm
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --port 8000
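Because vLLM serves an OpenAI-compatible API, any OpenAI-style request body works against it. A minimal sketch of the JSON payload you would POST to /v1/chat/completions on the server started above (prompt and max_tokens are illustrative):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The "model" value must match the name passed to `vllm serve`.
payload = {
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "messages": [
        {"role": "user", "content": "Write a hello-world in Python."}
    ],
    "max_tokens": 256,
}

# Serialized body, ready to send with any HTTP client.
body = json.dumps(payload)
print(body)
```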

Config

{
  "providers": {
    "vllm": {
      "baseUrl": "http://localhost:8000/v1",
      "api": "openai-completions",
      "models": [
        {
          "id": "Qwen/Qwen2.5-Coder-32B-Instruct",
          "name": "Qwen 2.5 Coder 32B",
          "contextWindow": 131072,
          "maxTokens": 8192,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}
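The config above is plain JSON, so it can be sanity-checked programmatically before use. A small self-contained sketch that embeds the same config and reads out the fields the provider depends on:

```python
import json

# The provider config from above, embedded so this check is self-contained.
config = json.loads("""
{
  "providers": {
    "vllm": {
      "baseUrl": "http://localhost:8000/v1",
      "api": "openai-completions",
      "models": [
        {
          "id": "Qwen/Qwen2.5-Coder-32B-Instruct",
          "name": "Qwen 2.5 Coder 32B",
          "contextWindow": 131072,
          "maxTokens": 8192,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}
""")

vllm = config["providers"]["vllm"]
model_ids = [m["id"] for m in vllm["models"]]
print(vllm["baseUrl"], model_ids)
```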

Use it

{
  "agents": [
    { "name": "local-coder", "model": "vllm:Qwen/Qwen2.5-Coder-32B-Instruct" }
  ]
}
The model ID must exactly match the model name you used when starting the vLLM server.
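That match can be verified mechanically: vLLM's /v1/models endpoint reports the model names the server was started with, and the agent's model reference must agree once the provider prefix is stripped. A sketch of the comparison (the server response is stubbed here; in practice it would come from an HTTP GET to the endpoint):

```python
# Stubbed response from GET http://localhost:8000/v1/models
# (in practice, fetch and JSON-decode it with any HTTP client).
served = {"data": [{"id": "Qwen/Qwen2.5-Coder-32B-Instruct"}]}

agent_model = "vllm:Qwen/Qwen2.5-Coder-32B-Instruct"

# Strip the "vllm:" provider prefix, then require an exact match.
provider, _, model_id = agent_model.partition(":")
served_ids = {m["id"] for m in served["data"]}
ok = model_id in served_ids
print(provider, model_id, ok)
```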

Auto-Discovery

polpo models scan
Scans localhost:8000 for running vLLM instances and lists available models.
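Conceptually, the scan just probes candidate ports and asks each one for its model list via /v1/models. A simplified sketch of that logic (the probe is stubbed so the example is self-contained; the real `polpo models scan` may differ in detail):

```python
def discover(ports, probe):
    """Return {port: model_ids} for each port where an OpenAI-compatible
    /v1/models endpoint answered. `probe` performs the actual HTTP GET."""
    found = {}
    for port in ports:
        models = probe(port)
        if models:
            found[port] = models
    return found

# Stub probe: pretend only port 8000 is serving vLLM.
def fake_probe(port):
    return ["Qwen/Qwen2.5-Coder-32B-Instruct"] if port == 8000 else None

result = discover([8000, 8001, 11434], fake_probe)
print(result)
```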

Provider Details

Provider ID    vllm (custom)
Default port   8000
API type       openai-completions
Base URL       http://localhost:8000/v1
API key        Not required
Cost           Free (self-hosted)

When to Use vLLM vs. Ollama

                   Ollama                      vLLM
Setup              One command                 Python + pip
Model management   Built-in (ollama pull)      Manual (HuggingFace paths)
Throughput         Good                        Higher (continuous batching)
Multi-GPU          Limited                     Tensor parallelism
Best for           Development, single user    Production, high throughput

Notes

  • vLLM supports tensor parallelism across multiple GPUs — ideal for serving large models (70B+) in production.
  • For multi-model serving, run multiple vLLM instances on different ports and configure each as a separate provider.
  • vLLM’s OpenAI-compatible endpoint supports tool calling, making it fully compatible with Polpo’s agent system.
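For the multi-model setup described above, each vLLM instance gets its own port and its own provider entry. A hypothetical two-provider config (the provider names, the second model, and port 8001 are illustrative examples, not requirements):

```json
{
  "providers": {
    "vllm-coder": {
      "baseUrl": "http://localhost:8000/v1",
      "api": "openai-completions",
      "models": [
        { "id": "Qwen/Qwen2.5-Coder-32B-Instruct", "name": "Qwen 2.5 Coder 32B", "contextWindow": 131072, "maxTokens": 8192 }
      ]
    },
    "vllm-chat": {
      "baseUrl": "http://localhost:8001/v1",
      "api": "openai-completions",
      "models": [
        { "id": "Qwen/Qwen2.5-7B-Instruct", "name": "Qwen 2.5 7B", "contextWindow": 32768, "maxTokens": 8192 }
      ]
    }
  }
}
```

Agents then reference each provider by its name, e.g. `vllm-coder:Qwen/Qwen2.5-Coder-32B-Instruct`.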