Text Generation Inference (TGI) is Hugging Face’s production-grade inference server. It supports continuous batching, tensor parallelism, and quantization.

Setup

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-Coder-32B-Instruct
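Once the container is up, requests go to the OpenAI-compatible endpoint at `/v1/chat/completions`. A minimal sketch of building such a request with Python's standard library (the `build_chat_request` helper and the placeholder model id are illustrative assumptions, not part of TGI or Polpo):

```python
import json
import urllib.request

def build_chat_request(base_url="http://localhost:8080/v1"):
    # Standard OpenAI-style chat-completions payload; the model id is a
    # placeholder, since a single-model TGI server serves whatever was
    # loaded via --model-id.
    payload = {
        "model": "tgi-model",
        "messages": [{"role": "user", "content": "Write hello world in C."}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request()
print(req.full_url)  # http://localhost:8080/v1/chat/completions
```

Sending the request with `urllib.request.urlopen(req)` requires the container from the Setup step to be running.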

Config

{
  "providers": {
    "tgi": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "models": [
        {
          "id": "tgi-model",
          "name": "Qwen 2.5 Coder 32B (TGI)",
          "contextWindow": 131072,
          "maxTokens": 8192,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}
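Client code can read the provider entry straight from this JSON. A small sketch, with the config above embedded inline for illustration:

```python
import json

# The provider config from above, embedded as a string for a
# self-contained example.
CONFIG = """
{
  "providers": {
    "tgi": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "models": [
        {
          "id": "tgi-model",
          "name": "Qwen 2.5 Coder 32B (TGI)",
          "contextWindow": 131072,
          "maxTokens": 8192,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}
"""

tgi = json.loads(CONFIG)["providers"]["tgi"]
print(tgi["baseUrl"])                     # http://localhost:8080/v1
print(tgi["models"][0]["contextWindow"])  # 131072
```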

Use it

{
  "agents": [
    { "name": "local-coder", "model": "tgi:tgi-model" }
  ]
}

Auto-Discovery

polpo models scan
Scans localhost:8080 for a running TGI instance.
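The scan behavior is not specified beyond the default port; conceptually it amounts to probing whether something is listening there. A minimal sketch of such a probe (a hypothetical helper, not Polpo's actual implementation):

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A real scanner would follow a successful connection with an HTTP request to confirm the service is actually TGI rather than some other server on port 8080.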

Provider Details

Provider ID: tgi (custom)
Default port: 8080
API type: openai-completions
Base URL: http://localhost:8080/v1
API key: Not required
Cost: Free (self-hosted)

Notes

  • TGI is optimized for production serving with features like continuous batching and speculative decoding.
  • Docker with --gpus all is the recommended deployment method.
  • TGI supports the OpenAI-compatible endpoint at /v1/chat/completions, which Polpo uses via the openai-completions API mode.