Self-hosting

Run useknockout on your own GPU

Bare-metal escape hatch. If you already pay for a GPU server (Hetzner, Lambda, on-prem) and want zero per-call cost, this is the path. The Modal path is easier to set up; this one is cheaper at steady-state scale.

Choose this path if you have ≥ 1 idle GPU and predictable load. For bursty traffic, Modal's autoscale-to-zero usually wins on cost.

1. Prerequisites

  • NVIDIA GPU with ≥ 12 GB VRAM (RTX 3090, 4090, A10, A40, A100, L4, L40)
  • NVIDIA driver ≥ 535, CUDA 12.x
  • Linux (tested on Ubuntu 22.04, Debian 12)
  • Docker with NVIDIA Container Toolkit, or Python 3.10+ for the from-source path
  • ~10 GB disk for weights + dependencies
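
Quick sanity check that the driver and the container toolkit are wired up (the CUDA base image tag is just an example; any recent one works):

# Driver and CUDA runtime visible on the host?
nvidia-smi

# Can containers see the GPU? (Docker path only)
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi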

2. Run

Two options. Docker is the smoothest path because the image bundles the CUDA dependencies and caches the model weights after the first start. From-source gives you full control over versions and lets you patch in custom routes; a sketch follows the Docker steps.

# Pull the image
docker pull ghcr.io/useknockout/api:latest

# Run with GPU access. Adjust --gpus flag for your CUDA setup.
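# Generate the admin token first, e.g.: openssl rand -hex 16  (yields 32 hex chars)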
docker run -d --name knockout \
  --gpus all \
  -p 8000:8000 \
  -e KNOCKOUT_ADMIN_TOKEN=<random 32 chars> \
  ghcr.io/useknockout/api:latest

# Verify
curl http://localhost:8000/health

First start downloads the BiRefNet weights (~880 MB); subsequent starts reuse the cached copy and are near-instant.
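
For the from-source path, a minimal sketch assuming the repo layout implied by the update steps in section 6 (the clone URL and the uvicorn entrypoint are assumptions; adjust to the actual repo):

# Clone URL is an assumption; use the real repo.
git clone https://github.com/useknockout/api && cd api
pip install -r requirements.txt

# Entrypoint app.main:app is an assumption (app/model.py exists per section 4).
uvicorn app.main:app --port 8000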

3. Reverse proxy + TLS

Don't expose port 8000 directly. nginx + Let's Encrypt = free, 5-minute setup.

# /etc/nginx/sites-available/knockout
server {
    listen 80;
    server_name api.yourcompany.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.yourcompany.com;

    ssl_certificate     /etc/letsencrypt/live/api.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourcompany.com/privkey.pem;

    client_max_body_size 12M;
    proxy_read_timeout 60s;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
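
On Debian/Ubuntu, enable the site and validate the config before reloading:

sudo ln -s /etc/nginx/sites-available/knockout /etc/nginx/sites-enabled/
sudo nginx -t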

# Issue cert + reload
sudo certbot --nginx -d api.yourcompany.com
sudo systemctl reload nginx
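
Smoke-test from outside the box; the redirect and the proxied health endpoint should both answer:

curl -I http://api.yourcompany.com        # expect a 301 to https
curl https://api.yourcompany.com/health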

4. Performance tuning

Stock config processes ~5 images/sec on an L4. With fp16 + compile + request batching you can push 12-15 images/sec; the snippet below covers fp16 + compile, and a batching sketch follows it.

# In app/model.py — flip to half precision and compile the forward pass.
# ~2x throughput, ~30% lower VRAM, no quality loss for BiRefNet.

import torch
from BiRefNet.models.birefnet import BiRefNet

model = BiRefNet(bb_pretrained=False)
# Load weights on CPU first, then move the half-precision copy to the GPU.
model.load_state_dict(torch.load("./weights/birefnet/model.pth", map_location="cpu"))
model = model.half().cuda()
model = torch.compile(model, mode="reduce-overhead")
model.eval()
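
Request batching is the remaining lever. A minimal illustration of the idea, not the app's actual code (queue names, batch size, and the shape of the model's output are assumptions; model is the compiled model from the snippet above): hold the first request a few milliseconds, run one forward pass over the stacked batch, fan the results back out.

import queue
import threading

import torch

MAX_BATCH = 8        # tune for your GPU / latency budget
MAX_WAIT_S = 0.010   # how long to hold the first request while the batch fills

# Each item is (input_tensor, reply_queue); callers block on their reply queue.
requests: queue.Queue = queue.Queue()

def batch_worker(model):
    while True:
        img, reply = requests.get()              # block until work arrives
        batch, replies = [img], [reply]
        try:
            while len(batch) < MAX_BATCH:        # briefly drain the queue
                img, reply = requests.get(timeout=MAX_WAIT_S)
                batch.append(img)
                replies.append(reply)
        except queue.Empty:
            pass
        with torch.inference_mode():
            preds = model(torch.stack(batch).half().cuda())
        for pred, reply in zip(preds, replies):  # fan results back out
            reply.put(pred)

threading.Thread(target=batch_worker, args=(model,), daemon=True).start()

A request handler then does requests.put((tensor, reply)) and blocks on reply.get().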

5. When to pick this over Modal

Concern               Modal             BYO GPU
Cost at 100k img/mo   ~$30              ~$0 (sunk GPU)
Cost at 10M img/mo    ~$2,500           ~$0 (sunk GPU)
Cold start            ~3s               0s (always warm)
Ops overhead          Minimal           You manage updates, security, scaling
Best for              Bursty traffic    Steady-state high volume
Time to first call    ~3 min            ~30 min

6. Updating

# Docker
docker pull ghcr.io/useknockout/api:latest
docker stop knockout && docker rm knockout
# re-run the docker run command from above

# From source
cd api
git pull
pip install -r requirements.txt
sudo systemctl restart knockout

Pin the image tag in production: ghcr.io/useknockout/api:v1.4.0 instead of latest. Auto-pulling latest bites you when a regression ships.
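
The from-source update step assumes a systemd unit named knockout, which this section never defines. A minimal sketch (install path and entrypoint are placeholders; adjust to your checkout):

sudo tee /etc/systemd/system/knockout.service >/dev/null <<'EOF'
[Unit]
Description=knockout background-removal API
After=network.target

[Service]
WorkingDirectory=/opt/knockout/api
Environment=KNOCKOUT_ADMIN_TOKEN=<random 32 chars>
ExecStart=/usr/bin/python3 -m uvicorn app.main:app --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now knockout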