Self-hosting
Run useknockout on your own GPU
Bare-metal escape hatch. If you already pay for a GPU server (Hetzner, Lambda, on-prem) and want zero per-call cost, this is the path. Modal is easier to set up; self-hosting is cheaper at steady-state scale.
Choose this path if you have ≥ 1 idle GPU and predictable load. For bursty traffic, Modal's autoscale-to-zero usually wins on cost.
1. Prerequisites
- NVIDIA GPU with ≥ 12 GB VRAM (RTX 3090, 4090, A10, A40, A100, L4, L40)
- NVIDIA driver ≥ 535, CUDA 12.x
- Linux (tested on Ubuntu 22.04, Debian 12)
- Docker with NVIDIA Container Toolkit, OR Python 3.10+
- ~10 GB disk for weights + dependencies
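A quick sanity check before installing anything: confirm the driver version and that Docker can see the GPU. The nvidia/cuda image tag below is illustrative; any CUDA 12.x base image works.
# Driver version + total VRAM
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# Confirm the NVIDIA Container Toolkit passes the GPU through
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi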
2. Run
Two options. Docker is the smoothest path because the image bundles the model weights and CUDA dependencies. Running from source gives you full control over versions and lets you patch in custom routes; a sketch follows the Docker steps below.
# Pull the image
docker pull ghcr.io/useknockout/api:latest
# Run with GPU access. Adjust --gpus flag for your CUDA setup.
docker run -d --name knockout \
  --gpus all \
  -p 8000:8000 \
  -e KNOCKOUT_ADMIN_TOKEN=<random 32 chars> \
  ghcr.io/useknockout/api:latest
# Verify
curl http://localhost:8000/health
First start downloads BiRefNet weights (~880 MB). Subsequent starts are instant.
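The admin token is just a shared secret; any 32-character random string works, and openssl rand -hex 16 emits exactly that. For the from-source route, here is a rough sketch. The uvicorn entrypoint app.main:app is a guess based on the app/ layout referenced in section 4, so check the repo README.
# Generate a 32-character token (16 random bytes, hex-encoded)
openssl rand -hex 16

# From source (entrypoint is an assumption; verify against the repo)
cd api
pip install -r requirements.txt
KNOCKOUT_ADMIN_TOKEN=$(openssl rand -hex 16) uvicorn app.main:app --host 0.0.0.0 --port 8000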
3. Reverse proxy + TLS
Don't expose port 8000 directly. nginx + Let's Encrypt = free, 5-minute setup.
# /etc/nginx/sites-available/knockout
server {
    listen 80;
    server_name api.yourcompany.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name api.yourcompany.com;

    ssl_certificate     /etc/letsencrypt/live/api.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourcompany.com/privkey.pem;

    client_max_body_size 12M;
    proxy_read_timeout   60s;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
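Enable the site and validate the config before requesting a certificate (paths assume the Debian/Ubuntu sites-available layout used above):
sudo ln -s /etc/nginx/sites-available/knockout /etc/nginx/sites-enabled/knockout
sudo nginx -t && sudo systemctl reload nginx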
# Issue cert + reload
sudo certbot --nginx -d api.yourcompany.com
sudo systemctl reload nginx
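Certbot's distro packages install an automatic renewal timer (systemd or cron); confirm it works with a dry run:
sudo certbot renew --dry-run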
4. Performance tuning
Stock config processes ~5 images/sec on an L4. With fp16 + compile + request batching, you can push 12-15 images/sec.
# In app/model.py — flip to half precision and compile the forward pass.
# ~2x throughput, ~30% lower VRAM, no quality loss for BiRefNet.
import torch
from BiRefNet.models.birefnet import BiRefNet

# Load weights on CPU first, then cast to fp16 and move to the GPU.
model = BiRefNet(bb_pretrained=False)
model.load_state_dict(torch.load("./weights/birefnet/model.pth", map_location="cpu"))
model = model.half().cuda()
model = torch.compile(model, mode="reduce-overhead")
model.eval()
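While load-testing the tuned server, check whether the GPU is actually the bottleneck; if utilization sits well below 100%, the remaining gains come from request batching rather than more model tuning:
# Sample GPU utilization (u) and memory (m) once per second
nvidia-smi dmon -s um -d 1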
5. When to pick this over Modal

| Concern | Modal | BYO GPU |
|---|---|---|
| Cost at 100k img/mo | ~$30 | ~$0 (sunk GPU) |
| Cost at 10M img/mo | ~$2,500 | ~$0 (sunk GPU) |
| Cold start | ~3s | 0s (always warm) |
| Ops overhead | Minimal | You manage updates, security, scaling |
| Best for | Bursty traffic | Steady-state high volume |
| Time to first call | ~3 min | ~30 min |
6. Updating
# Docker
docker pull ghcr.io/useknockout/api:latest
docker stop knockout && docker rm knockout
# re-run the docker run command from above

# From source
cd api
git pull
pip install -r requirements.txt
sudo systemctl restart knockout
Pin the image tag in production: ghcr.io/useknockout/api:v1.4.0 instead of latest. Auto-pulling latest bites you when a regression ships.
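Tags can be re-pushed, so for stronger guarantees than a version tag, pin the image digest (the sha256 value below is a placeholder):
# Look up the digest you are currently running
docker images --digests ghcr.io/useknockout/api
# Pull by digest (immutable; survives tag re-pushes)
docker pull ghcr.io/useknockout/api@sha256:<digest>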