Run AI models on your own machine. Pick a model, run one command, get a chat interface and an API — no cloud, no accounts, no data leaving your device.
Works with OpenCode, VS Code Copilot, Cursor, and any OpenAI-compatible client.
## Technical overview
Koda is a thin Makefile orchestration layer over llama.cpp. It manages a three-layer configuration system (.env defaults → profiles/.env-<model>.<quant> → inline overrides) and resolves model paths without triggering implicit downloads — checking MODEL_DIR first, then falling back to the Hugging Face cache via find.
make serve starts llama-server, which exposes both a built-in browser WebUI at http://localhost:8080 and an OpenAI-compatible HTTP API at http://localhost:8080/v1. The ALIAS variable pins a stable model ID, so external tools keep working even when you swap quantizations.
Deployment paths: native make (full GPU via Metal/CUDA/ROCm) or Docker Compose using the official ghcr.io/ggml-org/llama.cpp image (GPU on NVIDIA/AMD Linux only). Traefik HTTPS is opt-in via compose.traefik.yml; Caddy or Tailscale cover the native path.
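A profile is just an env file of variables. A minimal sketch (the file name and MODEL value here are hypothetical; ALIAS, PORT, and CTX are variables used elsewhere in this README, and AGENTS.md has the authoritative list):

```shell
# Hypothetical profile: profiles/.env-Example-7B.Q4_K_M
MODEL=Example-7B.Q4_K_M.gguf   # GGUF weights file to serve
ALIAS=example-7b               # stable model ID exposed to API clients
PORT=8080                      # WebUI and API port
CTX=8192                       # context window size
```

Inline variables passed to make override these profile values, which in turn override the .env defaults.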
- 🚀 Quick Start
- 🛠️ Key Workflows
- 🐳 Docker Compose
- 🛡️ Security & Privacy
- 📚 Documentation Index
- 🏗️ Built With
## 🚀 Quick Start

**Option A: macOS (Homebrew)**

```shell
brew install git llama.cpp huggingface-cli fzf
```

**Option B: Windows (winget)**

```shell
winget install ggml-org.llama.cpp
winget install junegunn.fzf
winget install Python.Python.3
pip install huggingface_hub[cli]
```

`make` is required on Windows. Use WSL, then inside WSL:

```shell
sudo apt update && sudo apt install git make
```
**Option C: Docker (no local binaries needed)**

```shell
docker compose --env-file profiles/.env-Qwen3.5-27B.Q4_K_M up -d
```

See Docker Compose for GPU support details.
```shell
git clone https://github.com/a1exus/koda.git && cd koda
make check
```

Pick a model profile from profiles/README.md, then:

```shell
make download ENV=profiles/.env-Qwen3.5-27B.Q4_K_M
make serve ENV=profiles/.env-Qwen3.5-27B.Q4_K_M
```

Your server is now live:
- WebUI: http://localhost:8080
- API: http://localhost:8080/v1 (OpenAI-compatible)
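The /v1 endpoint accepts standard OpenAI-style requests. A minimal sketch with curl (the model value is a placeholder and must match the ALIAS in your active profile):

```shell
# Assumes `make serve` is running; replace qwen3.5-27b with your profile's ALIAS.
body='{"model": "qwen3.5-27b", "messages": [{"role": "user", "content": "Hello!"}]}'
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$body"
```

If you started the server with an API_KEY, add `-H "Authorization: Bearer <your-key>"` to the request.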
Smart Path Resolution: Koda looks for the model in `MODEL_DIR` first, then falls back to the Hugging Face cache — no need to move files manually.
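That lookup order can be sketched in shell (a simplified illustration of the behavior described above; the real Makefile logic and exact cache path may differ):

```shell
# Resolve a model file: check MODEL_DIR first, then search the HF cache.
# Never downloads anything — a cache miss just returns an empty result.
resolve_model() {
  name="$1"
  if [ -n "$MODEL_DIR" ] && [ -f "$MODEL_DIR/$name" ]; then
    echo "$MODEL_DIR/$name"
  else
    find "${HF_HOME:-$HOME/.cache/huggingface}" -name "$name" 2>/dev/null | head -n 1
  fi
}
```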
Tip: Use `make list` to see all profiles or `make select` for an interactive picker.
## 🛠️ Key Workflows

Every command requires an ENV file pointing to a model profile in `profiles/`. Koda prepends `profiles/` automatically, so `ENV=.env-gemma-4-31B-it.Q4_K_M` works.
| Command | What it does |
|---|---|
| `make serve` | Starts the WebUI and OpenAI-compatible API server |
| `make chat` | Launches an interactive terminal session with the model |
| `make download` | Fetches model weights from Hugging Face using `hf` |
| `make list` | Lists all available model profiles in `profiles/` |
| `make select` | Interactively selects a model profile (requires `fzf` or `gum`) |
| `make cache` | Shows which models are in the local Hugging Face cache |
| `make check` | Verifies required binaries are installed and on PATH |
| `make check-model` | Verifies the model file for the given ENV is present |
Pass variables inline to any make target:

```shell
# Change port and restrict context window size
make serve ENV=profiles/.env-Qwen3.5-27B.Q4_K_M PORT=9090 CTX=8192

# Require an API key and expose metrics
make serve ENV=profiles/.env-Qwen3.5-27B.Q4_K_M API_KEY=my-secret METRICS=1

# Speculative decoding with a draft model
make serve ENV=profiles/.env-Qwen3.5-27B.Q4_K_M DRAFT_MODEL=./draft.gguf
```

See AGENTS.md for the full list of supported variables.
## 🐳 Docker Compose

The Docker path requires only Docker — no make, no brew, no local binaries. The official ghcr.io/ggml-org/llama.cpp image is used.

```shell
docker compose --env-file profiles/.env-Qwen3.5-27B.Q4_K_M up -d
```

| Platform | GPU in Docker | Notes |
|---|---|---|
| NVIDIA (Linux) | ✅ Full | Requires NVIDIA Container Toolkit. compose.yaml passes --gpus all automatically. |
| AMD (Linux) | ✅ Full | Set LLAMA_CPP_IMAGE=ghcr.io/ggml-org/llama.cpp:server-rocm in .env. |
| Apple Silicon (macOS) | ❌ CPU only | Docker on macOS runs in a Linux VM — Metal/GPU is not accessible. |
| Windows | ❌ CPU only | Same VM limitation. NVIDIA passthrough is possible via WSL2 but not officially supported here. |
Apple Silicon and Windows users: use the native `make` path (Options A / B above) to get GPU acceleration. Docker is fine for CPU-only use or quick testing.
See GEMINI.md for full Docker usage and configuration details.
## 🛡️ Security & Privacy

Koda is local-first — your data never leaves your machine.
- Privacy: No telemetry, no tracking, no cloud dependencies.
- Integrity: Automated vulnerability and misconfiguration scanning via Trivy and GitHub Actions.
## 📚 Documentation Index

| File | Purpose |
|---|---|
| profiles/README.md | Catalog of bundled models, download links, and hardware requirements |
| AGENTS.md | Technical reference for developers and AI agents — all variables, targets, and behaviors |
| GEMINI.md | Full Docker Compose usage, volume sharing, GPU config, and override reference |
| OPENCODE.md | Integration guide for OpenCode |
| VSCODE.md | Integration guide for VS Code (Copilot BYOM, Continue, Roo) |
| CURSOR.md | Integration guide for Cursor (requires HTTPS — Traefik, Caddy, or Tailscale) |
| CADDY.md | HTTPS termination for native make serve (Apple Silicon, Windows) |
| TAILSCALE.md | Private remote access and multi-machine RPC pooling |
## 🏗️ Built With

Koda is a thin layer standing on the shoulders of giants:
| Project | Role |
|---|---|
| llama.cpp | Inference engine — provides llama-server (API + WebUI) and llama-cli (terminal chat) |
| huggingface-cli | Model downloader — `make download` uses `hf` to fetch GGUF files from Hugging Face |
| fzf | Interactive profile picker — primary backend for make select |
| gum | Interactive profile picker — alternative backend for make select if fzf is not installed |
| Docker Compose | Containerized deployment path — no local binaries required |
| Traefik | Reverse proxy — provides HTTPS termination in the Docker Compose path |
| Caddy | HTTPS termination for the native make serve path — required for Cursor on Apple Silicon and Windows where Docker GPU is unavailable |
| Tailscale | Private network — secure remote access and multi-machine RPC pooling |
| Trivy | Security scanning — automated vulnerability checks via GitHub Actions |
Curated by DimkaNYC | Instagram

Koda tooling is released under the Apache 2.0 License. Model weights belong to their respective creators.