
Remote LLM Server — Lightweight Ollama-powered LLM backend

Released Mar 2026
Docker · Ollama · LLM · GPU · API · Local AI · docker-compose · NVIDIA

About the Project

Remote LLM Server — Dockerized Ollama API

A lightweight, GPU-accelerated Ollama server running in Docker, designed to serve local LLMs to multiple clients across your network — no need to install models on every device.

Motivation

When building Anagnosi, I needed a reliable way to run large language models locally and serve them to multiple clients without duplicating model storage or setup on each machine. This project solves that with a single docker compose up.

What It Does

  • Runs Ollama inside a Docker container with full GPU passthrough via the NVIDIA Container Toolkit
  • Exposes a REST API on port 11434, accessible from any device on your local network
  • Persists downloaded models in a named Docker volume so they survive container restarts
  • Keeps models loaded in memory for 24 hours (OLLAMA_KEEP_ALIVE=24h) to eliminate cold-start latency between requests; the compose sketch below shows how these pieces fit together
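
All of the above fits in a single docker-compose.yml. A minimal sketch: the image, port, model path, and GPU reservation follow Ollama's published Docker setup, while names like ollama_models are illustrative.

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0        # listen on all interfaces, not just localhost
      - OLLAMA_KEEP_ALIVE=24h      # keep loaded models resident in memory
    volumes:
      - ollama_models:/root/.ollama   # named volume; models survive container rebuilds
    deploy:                           # optional; remove this block on CPU-only machines
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_models: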

Stack

Layer               Technology
Runtime             Docker + Compose
LLM backend         Ollama
GPU acceleration    NVIDIA Container Toolkit
API                 Ollama REST API (OpenAI-compatible)

Key Design Decisions

OLLAMA_HOST=0.0.0.0 — By default Ollama only listens on localhost. Setting this explicitly makes the API reachable from other machines on the network, which is the whole point of this setup.
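
A quick way to verify reachability is to hit the tags endpoint from another machine on the LAN (the address is a placeholder, as in the Usage section below):

# Should return a JSON list of the models installed on the server
curl http://<server-ip>:11434/api/tags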

Named volume for model storage — Models can be 4–30 GB each. Using a Docker volume keeps them outside the container lifecycle, so docker compose down and up don't require re-pulling everything.
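
To illustrate, a full teardown and restart leaves previously pulled models in place (container and volume names match the compose sketch above):

# Remove the container; the named volume is untouched
docker compose down
docker compose up -d

# Previously pulled models are still listed
docker exec -it ollama ollama list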

No GPU? No problem — The deploy.resources block is optional. Ollama falls back to CPU automatically, making this setup usable for testing or smaller models (≤4B parameters) on any machine.
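
To check which mode is actually in use, the startup logs report the detected compute device; and if the NVIDIA runtime injected its utility binaries, nvidia-smi works from inside the container:

# Ollama logs whether it found a GPU or fell back to CPU
docker logs ollama

# Shows the GPU from inside the container (only with the deploy block in place)
docker exec -it ollama nvidia-smi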

Security boundary is the LAN — The server is intentionally designed for trusted local networks. Exposing port 11434 to the public internet without a reverse proxy and authentication is explicitly discouraged.
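
If the host also faces other networks, a host firewall can enforce that boundary. One illustrative option using ufw (the subnet is an assumption; adjust it to your LAN):

# Allow only the local subnet to reach Ollama, deny everything else
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp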

Usage

# Start the server
docker compose up -d

# Pull a model
docker exec -it ollama ollama pull qwen3:4b

# Query from any device on your network
curl http://<server-ip>:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:4b", "prompt": "Hello!", "stream": false}'

Outcome

A single-command deployment that turns any machine with a GPU (or even just a CPU) into a private LLM API server — ready to back any local AI application without cloud costs or data leaving your network.