
Remote LLM Server — Lightweight Ollama-powered LLM backend

Released Mar 2026
Docker · Ollama · LLM · GPU · API · Local AI · docker-compose · NVIDIA

About the Project

Remote LLM Server — Dockerized Ollama API

A lightweight, GPU-accelerated Ollama server running in Docker, designed to serve local LLMs to multiple clients across your network — no need to install models on every device.

Motivation

When building Anagnosi, I needed a reliable way to run large language models locally and serve them to multiple clients without duplicating model storage or setup on each machine. This project solves that with a single docker compose up.

What It Does

  • Runs Ollama inside a Docker container with full GPU passthrough via the NVIDIA Container Toolkit
  • Exposes a REST API on port 11434, accessible from any device on your local network
  • Persists downloaded models in a named Docker volume so they survive container restarts
  • Keeps models loaded in memory for 24 hours (OLLAMA_KEEP_ALIVE=24h) to eliminate cold-start latency between requests; the compose sketch below shows how these pieces fit together
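
All of the above fits in a single docker-compose.yml. A minimal sketch: the image, port, model path, and GPU reservation follow Ollama's published Docker setup, while names like ollama_models are illustrative.

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0        # listen on all interfaces, not just localhost
      - OLLAMA_KEEP_ALIVE=24h      # keep loaded models resident in memory
    volumes:
      - ollama_models:/root/.ollama   # named volume; models survive container rebuilds
    deploy:                           # optional; remove this block on CPU-only machines
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_models: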

Stack

Layer               Technology
Runtime             Docker + Compose
LLM backend         Ollama
GPU acceleration    NVIDIA Container Toolkit
API                 Ollama REST API (OpenAI-compatible)

Key Design Decisions

OLLAMA_HOST=0.0.0.0 — By default Ollama only listens on localhost. Setting this explicitly makes the API reachable from other machines on the network, which is the whole point of this setup.
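
A quick way to verify reachability is to hit the tags endpoint from another machine on the LAN (the address is a placeholder, as in the Usage section below):

# Should return a JSON list of the models installed on the server
curl http://<server-ip>:11434/api/tags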

Named volume for model storage — Models can be 4–30 GB each. Using a Docker volume keeps them outside the container lifecycle, so docker compose down and up don't require re-pulling everything.
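
To illustrate, a full teardown and restart leaves previously pulled models in place (container and volume names match the compose sketch above):

# Remove the container; the named volume is untouched
docker compose down
docker compose up -d

# Previously pulled models are still listed
docker exec -it ollama ollama list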

No GPU? No problem — The deploy.resources block is optional. Ollama falls back to CPU automatically, making this setup usable for testing or smaller models (≤4B parameters) on any machine.
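
To check which mode is actually in use, the startup logs report the detected compute device; and if the NVIDIA runtime injected its utility binaries, nvidia-smi works from inside the container:

# Ollama logs whether it found a GPU or fell back to CPU
docker logs ollama

# Shows the GPU from inside the container (only with the deploy block in place)
docker exec -it ollama nvidia-smi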

Security boundary is the LAN — The server is intentionally designed for trusted local networks. Exposing port 11434 to the public internet without a reverse proxy and authentication is explicitly discouraged.
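
If the host also faces other networks, a host firewall can enforce that boundary. One illustrative option using ufw (the subnet is an assumption; adjust it to your LAN):

# Allow only the local subnet to reach Ollama, deny everything else
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp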

Usage

# Start the server
docker compose up -d

# Pull a model
docker exec -it ollama ollama pull qwen3:4b

# Query from any device on your network
curl http://<server-ip>:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:4b", "prompt": "Hello!", "stream": false}'

Outcome

A single-command deployment that turns any machine with a GPU (or even just a CPU) into a private LLM API server — ready to back any local AI application without cloud costs or data leaving your network.