Ollama is the easiest way to run large language models locally. It handles downloading, quantization, GPU acceleration, and serving models behind a simple CLI and REST API. Think of it as Docker for LLMs.
This tutorial covers everything from installation through advanced configuration, so you can use Ollama as a local AI backend for coding tools, chat interfaces, and your own projects.
Installation
Linux
The officially recommended method is the install script:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Warning
Piping scripts from the internet into your shell is a security risk — you are giving an external party root access to your machine. Always inspect the script first:
```bash
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh   # read it, understand what it does
sh install-ollama.sh
```

The script calls `sudo` internally only where it needs elevated privileges (installing the binary, creating a systemd service, etc.), so do not run it with `sudo` — that would give the entire script root access unnecessarily. It is open source on GitHub if you want to review it before downloading.
Alternatively, Ollama is available as a snap — no script piping required:
```bash
sudo snap install ollama
```

The snap is published by the Ollama team and handles updates automatically through the snap daemon.
Verify the installation either way:
```bash
ollama --version
```

macOS and Windows
On macOS, download the .zip from ollama.com/download and drag the app into /Applications. The CLI is installed automatically.
On Windows, download and run the .exe installer from the same page.
Basic CLI Usage
Run a model interactively (downloading it first if needed):
```bash
ollama run llama3.2
```

This drops you into a chat session. Type your prompt and press Enter. Use `"""` to begin and end multiline input. Type `/bye` to exit.
Here is the full command reference:
| Command | Description |
|---|---|
| `ollama run <model>` | Download (if needed) and run a model interactively |
| `ollama pull <model>` | Download a model without running it |
| `ollama list` | List all locally downloaded models |
| `ollama show <model>` | Show model info (architecture, quantization, template, license) |
| `ollama ps` | List models currently loaded in memory |
| `ollama stop <model>` | Unload a model from memory |
| `ollama rm <model>` | Delete a downloaded model |
| `ollama cp <src> <dest>` | Duplicate a model |
| `ollama create <name> -f <Modelfile>` | Create a custom model from a Modelfile |
| `ollama serve` | Start the Ollama server manually |
Inside an interactive session, you can also inspect the running model:
| Command | Description |
|---|---|
| `/show info` | Architecture and parameters |
| `/show modelfile` | The Modelfile used to build this model |
| `/show system` | System prompt |
| `/show template` | Prompt template |
| `/show license` | License text |
Managing Models
Pulling Models
Models follow a `name:tag` naming convention. The tag defaults to `latest` if omitted:
```bash
ollama pull llama3.2                    # default tag (typically q4_K_M quantization)
ollama pull llama3.2:1b                 # 1B parameter variant
ollama pull llama3:8b-instruct-q8_0     # specific quantization
```

Tags encode the parameter count, variant, and quantization level. The quantization suffixes control the tradeoff between model size and quality:
- `q4_0` — 4-bit, smallest and fastest
- `q4_K_M` — Ollama’s default for most models, best balance of quality and size
- `q5_K_M` — slightly better quality than `q4_K_M` if you have VRAM headroom
- `q8_0` — 8-bit, near-lossless quality, roughly double the size of q4

The K-quant variants (`_K_S`, `_K_M`, `_K_L`) use smarter quantization that preserves more accuracy at similar file sizes.
Inspecting and Removing Models
```bash
ollama list             # see all downloaded models with sizes
ollama show llama3.2    # architecture, context length, quantization, etc.
ollama ps               # which models are loaded, GPU/CPU split, memory usage
ollama rm llama3.2      # delete a model
```

Popular Models
Here are some worth knowing about:
| Model | Sizes | Notes |
|---|---|---|
| `llama3.2` | 1B, 3B | Compact Meta models, good for constrained hardware |
| `llama3.1` | 8B, 70B, 405B | Meta’s flagship, most-downloaded model on Ollama |
| `gemma3` | 1B–27B | Google, multimodal (text + vision) |
| `qwen3` | 0.6B–235B | Alibaba, dense and mixture-of-experts variants |
| `mistral` | 7B | Mistral AI, fast and capable for its size |
| `deepseek-r1` | 1.5B–671B | Strong reasoning, chain-of-thought |
| `phi4` | 14B | Microsoft, punches above its weight |
| `qwen2.5-coder` | 1.5B–32B | Code-specialized |
| `nomic-embed-text` | — | Embedding model for RAG pipelines |
| `llava` | 7B, 13B, 34B | Vision-language model |
Browse the full library at ollama.com/library.
Hardware, Model Sizes, and Performance
What “7B” Means
When a model is described as “7B” or “70B”, the number refers to billions of parameters — the learned weights that make up the neural network. More parameters generally means a more capable model, but also more memory and compute.
Think of it as a rough measure of the model’s “brain size”:
- 1B—3B: Can follow simple instructions, summarize text, and answer basic questions. Prone to hallucination and poor at complex reasoning. Useful for lightweight tasks, autocomplete, or hardware-constrained environments.
- 7B—8B: The sweet spot for most local use. Good at conversation, coding assistance, and general knowledge. Runs comfortably on a single consumer GPU.
- 13B—14B: Noticeably smarter than 7-8B. Better at nuanced reasoning, longer documents, and following complex instructions.
- 27B—32B: Approaching the quality of cloud-hosted models for many tasks. Requires a high-end GPU or Apple Silicon with 32+ GB of memory.
- 70B+: Near frontier-model quality. Requires professional hardware, multi-GPU setups, or a Mac with 64+ GB of unified memory.
Quantization and File Size
You do not run models at full precision. Ollama uses quantization to compress model weights from 16-bit floats down to 4-8 bits, dramatically reducing file size and memory usage with surprisingly little quality loss.
The formula is straightforward: size in GB ≈ (parameters in billions × bits per weight) / 8. Here is what that looks like in practice:
| Parameters | q4_0 (4-bit) | q4_K_M (~4.5-bit) | q5_K_M (~5.5-bit) | q8_0 (8-bit) | f16 (16-bit) |
|---|---|---|---|---|---|
| 1B | ~0.6 GB | ~0.7 GB | ~0.8 GB | ~1.1 GB | ~2 GB |
| 3B | ~1.7 GB | ~2.0 GB | ~2.3 GB | ~3.2 GB | ~6 GB |
| 7-8B | ~4 GB | ~4.5 GB | ~5.2 GB | ~8 GB | ~15 GB |
| 13-14B | ~7 GB | ~8 GB | ~9.5 GB | ~14 GB | ~27 GB |
| 27B | ~13.5 GB | ~16 GB | ~18.5 GB | ~27 GB | ~54 GB |
| 32B | ~16 GB | ~19 GB | ~22 GB | ~32 GB | ~64 GB |
| 70B | ~35 GB | ~40 GB | ~47 GB | ~70 GB | ~140 GB |
These numbers represent both the download size and the memory consumed by model weights alone. Ollama defaults to q4_K_M for most models — it is the best balance of quality, speed, and memory.
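The formula above is easy to sanity-check in code. A small sketch (the bits-per-weight values are approximations; `q4_K_M` averages roughly 4.5 bits):

```python
# Approximate GGUF size: (parameters in billions × bits per weight) / 8
BITS_PER_WEIGHT = {"q4_0": 4.0, "q4_K_M": 4.5, "q5_K_M": 5.5, "q8_0": 8.0, "f16": 16.0}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Rough on-disk / in-memory size of the model weights in GB."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(model_size_gb(8, "q4_K_M"))   # 4.5 — matches the ~4.5 GB table entry for 7-8B
print(model_size_gb(70, "q8_0"))    # 70.0 — a 70B model at 8-bit is about its parameter count in GB
```

A handy rule of thumb falls out of this: at `q8_0`, size in GB roughly equals the parameter count in billions; at 4-bit it is about half that.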
VRAM Requirements
Total VRAM usage is model weights + KV cache + overhead. The KV cache stores the context of the conversation and grows with context length. At a 4K context window, it is relatively small (~0.5-1.5 GB), but at longer contexts it can exceed the size of the model itself.
Practical totals at q4_K_M with ~4K context:
| Parameters | Weights | + KV Cache + Overhead | Total VRAM |
|---|---|---|---|
| 1B | ~0.7 GB | ~0.5 GB | ~1.5 GB |
| 3B | ~2 GB | ~0.5 GB | ~3 GB |
| 7-8B | ~4.5 GB | ~1 GB | ~6 GB |
| 13-14B | ~8 GB | ~1.5 GB | ~10 GB |
| 27B | ~16 GB | ~2 GB | ~18 GB |
| 32B | ~19 GB | ~2.5 GB | ~22 GB |
| 70B | ~40 GB | ~3 GB | ~43 GB |
Practical VRAM Guide
If you are wondering which models your hardware can run:
| Available VRAM | What Fits (q4_K_M) | Examples |
|---|---|---|
| 4 GB | Up to 3B | Llama 3.2 1B/3B, Phi-3 Mini |
| 6 GB | Up to 7B | Mistral 7B, Qwen 2.5 7B |
| 8 GB | Up to 8B | Llama 3.1 8B, Gemma 2 9B |
| 12 GB | Up to 13-14B | Phi-4 14B, Llama 2 13B |
| 16 GB | Up to 14B with generous context | RTX 4070 Ti / 4080 territory |
| 24 GB | Up to 32B | Qwen 2.5 32B, DeepSeek R1 32B |
| 48 GB | 70B models | Llama 3.1 70B, Qwen 2.5 72B |
Apple Silicon is a special case. Its unified memory is shared between CPU and GPU with no PCIe bottleneck, so a 32 GB Mac can run models that would require a $1500 NVIDIA GPU on a PC. A 64 GB Mac can run 70B models comfortably.
CPU vs. GPU Performance
The difference is dramatic. Here are rough token generation speeds for a 7-8B model at q4_K_M:
| Hardware | ~Tokens/sec | Feel |
|---|---|---|
| Laptop CPU (Intel i7) | 7—10 | Usable but slow |
| Desktop CPU (Ryzen 7) | 12—20 | Comfortable for chat |
| RTX 3060 (12 GB) | 50—70 | Smooth |
| RTX 4070 Ti (16 GB) | 50—80 | Smooth |
| RTX 4090 (24 GB) | 100—140 | Instant |
| Apple M1 Pro/Max | 25—40 | Good |
| Apple M3 Max | 45—55 | Very good |
| Apple M4 Max | 55—60 | Excellent |
As a rough reference: 2 tok/s feels like watching someone type slowly. 10 tok/s is comfortable for reading chat responses. 30+ tok/s is what you want for coding tools and streaming integrations.
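Those rates translate directly into wait times. A quick back-of-envelope, assuming a typical 500-token response:

```python
def seconds_for_response(tokens: int, tok_per_sec: float) -> float:
    """Time to generate a response at a given generation rate."""
    return tokens / tok_per_sec

for rate in (2, 10, 30, 100):
    # 2 tok/s -> 250 s; 10 -> 50 s; 30 -> ~17 s; 100 -> 5 s
    print(f"{rate:>3} tok/s -> {seconds_for_response(500, rate):.0f} s for a 500-token answer")
```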
Tip
For Apple Silicon, memory bandwidth matters more than chip generation. An M3 Max (400 GB/s) can outperform an M4 Pro (273 GB/s) for LLM inference because bandwidth is the bottleneck. When choosing a Mac for local LLMs, prioritize: (1) enough unified memory to fit the model, then (2) maximum memory bandwidth.
GPU vs. CPU Offloading
When a model does not fully fit in VRAM, Ollama automatically splits layers between GPU and CPU. You can see this in ollama ps:
```bash
$ ollama ps
NAME               ID              SIZE      PROCESSOR         UNTIL
llama3.1:8b        a23b46a1c3e2    6.3 GB    100% GPU          4 minutes from now
deepseek-r1:70b    a951a23b46a1    42 GB     78%/22% CPU/GPU   4 minutes from now
```

The PROCESSOR column tells you what is happening. 100% GPU is ideal. Any CPU/GPU split means significantly degraded performance — the GPU has to wait for the CPU to process its layers, and PCIe bandwidth (~16 GB/s) is a fraction of VRAM bandwidth (~900 GB/s on an RTX 4090).
Warning
A model that is 90% in VRAM and 10% on CPU will be dramatically slower than one at 100% GPU. If a model barely does not fit, it is often better to drop to a smaller model or a lower quantization rather than accept the split.
Context Length and Memory
Longer context windows consume more VRAM through the KV cache. This is where people often run out of memory unexpectedly:
| Context Length | KV Cache (8B model, f16) | Total VRAM with q4_K_M Weights |
|---|---|---|
| 4,096 | ~0.5 GB | ~6 GB |
| 16,384 | ~1.8 GB | ~7 GB |
| 32,768 | ~3.6 GB | ~9 GB |
| 65,536 | ~7.2 GB | ~12 GB |
| 131,072 | ~14.4 GB | ~20 GB |
At 128K context, the KV cache alone uses more memory than the model weights. If you need long contexts on limited VRAM, quantize the KV cache:
```bash
sudo systemctl edit ollama.service
```

```ini
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
```

This halves KV cache memory with negligible quality impact. Setting `q4_0` cuts it to a quarter.
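For intuition about where these KV-cache numbers come from, here is a rough estimate in Python. The dimensions assumed below (32 layers, 8 KV heads, head size 128) are Llama-3-8B-like and used for illustration only; exact figures vary per model:

```python
def kv_cache_gb(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache ≈ 2 (K and V) × layers × kv_heads × head_dim × context × element size."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

print(kv_cache_gb(4096))                       # 0.5 GiB at a 4K context (f16)
print(kv_cache_gb(131072))                     # 16.0 GiB at 128K, in the ballpark of the table
print(kv_cache_gb(131072, bytes_per_elem=1))   # 8.0 GiB with a q8_0 KV cache
```

The linear growth with `ctx_len` is the key takeaway: doubling the context doubles the cache, while quantizing the cache to `q8_0` halves the bytes per element.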
Checking Your GPU
To verify Ollama detected your GPU:
```bash
ollama run llama3.2:3b "hello"   # load a model
ollama ps                        # check the PROCESSOR column
```

For more detail, check the server logs:

```bash
journalctl -u ollama --no-pager | grep -i gpu
```

On NVIDIA systems, `nvidia-smi` shows real-time VRAM usage while a model is loaded.
Keeping Ollama Updated
If you installed via snap, updates happen automatically. You can also trigger one manually:
```bash
sudo snap refresh ollama
```

If you installed via the install script, re-run it (with the same inspect-first caveat from the installation section). It safely overwrites the existing binary while preserving your models:
```bash
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh
sh install-ollama.sh
```

Tip
When upgrading from a significantly older version, first remove stale libraries:
```bash
sudo rm -rf /usr/lib/ollama
```
On macOS, the app auto-detects updates — click the menubar icon and select “Restart to update.” On Windows, download and run the latest installer.
Models are stored separately from the Ollama binary, so they survive upgrades. Default locations:
- Linux: `/usr/share/ollama/.ollama/models`
- macOS: `~/.ollama/models`
- Windows: `C:\Users\%username%\.ollama\models`
Modelfiles
Modelfiles are to Ollama what Dockerfiles are to Docker: declarative configuration files that define a custom model. They let you set a system prompt, tune parameters, and seed conversation examples on top of a base model.
Syntax
| Instruction | Required | Description |
|---|---|---|
| `FROM` | Yes | Base model name or path to a GGUF file |
| `SYSTEM` | No | System prompt defining the model’s role and behavior |
| `PARAMETER` | No | Runtime parameter (temperature, context size, etc.) |
| `TEMPLATE` | No | Go template for prompt formatting |
| `ADAPTER` | No | Path to a LoRA fine-tuned adapter |
| `MESSAGE` | No | Seed conversation examples (user, assistant, system) |
| `LICENSE` | No | License text for the model |
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.8 | Creativity (0.0 = deterministic, 1.0+ = creative) |
| `num_ctx` | 2048 | Context window size in tokens |
| `top_k` | 40 | Limit token selection to top K candidates |
| `top_p` | 0.9 | Nucleus sampling threshold |
| `repeat_penalty` | 1.1 | Penalize repeated tokens |
| `num_predict` | -1 | Max tokens to generate (-1 = unlimited) |
| `stop` | — | Stop sequence(s) to halt generation |
| `seed` | 0 | Random seed (0 = random, set for reproducibility) |
Example: Focused Technical Assistant
Create a file called Modelfile:
```
FROM llama3.2
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
SYSTEM """You are an expert technical assistant. Provide clear,
accurate, and concise answers to programming and system
administration questions. Include code examples when helpful."""
```

Build and run it:
```bash
ollama create tech-assistant -f ./Modelfile
ollama run tech-assistant
```

Example: Creative Writing Model
```
FROM llama3.2
PARAMETER temperature 0.9
PARAMETER num_ctx 8192
SYSTEM """You are a creative writing assistant. Generate engaging,
imaginative prose with vivid descriptions."""
```

Example: Seeded Conversation
You can prime the model with example exchanges using MESSAGE:
```
FROM llama3.2
PARAMETER temperature 0.5
SYSTEM You are a helpful coding tutor who explains concepts with examples.
MESSAGE user How do I declare a variable in Python?
MESSAGE assistant In Python, you simply assign a value: `x = 42`. No type declaration needed — Python infers the type at runtime.
```

Inspecting Existing Modelfiles
To see the Modelfile behind any model:
```bash
ollama show --modelfile llama3.2
```

This is useful for understanding how official models are configured before building your own.
Tool-Calling Models
Some models support tool use (also called function calling), meaning the model can request that your application execute a function and return the result. This is how LLMs interact with filesystems, databases, APIs, and other external systems.
Which Models Support Tools?
Not all models support tool calling. Models that do include:
- `llama3.1` and `llama3.2`
- `qwen3` and `qwen2.5`
- `mistral` and `mistral-nemo`
- `gemma3`
- `command-r` and `command-r-plus`
You can check whether a model supports tools by looking at its page on ollama.com/library — models that support tools are tagged accordingly.
How Tool Calling Works
Tool calling is used through the /api/chat endpoint. You send a list of available tools (as JSON Schema function definitions) along with the conversation, and the model may respond with a tool_calls message instead of a regular text response. Your application then executes the function, sends the result back, and the model incorporates it into its answer.
Here is the flow:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What files are in /tmp?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_directory",
        "description": "List files in a directory",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {
              "type": "string",
              "description": "The directory path to list"
            }
          },
          "required": ["path"]
        }
      }
    }
  ],
  "stream": false
}'
```

If the model decides to use the tool, the response will contain a `tool_calls` array instead of a regular `content` field:
```json
{
  "message": {
    "role": "assistant",
    "tool_calls": [
      {
        "function": {
          "name": "list_directory",
          "arguments": { "path": "/tmp" }
        }
      }
    ]
  }
}
```

Your application executes the function, then sends the result back as a tool role message:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What files are in /tmp?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "list_directory", "arguments": {"path": "/tmp"}}}]},
    {"role": "tool", "content": "file1.txt\nfile2.log\ndata.csv"}
  ],
  "stream": false
}'
```

The model then responds with a natural language answer incorporating the tool result.
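In an application, the full request/execute/respond loop can be sketched with only the standard library. The `list_directory` implementation and the dispatch helper below are illustrative, not part of Ollama:

```python
import json
import os
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_directory",
        "description": "List files in a directory",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "The directory path to list"}},
            "required": ["path"],
        },
    },
}]

def post_chat(payload: dict) -> dict:
    """POST a JSON payload to Ollama's /api/chat and decode the response."""
    req = urllib.request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def execute_tool(name: str, arguments: dict) -> str:
    """Dispatch a tool call from the model to a local Python function."""
    if name == "list_directory":
        return "\n".join(sorted(os.listdir(arguments["path"])))
    raise ValueError(f"unknown tool: {name}")

def chat_with_tools(prompt: str, model: str = "llama3.2") -> str:
    messages = [{"role": "user", "content": prompt}]
    reply = post_chat({"model": model, "messages": messages,
                       "tools": TOOLS, "stream": False})["message"]
    if reply.get("tool_calls"):
        # Run each requested tool and feed the results back as 'tool' messages.
        messages.append(reply)
        for call in reply["tool_calls"]:
            fn = call["function"]
            messages.append({"role": "tool",
                             "content": execute_tool(fn["name"], fn["arguments"])})
        reply = post_chat({"model": model, "messages": messages, "stream": False})["message"]
    return reply["content"]

if __name__ == "__main__":
    print(chat_with_tools("What files are in /tmp?"))
```

In practice you would loop until the model stops requesting tools; this single-round version keeps the shape of the exchange visible.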
Warning
Tool calling requires careful prompt engineering with smaller models. If a model is not reliably generating tool calls, try a larger variant (e.g., move from 8B to 70B) or increase `num_ctx` to give the model more room to reason about the tool definitions.
Context Window Matters
Tool definitions are injected into the prompt and consume context tokens. If you define many tools or tools with complex schemas, you may need to increase num_ctx:
```
FROM llama3.2
PARAMETER num_ctx 8192
```

Applications like OpenCode and Open WebUI handle the tool-calling loop automatically — you only need to deal with it directly when building your own integrations.
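For your own integrations, per-request `options` are an alternative to a custom Modelfile. Parameters passed this way apply only to that call and override the model's defaults:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "hi"}],
  "options": {"num_ctx": 8192},
  "stream": false
}'
```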
Running Ollama as a Server
The Systemd Service
On Linux, the install script sets up Ollama as a systemd service that starts automatically:
```bash
sudo systemctl status ollama    # check status
sudo systemctl start ollama     # start
sudo systemctl stop ollama      # stop
sudo systemctl restart ollama   # restart
```

You can verify the server is running by hitting the root endpoint:

```bash
curl http://localhost:11434
# Ollama is running
```

Manual Server
If you are not using systemd, start the server yourself:
```bash
ollama serve
```

By default, Ollama binds to `127.0.0.1:11434` (localhost only).
Exposing Ollama to the Network
To let other machines or services on your LAN reach Ollama, change the bind address. For systemd:
```bash
sudo systemctl edit ollama.service
```

Add under `[Service]`:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then reload and restart:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

Warning
Setting `OLLAMA_HOST=0.0.0.0` exposes Ollama to your entire network. Only do this on trusted networks, or use a reverse proxy like Butler to add multi-user authentication (API keys, JWT, or OIDC), model-level authorization, and rate limiting.
For a manual server, just export the variable:
```bash
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```

Note

Shell-exported environment variables do not apply to systemd services. You must use `systemctl edit` for persistent configuration.
Environment Variables
These control server behavior. Set them via `systemctl edit` on Linux, `launchctl setenv` on macOS, or system environment variables on Windows:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_HOST` | 127.0.0.1:11434 | Bind address and port |
| `OLLAMA_MODELS` | OS-specific | Model storage directory |
| `OLLAMA_ORIGINS` | — | Allowed CORS origins (e.g., `*` for dev) |
| `OLLAMA_KEEP_ALIVE` | 5m | How long models stay loaded (-1 = forever, 0 = unload immediately) |
| `OLLAMA_NUM_PARALLEL` | 1 | Max parallel requests per model |
| `OLLAMA_MAX_LOADED_MODELS` | Auto | Max models loaded simultaneously |
| `OLLAMA_CONTEXT_LENGTH` | VRAM-dependent | Default context window (4K/32K/256K based on available VRAM) |
| `OLLAMA_FLASH_ATTENTION` | — | Set to 1 to enable flash attention (saves VRAM) |
| `OLLAMA_KV_CACHE_TYPE` | f16 | K/V cache quantization: q8_0 (half memory) or q4_0 (quarter) |
Serving Models for Other Local Services
If you have a project that expects an LLM API on a specific port, you can point it at Ollama. For example, a project at ../linkedin-copilot that expects an OpenAI-compatible API:
```bash
# Ensure Ollama is running
sudo systemctl start ollama

# Pull the model you want to serve
ollama pull llama3.2

# Point your application at Ollama's OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama   # any non-empty string works
```

Ollama’s `/v1` endpoint is a drop-in replacement for the OpenAI API, so most tools that support a configurable OpenAI base URL will work without modification.
Multiple Services, One Port
Ollama listens on a single port, but every API request specifies which model to use in the request body ("model": "llama3.2"). This means multiple applications can share the same Ollama server — each one simply requests a different model (or the same one). Ollama handles the multiplexing:
- It loads models into memory on demand and keeps them loaded for 5 minutes by default (configurable via `OLLAMA_KEEP_ALIVE`).
- It can hold multiple models in memory simultaneously (configurable via `OLLAMA_MAX_LOADED_MODELS`, which defaults to 3 per GPU).
- It can handle parallel requests to the same model (configurable via `OLLAMA_NUM_PARALLEL`).
Per-request parameters like temperature, num_ctx, and top_p are sent as part of each API call, so different services can use different settings against the same model without conflicting.
```bash
# Service A talks to mistral
curl http://localhost:11434/api/chat -d '{"model":"mistral","messages":[{"role":"user","content":"Summarize DNS in one sentence."}],"stream":false}'

# Service B talks to llama3.2 on the same port
curl http://localhost:11434/api/chat -d '{"model":"llama3.2","messages":[{"role":"user","content":"List three HTTP status codes and what they mean."}],"stream":false}'

# Both are served by the same Ollama process
```

Note
Ollama has no built-in access control. Any client that can reach the port can use any model with any parameters. If you need to restrict which services can access which models, Butler is an access-control reverse proxy purpose-built for Ollama — it adds multi-user authentication (API keys, JWT, OIDC federation), per-user model policies with rate limiting, and structured audit logging. Alternatively, a general-purpose reverse proxy (nginx, Caddy) can add basic authentication. This is another good reason to keep `OLLAMA_HOST` bound to `127.0.0.1` unless you have a specific reason to expose it.
To keep models loaded in memory and avoid cold-start latency when switching between them:
```bash
sudo systemctl edit ollama.service
```

```ini
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
```

The REST API
Ollama exposes a REST API on port 11434. The two most important endpoints:
Generate a Completion
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain DNS in one paragraph.",
  "stream": false
}'
```

Chat (Multi-Turn Conversation)
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a VPN?"}
  ],
  "stream": false
}'
```

Full Endpoint Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/generate | Text completion |
| POST | /api/chat | Chat completion (conversation) |
| POST | /api/embed | Generate embeddings |
| GET | /api/tags | List local models |
| POST | /api/show | Show model info |
| POST | /api/pull | Download a model |
| DELETE | /api/delete | Delete a model |
| POST | /api/create | Create a custom model |
| POST | /api/copy | Duplicate a model |
| GET | /api/ps | List running models |
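The `/api/embed` endpoint from the table is worth a quick example, since it underpins RAG pipelines. A sketch assuming `nomic-embed-text` is already pulled; the cosine-similarity helper is plain Python, not part of Ollama:

```python
import json
import math
import urllib.request

def embed(texts, model="nomic-embed-text", base="http://localhost:11434"):
    """Call Ollama's /api/embed endpoint; returns one vector per input text."""
    req = urllib.request.Request(
        f"{base}/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

if __name__ == "__main__":
    docs = ["Ollama runs LLMs locally.", "The weather is nice today."]
    query_vec, *doc_vecs = embed(["How do I run a model locally?"] + docs)
    scores = [cosine(query_vec, d) for d in doc_vecs]
    print(docs[scores.index(max(scores))])  # expect the Ollama sentence to rank first
```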
OpenAI-Compatible Endpoint
Ollama also serves an OpenAI-compatible API at /v1/chat/completions. This means you can use OpenAI client libraries with Ollama by changing the base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but unused
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Tip
Streaming is on by default for the native API. Set `"stream": false` for simpler scripting. The `/v1` OpenAI-compatible endpoint also supports streaming via SSE.
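With streaming left on, the native API emits newline-delimited JSON, one object per line. A minimal stdlib-only reader (the helper names here are illustrative):

```python
import json
import urllib.request

def stream_chat(prompt, model="llama3.2", base="http://localhost:11434"):
    """Yield response fragments as Ollama streams them (one JSON object per line)."""
    req = urllib.request.Request(
        f"{base}/api/chat",
        data=json.dumps({"model": model,
                         "messages": [{"role": "user", "content": prompt}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk["message"]["content"]

def parse_ndjson_chunks(lines):
    """Pure helper: extract content fragments from already-received NDJSON lines."""
    out = []
    for line in lines:
        chunk = json.loads(line)
        if not chunk.get("done"):
            out.append(chunk["message"]["content"])
    return out

if __name__ == "__main__":
    for fragment in stream_chat("Explain DNS in one sentence."):
        print(fragment, end="", flush=True)
    print()
```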
Open WebUI
Open WebUI gives you a ChatGPT-like web interface for your local Ollama models. It runs as a Docker container and stores all data locally.
Quick Start with Docker
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

Open http://localhost:3000 in your browser. The first account you create gets admin privileges.
The `--add-host` flag lets the container reach Ollama running on your host machine. The `-v` flag persists your chat history and settings across container restarts.
With NVIDIA GPU Passthrough
```bash
docker run -d -p 3000:8080 \
  --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda
```

Single-User Mode (No Login)
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e WEBUI_AUTH=False \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

Docker Compose
```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - '3000:8080'
    volumes:
      - open-webui:/app/backend/data
    extra_hosts:
      - 'host.docker.internal:host-gateway'
    restart: always

volumes:
  open-webui:
```

Updating Open WebUI
```bash
docker rm -f open-webui
docker pull ghcr.io/open-webui/open-webui:main
# Re-run the docker run command — the volume preserves your data
```

Without Docker
```bash
pip install open-webui
open-webui serve
```

Using Ollama with OpenCode
OpenCode is an open-source terminal-based AI coding assistant. You can connect it to Ollama for a fully local, private coding workflow.
Manual Configuration
Create or edit `~/.config/opencode/opencode.json`:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen2.5-coder:32b": {
          "name": "Qwen 2.5 Coder 32B"
        }
      }
    }
  }
}
```

Warning
OpenCode needs a generous context window. Ollama’s default context length is VRAM-dependent, but OpenCode works best at 64K or higher. Create a Modelfile with a larger `num_ctx`, or set the `OLLAMA_CONTEXT_LENGTH` environment variable.
Zero-Config with ollama launch
Ollama v0.15+ includes a launch command that handles all configuration automatically:
```bash
ollama launch                                 # interactive tool picker
ollama launch opencode                        # launch OpenCode directly
ollama launch opencode --model qwen3-coder    # specify model
```

This also works with other coding tools:
```bash
ollama launch claude --model qwen3
ollama launch codex --model llama3.2
```

Recommended Models for Coding
| Model | Context | Notes |
|---|---|---|
| `qwen3-coder` | 256K | Purpose-built for agentic coding |
| `qwen2.5-coder:32b` | 32K | Strong code generation, needs ~20 GB RAM |
| `qwen2.5-coder:7b` | 32K | Good balance for 16 GB machines |
| `deepseek-coder:33b` | 16K | Strong at code completion |
Complete Removal
If you need to fully uninstall Ollama and clean up everything it left behind:
If Installed via Snap
```bash
sudo snap remove ollama
```

This removes the binary, the service, and any snap-managed data. You will still need to manually remove any models stored outside the snap (check `~/.ollama/` and `/usr/share/ollama/`).
If Installed via the Install Script
1. Stop and Remove the Service
```bash
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo systemctl daemon-reload
```

2. Remove the Binary and Libraries
```bash
sudo rm $(which ollama)
sudo rm -rf /usr/lib/ollama
```

3. Remove Downloaded Models and Data
```bash
sudo rm -rf /usr/share/ollama
```

If you changed the model storage location via `OLLAMA_MODELS`, remove that directory instead.
4. Remove the Service User and Group
The install script creates a dedicated ollama user and group:
```bash
sudo userdel ollama
sudo groupdel ollama
```

5. Remove User-Level Configuration
```bash
rm -rf ~/.ollama
```

After these steps, Ollama is completely gone from your system.
Building on Ollama
Once Ollama is running, it becomes the AI backend for a growing ecosystem of local-first tools. Two open-source projects that build directly on top of it:
- Solux — a workflow automation engine that chains inputs, transforms, and Ollama LLM steps into YAML-defined pipelines. Summarize webpages, transcribe podcasts, classify documents, and push results to Slack, Obsidian, or a vector store — all running locally. Install with `pip install solux` and run `solux init` to connect to your Ollama instance.
- Butler — an access-control reverse proxy for Ollama. If you share an Ollama server across multiple services or a homelab, Butler adds multi-user authentication (API keys, JWT standalone, OIDC federation), per-user model authorization and rate limiting, input filtering, and Prometheus observability. One Go binary, one YAML config, zero changes to Ollama or your clients.
For a hands-on example of using Ollama’s embedding and chat APIs together, see Build a Local RAG Pipeline with Ollama and ChromaDB.
Quick Reference
Cheat Sheet
```bash
# Install / update (inspect the script first — see Installation section)
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh
sh install-ollama.sh

# Run a model
ollama run llama3.2

# Manage models
ollama pull mistral
ollama list
ollama show llama3.2
ollama ps
ollama rm mistral

# Custom model from Modelfile
ollama create my-model -f ./Modelfile

# Server management (Linux)
sudo systemctl status ollama
sudo systemctl restart ollama
sudo systemctl edit ollama.service   # set env vars

# API
curl http://localhost:11434/api/chat -d '{"model":"llama3.2","messages":[{"role":"user","content":"hi"}],"stream":false}'

# Open WebUI
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

# OpenCode
ollama launch opencode
```