Ollama is the easiest way to run large language models locally. It handles downloading, quantization, GPU acceleration, and serving models behind a simple CLI and REST API. Think of it as Docker for LLMs.
This tutorial covers everything from installation through advanced configuration, so you can use Ollama as a local AI backend for coding tools, chat interfaces, and your own projects.
Installation
Linux
The officially recommended method is the install script:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Warning
Piping scripts from the internet into your shell is a security risk — you are giving an external party root access to your machine. Always inspect the script first:
```bash
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh   # read it, understand what it does
sh install-ollama.sh
```

The script calls `sudo` internally only where it needs elevated privileges (installing the binary, creating a systemd service, etc.), so do not run it with `sudo` — that would give the entire script root access unnecessarily. It is open source on GitHub if you want to review it before downloading.
Alternatively, Ollama is available as a snap — no script piping required:
```bash
sudo snap install ollama
```

The snap is published by the Ollama team and handles updates automatically through the snap daemon.
Verify the installation either way:
```bash
ollama --version
```

macOS and Windows
On macOS, download the .zip from ollama.com/download and drag the app into /Applications. The CLI is installed automatically.
On Windows, download and run the .exe installer from the same page.
Basic CLI Usage
Run a model interactively (downloading it first if needed):
```bash
ollama run llama3.2
```

This drops you into a chat session. Type your prompt and press Enter. Use `"""` to begin and end multiline input. Type `/bye` to exit.
Here is the full command reference:
| Command | Description |
|---|---|
| `ollama run <model>` | Download (if needed) and run a model interactively |
| `ollama pull <model>` | Download a model without running it |
| `ollama list` | List all locally downloaded models |
| `ollama show <model>` | Show model info (architecture, quantization, template, license) |
| `ollama ps` | List models currently loaded in memory |
| `ollama stop <model>` | Unload a model from memory |
| `ollama rm <model>` | Delete a downloaded model |
| `ollama cp <src> <dest>` | Duplicate a model |
| `ollama create <name> -f <Modelfile>` | Create a custom model from a Modelfile |
| `ollama serve` | Start the Ollama server manually |
Inside an interactive session, you can also inspect the running model:
| Command | Description |
|---|---|
| `/show info` | Architecture and parameters |
| `/show modelfile` | The Modelfile used to build this model |
| `/show system` | System prompt |
| `/show template` | Prompt template |
| `/show license` | License text |
Managing Models
Pulling Models
Models follow a `name:tag` naming convention. The tag defaults to `latest` if omitted:
```bash
ollama pull llama3.2                    # default tag (typically q4_K_M quantization)
ollama pull llama3.2:1b                 # 1B parameter variant
ollama pull llama3:8b-instruct-q8_0     # specific quantization
```

Tags encode the parameter count, variant, and quantization level. The quantization suffixes control the tradeoff between model size and quality:
- `q4_0` — 4-bit, smallest and fastest
- `q4_K_M` — Ollama’s default for most models, best balance of quality and size
- `q5_K_M` — slightly better quality than `q4_K_M` if you have VRAM headroom
- `q8_0` — 8-bit, near-lossless quality, roughly double the size of q4

The K-quant variants (`_K_S`, `_K_M`, `_K_L`) use smarter quantization that preserves more accuracy at similar file sizes.
Inspecting and Removing Models
```bash
ollama list             # see all downloaded models with sizes
ollama show llama3.2    # architecture, context length, quantization, etc.
ollama ps               # which models are loaded, GPU/CPU split, memory usage
ollama rm llama3.2      # delete a model
```

Popular Models
Here are some worth knowing about:
| Model | Sizes | Notes |
|---|---|---|
| `llama3.2` | 1B, 3B | Compact Meta models, good for constrained hardware |
| `llama3.1` | 8B, 70B, 405B | Meta’s flagship, most-downloaded model on Ollama |
| `gemma3` | 1B–27B | Google, multimodal (text + vision) |
| `qwen3` | 0.6B–235B | Alibaba, dense and mixture-of-experts variants |
| `mistral` | 7B | Mistral AI, fast and capable for its size |
| `deepseek-r1` | 1.5B–671B | Strong reasoning, chain-of-thought |
| `phi4` | 14B | Microsoft, punches above its weight |
| `qwen2.5-coder` | 1.5B–32B | Code-specialized |
| `nomic-embed-text` | — | Embedding model for RAG pipelines |
| `llava` | 7B, 13B, 34B | Vision-language model |
Browse the full library at ollama.com/library.
Hardware, Model Sizes, and Performance
What “7B” Means
When a model is described as “7B” or “70B”, the number refers to billions of parameters — the learned weights that make up the neural network. More parameters generally means a more capable model, but also more memory and compute.
Think of it as a rough measure of the model’s “brain size”:
- 1B—3B: Can follow simple instructions, summarize text, and answer basic questions. Prone to hallucination and poor at complex reasoning. Useful for lightweight tasks, autocomplete, or hardware-constrained environments.
- 7B—8B: The sweet spot for most local use. Good at conversation, coding assistance, and general knowledge. Runs comfortably on a single consumer GPU.
- 13B—14B: Noticeably smarter than 7-8B. Better at nuanced reasoning, longer documents, and following complex instructions.
- 27B—32B: Approaching the quality of cloud-hosted models for many tasks. Requires a high-end GPU or Apple Silicon with 32+ GB of memory.
- 70B+: Near frontier-model quality. Requires professional hardware, multi-GPU setups, or a Mac with 64+ GB of unified memory.
Quantization and File Size
You do not run models at full precision. Ollama uses quantization to compress model weights from 16-bit floats down to 4-8 bits, dramatically reducing file size and memory usage with surprisingly little quality loss.
The formula is straightforward: size in GB ≈ (parameters in billions × bits per weight) / 8. Here is what that looks like in practice:
| Parameters | q4_0 (4-bit) | q4_K_M (~4.5-bit) | q5_K_M (~5.5-bit) | q8_0 (8-bit) | f16 (16-bit) |
|---|---|---|---|---|---|
| 1B | ~0.6 GB | ~0.7 GB | ~0.8 GB | ~1.1 GB | ~2 GB |
| 3B | ~1.7 GB | ~2.0 GB | ~2.3 GB | ~3.2 GB | ~6 GB |
| 7-8B | ~4 GB | ~4.5 GB | ~5.2 GB | ~8 GB | ~15 GB |
| 13-14B | ~7 GB | ~8 GB | ~9.5 GB | ~14 GB | ~27 GB |
| 27B | ~13.5 GB | ~16 GB | ~18.5 GB | ~27 GB | ~54 GB |
| 32B | ~16 GB | ~19 GB | ~22 GB | ~32 GB | ~64 GB |
| 70B | ~35 GB | ~40 GB | ~47 GB | ~70 GB | ~140 GB |
These numbers represent both the download size and the memory consumed by model weights alone. Ollama defaults to q4_K_M for most models — it is the best balance of quality, speed, and memory.
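The formula above is easy to sanity-check in code. A small sketch (the bits-per-weight values are approximations; `q4_K_M` averages roughly 4.5 bits):

```python
# Approximate GGUF size: (parameters in billions × bits per weight) / 8
BITS_PER_WEIGHT = {"q4_0": 4.0, "q4_K_M": 4.5, "q5_K_M": 5.5, "q8_0": 8.0, "f16": 16.0}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Rough on-disk / in-memory size of the model weights in GB."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(model_size_gb(8, "q4_K_M"))   # 4.5 — matches the ~4.5 GB table entry for 7-8B
print(model_size_gb(70, "q8_0"))    # 70.0 — a 70B model at 8-bit is about its parameter count in GB
```

A handy rule of thumb falls out of this: at `q8_0`, size in GB roughly equals the parameter count in billions; at 4-bit it is about half that.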
VRAM Requirements
Total VRAM usage is model weights + KV cache + overhead. The KV cache stores the context of the conversation and grows with context length. At a 4K context window, it is relatively small (~0.5-1.5 GB), but at longer contexts it can exceed the size of the model itself.
Practical totals at q4_K_M with ~4K context:
| Parameters | Weights | + KV Cache + Overhead | Total VRAM |
|---|---|---|---|
| 1B | ~0.7 GB | ~0.5 GB | ~1.5 GB |
| 3B | ~2 GB | ~0.5 GB | ~3 GB |
| 7-8B | ~4.5 GB | ~1 GB | ~6 GB |
| 13-14B | ~8 GB | ~1.5 GB | ~10 GB |
| 27B | ~16 GB | ~2 GB | ~18 GB |
| 32B | ~19 GB | ~2.5 GB | ~22 GB |
| 70B | ~40 GB | ~3 GB | ~43 GB |
Practical VRAM Guide
If you are wondering which models your hardware can run:
| Available VRAM | What Fits (q4_K_M) | Examples |
|---|---|---|
| 4 GB | Up to 3B | Llama 3.2 1B/3B, Phi-3 Mini |
| 6 GB | Up to 7B | Mistral 7B, Qwen 2.5 7B |
| 8 GB | Up to 8B | Llama 3.1 8B, Gemma 2 9B |
| 12 GB | Up to 13-14B | Phi-4 14B, Llama 2 13B |
| 16 GB | Up to 14B with generous context | RTX 4070 Ti / 4080 territory |
| 24 GB | Up to 32B | Qwen 2.5 32B, DeepSeek R1 32B |
| 48 GB | 70B models | Llama 3.1 70B, Qwen 2.5 72B |
Apple Silicon is a special case. Its unified memory is shared between CPU and GPU with no PCIe bottleneck, so a 32 GB Mac can run models that would require a $1500 NVIDIA GPU on a PC. A 64 GB Mac can run 70B models comfortably.
CPU vs. GPU Performance
The difference is dramatic. Here are rough token generation speeds for a 7-8B model at q4_K_M:
| Hardware | ~Tokens/sec | Feel |
|---|---|---|
| Laptop CPU (Intel i7) | 7—10 | Usable but slow |
| Desktop CPU (Ryzen 7) | 12—20 | Comfortable for chat |
| RTX 3060 (12 GB) | 50—70 | Smooth |
| RTX 4070 Ti (16 GB) | 50—80 | Smooth |
| RTX 4090 (24 GB) | 100—140 | Instant |
| Apple M1 Pro/Max | 25—40 | Good |
| Apple M3 Max | 45—55 | Very good |
| Apple M4 Max | 55—60 | Excellent |
As a rough reference: 2 tok/s feels like watching someone type slowly. 10 tok/s is comfortable for reading chat responses. 30+ tok/s is what you want for coding tools and streaming integrations.
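Those rates translate directly into wait times. A quick back-of-envelope, assuming a typical 500-token response:

```python
def seconds_for_response(tokens: int, tok_per_sec: float) -> float:
    """Time to generate a response at a given generation rate."""
    return tokens / tok_per_sec

for rate in (2, 10, 30, 100):
    # 2 tok/s -> 250 s; 10 -> 50 s; 30 -> ~17 s; 100 -> 5 s
    print(f"{rate:>3} tok/s -> {seconds_for_response(500, rate):.0f} s for a 500-token answer")
```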
Tip
For Apple Silicon, memory bandwidth matters more than chip generation. An M3 Max (400 GB/s) can outperform an M4 Pro (273 GB/s) for LLM inference because bandwidth is the bottleneck. When choosing a Mac for local LLMs, prioritize: (1) enough unified memory to fit the model, then (2) maximum memory bandwidth.
GPU vs. CPU Offloading
When a model does not fully fit in VRAM, Ollama automatically splits layers between GPU and CPU. You can see this in ollama ps:
```bash
$ ollama ps
NAME               ID              SIZE      PROCESSOR         UNTIL
llama3.1:8b        a23b46a1c3e2    6.3 GB    100% GPU          4 minutes from now
deepseek-r1:70b    a951a23b46a1    42 GB     78%/22% CPU/GPU   4 minutes from now
```

The PROCESSOR column tells you what is happening. 100% GPU is ideal. Any CPU/GPU split means significantly degraded performance — the GPU has to wait for the CPU to process its layers, and PCIe bandwidth (~16 GB/s) is a fraction of VRAM bandwidth (~900 GB/s on an RTX 4090).
Warning
A model that is 90% in VRAM and 10% on CPU will be dramatically slower than one at 100% GPU. If a model barely does not fit, it is often better to drop to a smaller model or a lower quantization rather than accept the split.
Context Length and Memory
Longer context windows consume more VRAM through the KV cache. This is where people often run out of memory unexpectedly:
| Context Length | KV Cache (8B model, f16) | Total VRAM with q4_K_M Weights |
|---|---|---|
| 4,096 | ~0.5 GB | ~6 GB |
| 16,384 | ~1.8 GB | ~7 GB |
| 32,768 | ~3.6 GB | ~9 GB |
| 65,536 | ~7.2 GB | ~12 GB |
| 131,072 | ~14.4 GB | ~20 GB |
At 128K context, the KV cache alone uses more memory than the model weights. If you need long contexts on limited VRAM, quantize the KV cache:
```bash
sudo systemctl edit ollama.service
```

```ini
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
```

This halves KV cache memory with negligible quality impact. Setting `q4_0` cuts it to a quarter.
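For intuition about where these KV-cache numbers come from, here is a rough estimate in Python. The dimensions assumed below (32 layers, 8 KV heads, head size 128) are Llama-3-8B-like and used for illustration only; exact figures vary per model:

```python
def kv_cache_gb(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache ≈ 2 (K and V) × layers × kv_heads × head_dim × context × element size."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

print(kv_cache_gb(4096))                       # 0.5 GiB at a 4K context (f16)
print(kv_cache_gb(131072))                     # 16.0 GiB at 128K, in the ballpark of the table
print(kv_cache_gb(131072, bytes_per_elem=1))   # 8.0 GiB with a q8_0 KV cache
```

The linear growth with `ctx_len` is the key takeaway: doubling the context doubles the cache, while quantizing the cache to `q8_0` halves the bytes per element.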
Checking Your GPU
To verify Ollama detected your GPU:
```bash
ollama run llama3.2:3b "hello"   # load a model
ollama ps                        # check the PROCESSOR column
```

For more detail, check the server logs:

```bash
journalctl -u ollama --no-pager | grep -i gpu
```

On NVIDIA systems, `nvidia-smi` shows real-time VRAM usage while a model is loaded.
Keeping Ollama Updated
If you installed via snap, updates happen automatically. You can also trigger one manually:
```bash
sudo snap refresh ollama
```

If you installed via the install script, re-run it (with the same inspect-first caveat from the installation section). It safely overwrites the existing binary while preserving your models:
```bash
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh
sh install-ollama.sh
```

Tip
When upgrading from a significantly older version, first remove stale libraries:
```bash
sudo rm -rf /usr/lib/ollama
```
On macOS, the app auto-detects updates — click the menubar icon and select “Restart to update.” On Windows, download and run the latest installer.
Models are stored separately from the Ollama binary, so they survive upgrades. Default locations:
- Linux: `/usr/share/ollama/.ollama/models`
- macOS: `~/.ollama/models`
- Windows: `C:\Users\%username%\.ollama\models`
Modelfiles
Modelfiles are to Ollama what Dockerfiles are to Docker: declarative configuration files that define a custom model. They let you set a system prompt, tune parameters, and seed conversation examples on top of a base model.
Syntax
| Instruction | Required | Description |
|---|---|---|
| `FROM` | Yes | Base model name or path to a GGUF file |
| `SYSTEM` | No | System prompt defining the model’s role and behavior |
| `PARAMETER` | No | Runtime parameter (temperature, context size, etc.) |
| `TEMPLATE` | No | Go template for prompt formatting |
| `ADAPTER` | No | Path to a LoRA fine-tuned adapter |
| `MESSAGE` | No | Seed conversation examples (user, assistant, system) |
| `LICENSE` | No | License text for the model |
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.8 | Creativity (0.0 = deterministic, 1.0+ = creative) |
| `num_ctx` | 2048 | Context window size in tokens |
| `top_k` | 40 | Limit token selection to top K candidates |
| `top_p` | 0.9 | Nucleus sampling threshold |
| `repeat_penalty` | 1.1 | Penalize repeated tokens |
| `num_predict` | -1 | Max tokens to generate (-1 = unlimited) |
| `stop` | — | Stop sequence(s) to halt generation |
| `seed` | 0 | Random seed (0 = random, set for reproducibility) |
Example: Focused Technical Assistant
Create a file called Modelfile:
```
FROM llama3.2
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
SYSTEM """You are an expert technical assistant. Provide clear,
accurate, and concise answers to programming and system
administration questions. Include code examples when helpful."""
```

Build and run it:
```bash
ollama create tech-assistant -f ./Modelfile
ollama run tech-assistant
```

Example: Creative Writing Model
```
FROM llama3.2
PARAMETER temperature 0.9
PARAMETER num_ctx 8192
SYSTEM """You are a creative writing assistant. Generate engaging,
imaginative prose with vivid descriptions."""
```

Example: Seeded Conversation
You can prime the model with example exchanges using MESSAGE:
```
FROM llama3.2
PARAMETER temperature 0.5
SYSTEM You are a helpful coding tutor who explains concepts with examples.
MESSAGE user How do I declare a variable in Python?
MESSAGE assistant In Python, you simply assign a value: `x = 42`. No type declaration needed — Python infers the type at runtime.
```

Inspecting Existing Modelfiles
To see the Modelfile behind any model:
```bash
ollama show --modelfile llama3.2
```

This is useful for understanding how official models are configured before building your own.
Tool-Calling Models
Some models support tool use (also called function calling), meaning the model can request that your application execute a function and return the result. This is how LLMs interact with filesystems, databases, APIs, and other external systems.
Which Models Support Tools?
Not all models support tool calling. Models that do include:
- `llama3.1` and `llama3.2`
- `qwen3` and `qwen2.5`
- `mistral` and `mistral-nemo`
- `gemma3`
- `command-r` and `command-r-plus`
You can check whether a model supports tools by looking at its page on ollama.com/library — models that support tools are tagged accordingly.
How Tool Calling Works
Tool calling is used through the /api/chat endpoint. You send a list of available tools (as JSON Schema function definitions) along with the conversation, and the model may respond with a tool_calls message instead of a regular text response. Your application then executes the function, sends the result back, and the model incorporates it into its answer.
Here is the flow:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What files are in /tmp?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_directory",
        "description": "List files in a directory",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {
              "type": "string",
              "description": "The directory path to list"
            }
          },
          "required": ["path"]
        }
      }
    }
  ],
  "stream": false
}'
```

If the model decides to use the tool, the response will contain a `tool_calls` array instead of a regular `content` field:
```json
{
  "message": {
    "role": "assistant",
    "tool_calls": [
      {
        "function": {
          "name": "list_directory",
          "arguments": { "path": "/tmp" }
        }
      }
    ]
  }
}
```

Your application executes the function, then sends the result back as a tool role message:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What files are in /tmp?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "list_directory", "arguments": {"path": "/tmp"}}}]},
    {"role": "tool", "content": "file1.txt\nfile2.log\ndata.csv"}
  ],
  "stream": false
}'
```

The model then responds with a natural language answer incorporating the tool result.
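In an application, the full request/execute/respond loop can be sketched with only the standard library. The `list_directory` implementation and the dispatch helper below are illustrative, not part of Ollama:

```python
import json
import os
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_directory",
        "description": "List files in a directory",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "The directory path to list"}},
            "required": ["path"],
        },
    },
}]

def post_chat(payload: dict) -> dict:
    """POST a JSON payload to Ollama's /api/chat and decode the response."""
    req = urllib.request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def execute_tool(name: str, arguments: dict) -> str:
    """Dispatch a tool call from the model to a local Python function."""
    if name == "list_directory":
        return "\n".join(sorted(os.listdir(arguments["path"])))
    raise ValueError(f"unknown tool: {name}")

def chat_with_tools(prompt: str, model: str = "llama3.2") -> str:
    messages = [{"role": "user", "content": prompt}]
    reply = post_chat({"model": model, "messages": messages,
                       "tools": TOOLS, "stream": False})["message"]
    if reply.get("tool_calls"):
        # Run each requested tool and feed the results back as 'tool' messages.
        messages.append(reply)
        for call in reply["tool_calls"]:
            fn = call["function"]
            messages.append({"role": "tool",
                             "content": execute_tool(fn["name"], fn["arguments"])})
        reply = post_chat({"model": model, "messages": messages, "stream": False})["message"]
    return reply["content"]

if __name__ == "__main__":
    print(chat_with_tools("What files are in /tmp?"))
```

In practice you would loop until the model stops requesting tools; this single-round version keeps the shape of the exchange visible.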
Warning
Tool calling requires careful prompt engineering with smaller models. If a model is not reliably generating tool calls, try a larger variant (e.g., move from 8B to 70B) or increase `num_ctx` to give the model more room to reason about the tool definitions.
Context Window Matters
Tool definitions are injected into the prompt and consume context tokens. If you define many tools or tools with complex schemas, you may need to increase num_ctx:
```
FROM llama3.2
PARAMETER num_ctx 8192
```

Applications like OpenCode and Open WebUI handle the tool-calling loop automatically — you only need to deal with it directly when building your own integrations.
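For your own integrations, per-request `options` are an alternative to a custom Modelfile. Parameters passed this way apply only to that call and override the model's defaults:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "hi"}],
  "options": {"num_ctx": 8192},
  "stream": false
}'
```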
Running Ollama as a Server
The Systemd Service
On Linux, the install script sets up Ollama as a systemd service that starts automatically:
```bash
sudo systemctl status ollama    # check status
sudo systemctl start ollama     # start
sudo systemctl stop ollama      # stop
sudo systemctl restart ollama   # restart
```

You can verify the server is running by hitting the root endpoint:

```bash
curl http://localhost:11434
# Ollama is running
```

Manual Server
If you are not using systemd, start the server yourself:
```bash
ollama serve
```

By default, Ollama binds to `127.0.0.1:11434` (localhost only).
Exposing Ollama to the Network
To let other machines or services on your LAN reach Ollama, change the bind address. For systemd:
```bash
sudo systemctl edit ollama.service
```

Add under `[Service]`:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then reload and restart:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

Warning
Setting `OLLAMA_HOST=0.0.0.0` exposes Ollama to your entire network. Only do this on trusted networks, or use a reverse proxy like Butler to add multi-user authentication (API keys, JWT, or OIDC), model-level authorization, and rate limiting.
For a manual server, just export the variable:
```bash
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```

Note

Shell-exported environment variables do not apply to systemd services. You must use `systemctl edit` for persistent configuration.
Environment Variables
These control server behavior. Set them via `systemctl edit` on Linux, `launchctl setenv` on macOS, or system environment variables on Windows:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_HOST` | 127.0.0.1:11434 | Bind address and port |
| `OLLAMA_MODELS` | OS-specific | Model storage directory |
| `OLLAMA_ORIGINS` | — | Allowed CORS origins (e.g., `*` for dev) |
| `OLLAMA_KEEP_ALIVE` | 5m | How long models stay loaded (-1 = forever, 0 = unload immediately) |
| `OLLAMA_NUM_PARALLEL` | 1 | Max parallel requests per model |
| `OLLAMA_MAX_LOADED_MODELS` | Auto | Max models loaded simultaneously |
| `OLLAMA_CONTEXT_LENGTH` | VRAM-dependent | Default context window (4K/32K/256K based on available VRAM) |
| `OLLAMA_FLASH_ATTENTION` | — | Set to 1 to enable flash attention (saves VRAM) |
| `OLLAMA_KV_CACHE_TYPE` | f16 | K/V cache quantization: q8_0 (half memory) or q4_0 (quarter) |
Serving Models for Other Local Services
If you have a project that expects an LLM API on a specific port, you can point it at Ollama. For example, a project at ../linkedin-copilot that expects an OpenAI-compatible API:
```bash
# Ensure Ollama is running
sudo systemctl start ollama

# Pull the model you want to serve
ollama pull llama3.2

# Point your application at Ollama's OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama   # any non-empty string works
```

Ollama’s `/v1` endpoint is a drop-in replacement for the OpenAI API, so most tools that support a configurable OpenAI base URL will work without modification.
Multiple Services, One Port
Ollama listens on a single port, but every API request specifies which model to use in the request body ("model": "llama3.2"). This means multiple applications can share the same Ollama server — each one simply requests a different model (or the same one). Ollama handles the multiplexing:
- It loads models into memory on demand and keeps them loaded for 5 minutes by default (configurable via `OLLAMA_KEEP_ALIVE`).
- It can hold multiple models in memory simultaneously (configurable via `OLLAMA_MAX_LOADED_MODELS`, which defaults to 3 per GPU).
- It can handle parallel requests to the same model (configurable via `OLLAMA_NUM_PARALLEL`).
Per-request parameters like temperature, num_ctx, and top_p are sent as part of each API call, so different services can use different settings against the same model without conflicting.
```bash
# Service A talks to mistral
curl http://localhost:11434/api/chat -d '{"model":"mistral","messages":[{"role":"user","content":"Summarize DNS in one sentence."}],"stream":false}'

# Service B talks to llama3.2 on the same port
curl http://localhost:11434/api/chat -d '{"model":"llama3.2","messages":[{"role":"user","content":"List three HTTP status codes and what they mean."}],"stream":false}'

# Both are served by the same Ollama process
```

Note
Ollama has no built-in access control. Any client that can reach the port can use any model with any parameters. If you need to restrict which services can access which models, Butler is an access-control reverse proxy purpose-built for Ollama — it adds multi-user authentication (API keys, JWT, OIDC federation), per-user model policies with rate limiting, and structured audit logging. Alternatively, a general-purpose reverse proxy (nginx, Caddy) can add basic authentication. This is another good reason to keep `OLLAMA_HOST` bound to `127.0.0.1` unless you have a specific reason to expose it.
To keep models loaded in memory and avoid cold-start latency when switching between them:
```bash
sudo systemctl edit ollama.service
```

```ini
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
```

The REST API
Ollama exposes a REST API on port 11434. The two most important endpoints:
Generate a Completion
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain DNS in one paragraph.",
  "stream": false
}'
```

Chat (Multi-Turn Conversation)
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a VPN?"}
  ],
  "stream": false
}'
```

Full Endpoint Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/generate | Text completion |
| POST | /api/chat | Chat completion (conversation) |
| POST | /api/embed | Generate embeddings |
| GET | /api/tags | List local models |
| POST | /api/show | Show model info |
| POST | /api/pull | Download a model |
| DELETE | /api/delete | Delete a model |
| POST | /api/create | Create a custom model |
| POST | /api/copy | Duplicate a model |
| GET | /api/ps | List running models |
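The `/api/embed` endpoint from the table is worth a quick example, since it underpins RAG pipelines. A sketch assuming `nomic-embed-text` is already pulled; the cosine-similarity helper is plain Python, not part of Ollama:

```python
import json
import math
import urllib.request

def embed(texts, model="nomic-embed-text", base="http://localhost:11434"):
    """Call Ollama's /api/embed endpoint; returns one vector per input text."""
    req = urllib.request.Request(
        f"{base}/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

if __name__ == "__main__":
    docs = ["Ollama runs LLMs locally.", "The weather is nice today."]
    query_vec, *doc_vecs = embed(["How do I run a model locally?"] + docs)
    scores = [cosine(query_vec, d) for d in doc_vecs]
    print(docs[scores.index(max(scores))])  # expect the Ollama sentence to rank first
```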
OpenAI-Compatible Endpoint
Ollama also serves an OpenAI-compatible API at /v1/chat/completions. This means you can use OpenAI client libraries with Ollama by changing the base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but unused
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Tip
Streaming is on by default for the native API. Set `"stream": false` for simpler scripting. The `/v1` OpenAI-compatible endpoint also supports streaming via SSE.
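With streaming left on, the native API emits newline-delimited JSON, one object per line. A minimal stdlib-only reader (the helper names here are illustrative):

```python
import json
import urllib.request

def stream_chat(prompt, model="llama3.2", base="http://localhost:11434"):
    """Yield response fragments as Ollama streams them (one JSON object per line)."""
    req = urllib.request.Request(
        f"{base}/api/chat",
        data=json.dumps({"model": model,
                         "messages": [{"role": "user", "content": prompt}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk["message"]["content"]

def parse_ndjson_chunks(lines):
    """Pure helper: extract content fragments from already-received NDJSON lines."""
    out = []
    for line in lines:
        chunk = json.loads(line)
        if not chunk.get("done"):
            out.append(chunk["message"]["content"])
    return out

if __name__ == "__main__":
    for fragment in stream_chat("Explain DNS in one sentence."):
        print(fragment, end="", flush=True)
    print()
```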
Open WebUI
Open WebUI gives you a ChatGPT-like web interface for your local Ollama models. It runs as a Docker container and stores all data locally.
Quick Start with Docker
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

Open http://localhost:3000 in your browser. The first account you create gets admin privileges.
The `--add-host` flag lets the container reach Ollama running on your host machine. The `-v` flag persists your chat history and settings across container restarts.
With NVIDIA GPU Passthrough
```bash
docker run -d -p 3000:8080 \
  --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda
```

Single-User Mode (No Login)
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e WEBUI_AUTH=False \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

Docker Compose
```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - '3000:8080'
    volumes:
      - open-webui:/app/backend/data
    extra_hosts:
      - 'host.docker.internal:host-gateway'
    restart: always

volumes:
  open-webui:
```

Updating Open WebUI
```bash
docker rm -f open-webui
docker pull ghcr.io/open-webui/open-webui:main
# Re-run the docker run command — the volume preserves your data
```

Without Docker
```bash
pip install open-webui
open-webui serve
```

Using Ollama with OpenCode
OpenCode is an open-source terminal-based AI coding assistant. You can connect it to Ollama for a fully local, private coding workflow.
Manual Configuration
Create or edit `~/.config/opencode/opencode.json`:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen2.5-coder:32b": {
          "name": "Qwen 2.5 Coder 32B"
        }
      }
    }
  }
}
```

Warning
OpenCode needs a generous context window. Ollama’s default context length is VRAM-dependent, but OpenCode works best at 64K or higher. Create a Modelfile with a larger `num_ctx`, or set the `OLLAMA_CONTEXT_LENGTH` environment variable.
Zero-Config with ollama launch
Ollama v0.15+ includes a launch command that handles all configuration automatically:
```bash
ollama launch                                 # interactive tool picker
ollama launch opencode                        # launch OpenCode directly
ollama launch opencode --model qwen3-coder    # specify model
```

This also works with other coding tools:
```bash
ollama launch claude --model qwen3
ollama launch codex --model llama3.2
```

Recommended Models for Coding
| Model | Context | Notes |
|---|---|---|
| `qwen3-coder` | 256K | Purpose-built for agentic coding |
| `qwen2.5-coder:32b` | 32K | Strong code generation, needs ~20 GB RAM |
| `qwen2.5-coder:7b` | 32K | Good balance for 16 GB machines |
| `deepseek-coder:33b` | 16K | Strong at code completion |
Complete Removal
If you need to fully uninstall Ollama and clean up everything it left behind:
If Installed via Snap
```bash
sudo snap remove ollama
```

This removes the binary, the service, and any snap-managed data. You will still need to manually remove any models stored outside the snap (check `~/.ollama/` and `/usr/share/ollama/`).
If Installed via the Install Script
1. Stop and Remove the Service
```bash
sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo systemctl daemon-reload
```

2. Remove the Binary and Libraries
```bash
sudo rm $(which ollama)
sudo rm -rf /usr/lib/ollama
```

3. Remove Downloaded Models and Data
```bash
sudo rm -rf /usr/share/ollama
```

If you changed the model storage location via `OLLAMA_MODELS`, remove that directory instead.
4. Remove the Service User and Group
The install script creates a dedicated ollama user and group:
```bash
sudo userdel ollama
sudo groupdel ollama
```

5. Remove User-Level Configuration
```bash
rm -rf ~/.ollama
```

After these steps, Ollama is completely gone from your system.
Building on Ollama
Once Ollama is running, it becomes the AI backend for a growing ecosystem of local-first tools. Two open-source projects that build directly on top of it:
- Solux — a workflow automation engine that chains inputs, transforms, and Ollama LLM steps into YAML-defined pipelines. Summarize webpages, transcribe podcasts, classify documents, and push results to Slack, Obsidian, or a vector store — all running locally. Install with `pip install solux` and run `solux init` to connect to your Ollama instance.
- Butler — an access-control reverse proxy for Ollama. If you share an Ollama server across multiple services or a homelab, Butler adds multi-user authentication (API keys, JWT standalone, OIDC federation), per-user model authorization and rate limiting, input filtering, and Prometheus observability. One Go binary, one YAML config, zero changes to Ollama or your clients.
For a hands-on example of using Ollama’s embedding and chat APIs together, see Build a Local RAG Pipeline with Ollama and ChromaDB.
Quick Reference
Cheat Sheet
```bash
# Install / update (inspect the script first — see Installation section)
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh
sh install-ollama.sh

# Run a model
ollama run llama3.2

# Manage models
ollama pull mistral
ollama list
ollama show llama3.2
ollama ps
ollama rm mistral

# Custom model from Modelfile
ollama create my-model -f ./Modelfile

# Server management (Linux)
sudo systemctl status ollama
sudo systemctl restart ollama
sudo systemctl edit ollama.service   # set env vars

# API
curl http://localhost:11434/api/chat -d '{"model":"llama3.2","messages":[{"role":"user","content":"hi"}],"stream":false}'

# Open WebUI
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

# OpenCode
ollama launch opencode
```