Tutorial

Ollama: Run LLMs Locally

Install and manage local LLMs with Ollama. Covers CLI usage, model management, Modelfiles, the REST API, tool-calling models, Open WebUI, and integration with coding tools like OpenCode.

19 min read · beginner

Prerequisites

  • A machine with at least 8 GB RAM (16+ GB recommended)
  • Basic comfort with the Linux command line

Ollama is the easiest way to run large language models locally. It handles downloading, quantization, GPU acceleration, and serving models behind a simple CLI and REST API. Think of it as Docker for LLMs.

This tutorial covers everything from installation through advanced configuration, so you can use Ollama as a local AI backend for coding tools, chat interfaces, and your own projects.

Installation

Linux

The officially recommended method is the install script:

curl -fsSL https://ollama.com/install.sh | sh

Warning

Piping scripts from the internet into your shell is a security risk — you are giving an external party root access to your machine. Always inspect the script first:

curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh   # read it, understand what it does
sh install-ollama.sh

The script calls sudo internally only where it needs elevated privileges (installing the binary, creating a systemd service, etc.), so do not run it with sudo yourself — that would give the entire script root access unnecessarily. The script is also open source on GitHub if you prefer to review it there.

Alternatively, Ollama is available as a snap — no script piping required:

sudo snap install ollama

The snap is published by the Ollama team and handles updates automatically through the snap daemon.

Verify the installation either way:

ollama --version

macOS and Windows

On macOS, download the .zip from ollama.com/download and drag the app into /Applications. The CLI is installed automatically.

On Windows, download and run the .exe installer from the same page.

Basic CLI Usage

Run a model interactively (downloading it first if needed):

ollama run llama3.2

This drops you into a chat session. Type your prompt and press Enter. Use """ to begin and end multiline input. Type /bye to exit.

Here is the full command reference:

| Command | Description |
|---|---|
| `ollama run <model>` | Download (if needed) and run a model interactively |
| `ollama pull <model>` | Download a model without running it |
| `ollama list` | List all locally downloaded models |
| `ollama show <model>` | Show model info (architecture, quantization, template, license) |
| `ollama ps` | List models currently loaded in memory |
| `ollama stop <model>` | Unload a model from memory |
| `ollama rm <model>` | Delete a downloaded model |
| `ollama cp <src> <dest>` | Duplicate a model |
| `ollama create <name> -f <Modelfile>` | Create a custom model from a Modelfile |
| `ollama serve` | Start the Ollama server manually |

Inside an interactive session, you can also inspect the running model:

| Command | Description |
|---|---|
| `/show info` | Architecture and parameters |
| `/show modelfile` | The Modelfile used to build this model |
| `/show system` | System prompt |
| `/show template` | Prompt template |
| `/show license` | License text |

Managing Models

Pulling Models

Models follow a name:tag naming convention. The tag defaults to latest if omitted:

ollama pull llama3.2           # default tag (typically q4_K_M quantization)
ollama pull llama3.2:1b        # 1B parameter variant
ollama pull llama3:8b-instruct-q8_0  # specific quantization

Tags encode the parameter count, variant, and quantization level. The quantization suffixes control the tradeoff between model size and quality:

  • q4_0 — 4-bit, smallest and fastest
  • q4_K_M — Ollama’s default for most models, best balance of quality and size
  • q5_K_M — slightly better quality than q4_K_M if you have VRAM headroom
  • q8_0 — 8-bit, near-lossless quality, roughly double the size of q4

The K-quant variants (_K_S, _K_M, _K_L) use smarter quantization that preserves more accuracy at similar file sizes.

Inspecting and Removing Models

ollama list                  # see all downloaded models with sizes
ollama show llama3.2         # architecture, context length, quantization, etc.
ollama ps                    # which models are loaded, GPU/CPU split, memory usage
ollama rm llama3.2           # delete a model

Here are some popular models worth knowing about:

| Model | Sizes | Notes |
|---|---|---|
| `llama3.2` | 1B, 3B | Compact Meta models, good for constrained hardware |
| `llama3.1` | 8B, 70B, 405B | Meta's flagship, most-downloaded model on Ollama |
| `gemma3` | 1B-27B | Google, multimodal (text + vision) |
| `qwen3` | 0.6B-235B | Alibaba, dense and mixture-of-experts variants |
| `mistral` | 7B | Mistral AI, fast and capable for its size |
| `deepseek-r1` | 1.5B-671B | Strong reasoning, chain-of-thought |
| `phi4` | 14B | Microsoft, punches above its weight |
| `qwen2.5-coder` | 1.5B-32B | Code-specialized |
| `nomic-embed-text` | | Embedding model for RAG pipelines |
| `llava` | 7B, 13B, 34B | Vision-language model |

Browse the full library at ollama.com/library.

Hardware, Model Sizes, and Performance

What “7B” Means

When a model is described as “7B” or “70B”, the number refers to billions of parameters — the learned weights that make up the neural network. More parameters generally means a more capable model, but also more memory and compute.

Think of it as a rough measure of the model’s “brain size”:

  • 1B—3B: Can follow simple instructions, summarize text, and answer basic questions. Prone to hallucination and poor at complex reasoning. Useful for lightweight tasks, autocomplete, or hardware-constrained environments.
  • 7B—8B: The sweet spot for most local use. Good at conversation, coding assistance, and general knowledge. Runs comfortably on a single consumer GPU.
  • 13B—14B: Noticeably smarter than 7-8B. Better at nuanced reasoning, longer documents, and following complex instructions.
  • 27B—32B: Approaching the quality of cloud-hosted models for many tasks. Requires a high-end GPU or Apple Silicon with 32+ GB of memory.
  • 70B+: Near frontier-model quality. Requires professional hardware, multi-GPU setups, or a Mac with 64+ GB of unified memory.

Quantization and File Size

You do not run models at full precision. Ollama uses quantization to compress model weights from 16-bit floats down to 4-8 bits, dramatically reducing file size and memory usage with surprisingly little quality loss.

The formula is straightforward: size in GB ≈ (parameters in billions × bits per weight) / 8. Here is what that looks like in practice:

| Parameters | q4_0 (4-bit) | q4_K_M (~4.5-bit) | q5_K_M (~5.5-bit) | q8_0 (8-bit) | f16 (16-bit) |
|---|---|---|---|---|---|
| 1B | ~0.6 GB | ~0.7 GB | ~0.8 GB | ~1.1 GB | ~2 GB |
| 3B | ~1.7 GB | ~2.0 GB | ~2.3 GB | ~3.2 GB | ~6 GB |
| 7-8B | ~4 GB | ~4.5 GB | ~5.2 GB | ~8 GB | ~15 GB |
| 13-14B | ~7 GB | ~8 GB | ~9.5 GB | ~14 GB | ~27 GB |
| 27B | ~13.5 GB | ~16 GB | ~18.5 GB | ~27 GB | ~54 GB |
| 32B | ~16 GB | ~19 GB | ~22 GB | ~32 GB | ~64 GB |
| 70B | ~35 GB | ~40 GB | ~47 GB | ~70 GB | ~140 GB |

These numbers represent both the download size and the memory consumed by model weights alone. Ollama defaults to q4_K_M for most models — it is the best balance of quality, speed, and memory.
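The size formula is easy to script. A minimal sketch — the ~4.5 and ~5.5 bits-per-weight figures for the K-quants are the same approximations used in the table above:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Estimate model file size: (parameters in billions x bits per weight) / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

# An 8B model at q4_K_M (~4.5 bits/weight) works out to roughly 4.5 GB
print(f"{model_size_gb(8, 4.5):.1f} GB")  # → 4.5 GB
```

The same call with `bits_per_weight=8` or `16` reproduces the q8_0 and f16 columns.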

VRAM Requirements

Total VRAM usage is model weights + KV cache + overhead. The KV cache stores the context of the conversation and grows with context length. At a 4K context window, it is relatively small (~0.5-1.5 GB), but at longer contexts it can exceed the size of the model itself.

Practical totals at q4_K_M with ~4K context:

| Parameters | Weights | + KV Cache + Overhead | Total VRAM |
|---|---|---|---|
| 1B | ~0.7 GB | ~0.5 GB | ~1.5 GB |
| 3B | ~2 GB | ~0.5 GB | ~3 GB |
| 7-8B | ~4.5 GB | ~1 GB | ~6 GB |
| 13-14B | ~8 GB | ~1.5 GB | ~10 GB |
| 27B | ~16 GB | ~2 GB | ~18 GB |
| 32B | ~19 GB | ~2.5 GB | ~22 GB |
| 70B | ~40 GB | ~3 GB | ~43 GB |

Practical VRAM Guide

If you are wondering which models your hardware can run:

| Available VRAM | What Fits (q4_K_M) | Examples |
|---|---|---|
| 4 GB | Up to 3B | Llama 3.2 1B/3B, Phi-3 Mini |
| 6 GB | Up to 7B | Mistral 7B, Qwen 2.5 7B |
| 8 GB | Up to 8B | Llama 3.1 8B, Gemma 2 9B |
| 12 GB | Up to 13-14B | Phi-4 14B, Llama 2 13B |
| 16 GB | Up to 14B with generous context | RTX 4070 Ti / 4080 territory |
| 24 GB | Up to 32B | Qwen 2.5 32B, DeepSeek R1 32B |
| 48 GB | 70B models | Llama 3.1 70B, Qwen 2.5 72B |

Apple Silicon is a special case. Its unified memory is shared between CPU and GPU with no PCIe bottleneck, so a 32 GB Mac can run models that would require a $1500 NVIDIA GPU on a PC. A 64 GB Mac can run 70B models comfortably.

CPU vs. GPU Performance

The difference is dramatic. Here are rough token generation speeds for a 7-8B model at q4_K_M:

| Hardware | ~Tokens/sec | Feel |
|---|---|---|
| Laptop CPU (Intel i7) | 7-10 | Usable but slow |
| Desktop CPU (Ryzen 7) | 12-20 | Comfortable for chat |
| RTX 3060 (12 GB) | 50-70 | Smooth |
| RTX 4070 Ti (16 GB) | 50-80 | Smooth |
| RTX 4090 (24 GB) | 100-140 | Instant |
| Apple M1 Pro/Max | 25-40 | Good |
| Apple M3 Max | 45-55 | Very good |
| Apple M4 Max | 55-60 | Excellent |

As a rough reference: 2 tok/s feels like watching someone type slowly. 10 tok/s is comfortable for reading chat responses. 30+ tok/s is what you want for coding tools and streaming integrations.
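You can measure your own hardware: a non-streaming `/api/generate` response includes the timing fields `eval_count` (tokens generated) and `eval_duration` (nanoseconds). A minimal sketch of the arithmetic, using a recorded response in place of a live call:

```python
def tokens_per_second(resp: dict) -> float:
    """Generation speed from Ollama's timing fields: eval_count tokens over eval_duration ns."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Example fields as they appear in a non-streaming /api/generate response:
sample = {"eval_count": 120, "eval_duration": 2_400_000_000}  # 120 tokens in 2.4 s
print(tokens_per_second(sample))  # → 50.0
```

With a live server, POST to `/api/generate` with `"stream": false` and feed the JSON response to this function.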

Tip

For Apple Silicon, memory bandwidth matters more than chip generation. An M3 Max (400 GB/s) can outperform an M4 Pro (273 GB/s) for LLM inference because bandwidth is the bottleneck. When choosing a Mac for local LLMs, prioritize: (1) enough unified memory to fit the model, then (2) maximum memory bandwidth.

GPU vs. CPU Offloading

When a model does not fully fit in VRAM, Ollama automatically splits layers between GPU and CPU. You can see this in ollama ps:

$ ollama ps
NAME              ID              SIZE      PROCESSOR          UNTIL
llama3.1:8b       a23b46a1c3e2    6.3 GB    100% GPU           4 minutes from now
deepseek-r1:70b   a951a23b46a1    42 GB     78%/22% CPU/GPU    4 minutes from now

The PROCESSOR column tells you what is happening. 100% GPU is ideal. Any CPU/GPU split means significantly degraded performance — the GPU has to wait for the CPU to process its layers, and PCIe bandwidth (~16 GB/s) is a fraction of VRAM bandwidth (~900 GB/s on an RTX 4090).

Warning

A model that is 90% in VRAM and 10% on CPU will be dramatically slower than one running 100% on GPU. If a model only just fails to fit, it is often better to drop to a smaller model or a lower quantization than to accept the split.

Context Length and Memory

Longer context windows consume more VRAM through the KV cache. This is where people often run out of memory unexpectedly:

| Context Length | KV Cache (8B model, f16) | Total VRAM with q4_K_M Weights |
|---|---|---|
| 4,096 | ~0.5 GB | ~6 GB |
| 16,384 | ~1.8 GB | ~7 GB |
| 32,768 | ~3.6 GB | ~9 GB |
| 65,536 | ~7.2 GB | ~12 GB |
| 131,072 | ~14.4 GB | ~20 GB |
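The KV-cache numbers follow from the attention geometry: for every token, each layer stores one key and one value vector per KV head. A sketch of that arithmetic, using Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) — these are architectural facts about the model, not Ollama settings:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x context x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

# Llama 3.1 8B at a 4K context with an f16 (2-byte) cache
print(kv_cache_gb(32, 8, 128, 4096))  # → 0.5
```

Setting `bytes_per_elem=1` models a `q8_0` cache (half the memory), which is exactly what `OLLAMA_KV_CACHE_TYPE` controls.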

At 128K context, the KV cache alone uses more memory than the model weights. If you need long contexts on limited VRAM, quantize the KV cache:

sudo systemctl edit ollama.service

Then add under [Service] and restart the service:

[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

This halves KV cache memory with negligible quality impact. Setting q4_0 cuts it to a quarter.

Checking Your GPU

To verify Ollama detected your GPU:

ollama run llama3.2:3b "hello"   # load a model
ollama ps                         # check the Processor column

For more detail, check the server logs:

journalctl -u ollama --no-pager | grep -i gpu

On NVIDIA systems, nvidia-smi shows real-time VRAM usage while a model is loaded.

Keeping Ollama Updated

If you installed via snap, updates happen automatically. You can also trigger one manually:

sudo snap refresh ollama

If you installed via the install script, re-run it (with the same inspect-first caveat from the installation section). It safely overwrites the existing binary while preserving your models:

curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh
sh install-ollama.sh

Tip

When upgrading from a significantly older version, first remove stale libraries: sudo rm -rf /usr/lib/ollama

On macOS, the app auto-detects updates — click the menubar icon and select “Restart to update.” On Windows, download and run the latest installer.

Models are stored separately from the Ollama binary, so they survive upgrades. Default locations:

  • Linux: /usr/share/ollama/.ollama/models
  • macOS: ~/.ollama/models
  • Windows: C:\Users\%username%\.ollama\models

Modelfiles

Modelfiles are to Ollama what Dockerfiles are to Docker: declarative configuration files that define a custom model. They let you set a system prompt, tune parameters, and seed conversation examples on top of a base model.

Syntax

| Instruction | Required | Description |
|---|---|---|
| `FROM` | Yes | Base model name or path to a GGUF file |
| `SYSTEM` | No | System prompt defining the model's role and behavior |
| `PARAMETER` | No | Runtime parameter (temperature, context size, etc.) |
| `TEMPLATE` | No | Go template for prompt formatting |
| `ADAPTER` | No | Path to a LoRA fine-tuned adapter |
| `MESSAGE` | No | Seed conversation examples (user, assistant, system) |
| `LICENSE` | No | License text for the model |

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.8 | Creativity (0.0 = deterministic, 1.0+ = creative) |
| `num_ctx` | 2048 | Context window size in tokens |
| `top_k` | 40 | Limit token selection to the top K candidates |
| `top_p` | 0.9 | Nucleus sampling threshold |
| `repeat_penalty` | 1.1 | Penalize repeated tokens |
| `num_predict` | -1 | Max tokens to generate (-1 = unlimited) |
| `stop` | | Stop sequence(s) to halt generation |
| `seed` | 0 | Random seed (0 = random; set for reproducibility) |

Example: Focused Technical Assistant

Create a file called Modelfile:

FROM llama3.2

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

SYSTEM """You are an expert technical assistant. Provide clear,
accurate, and concise answers to programming and system
administration questions. Include code examples when helpful."""

Build and run it:

ollama create tech-assistant -f ./Modelfile
ollama run tech-assistant

Example: Creative Writing Model

FROM llama3.2

PARAMETER temperature 0.9
PARAMETER num_ctx 8192

SYSTEM """You are a creative writing assistant. Generate engaging,
imaginative prose with vivid descriptions."""

Example: Seeded Conversation

You can prime the model with example exchanges using MESSAGE:

FROM llama3.2

PARAMETER temperature 0.5

SYSTEM You are a helpful coding tutor who explains concepts with examples.

MESSAGE user How do I declare a variable in Python?
MESSAGE assistant In Python, you simply assign a value: `x = 42`. No type declaration needed — Python infers the type at runtime.

Inspecting Existing Modelfiles

To see the Modelfile behind any model:

ollama show --modelfile llama3.2

This is useful for understanding how official models are configured before building your own.

Tool-Calling Models

Some models support tool use (also called function calling), meaning the model can request that your application execute a function and return the result. This is how LLMs interact with filesystems, databases, APIs, and other external systems.

Which Models Support Tools?

Not all models support tool calling. Models that do include:

  • llama3.1 and llama3.2
  • qwen3 and qwen2.5
  • mistral and mistral-nemo
  • gemma3
  • command-r and command-r-plus

You can check whether a model supports tools by looking at its page on ollama.com/library — models that support tools are tagged accordingly.

How Tool Calling Works

Tool calling is used through the /api/chat endpoint. You send a list of available tools (as JSON Schema function definitions) along with the conversation, and the model may respond with a tool_calls message instead of a regular text response. Your application then executes the function, sends the result back, and the model incorporates it into its answer.

Here is the flow:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What files are in /tmp?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_directory",
        "description": "List files in a directory",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {
              "type": "string",
              "description": "The directory path to list"
            }
          },
          "required": ["path"]
        }
      }
    }
  ],
  "stream": false
}'

If the model decides to use the tool, the response will contain a tool_calls array instead of a regular content field:

{
  "message": {
    "role": "assistant",
    "tool_calls": [
      {
        "function": {
          "name": "list_directory",
          "arguments": { "path": "/tmp" }
        }
      }
    ]
  }
}

Your application executes the function, then sends the result back as a tool role message:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What files are in /tmp?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "list_directory", "arguments": {"path": "/tmp"}}}]},
    {"role": "tool", "content": "file1.txt\nfile2.log\ndata.csv"}
  ],
  "stream": false
}'

The model then responds with a natural language answer incorporating the tool result.
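When building your own integration, the loop amounts to: send tools, check for `tool_calls`, execute them, append `tool` messages, and resend. A minimal dispatcher sketch — `list_directory` and its implementation are this tutorial's running example, not part of Ollama:

```python
import os

# Local implementations for each tool the model is allowed to call
TOOLS = {
    "list_directory": lambda path: "\n".join(sorted(os.listdir(path))),
}

def execute_tool_calls(message: dict) -> list[dict]:
    """Turn a model's tool_calls into tool-role messages to append to the conversation."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        output = TOOLS[fn["name"]](**fn["arguments"])
        results.append({"role": "tool", "content": output})
    return results

# Simulated assistant message, shaped like Ollama's /api/chat tool-call response
assistant = {
    "role": "assistant",
    "tool_calls": [{"function": {"name": "list_directory", "arguments": {"path": "/tmp"}}}],
}
print(execute_tool_calls(assistant))
```

The returned messages go straight into the `messages` array of the follow-up `/api/chat` request, exactly as in the curl example above.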

Warning

Tool calling requires careful prompt engineering with smaller models. If a model is not reliably generating tool calls, try a larger variant (e.g., move from 8B to 70B) or increase num_ctx to give the model more room to reason about the tool definitions.

Context Window Matters

Tool definitions are injected into the prompt and consume context tokens. If you define many tools or tools with complex schemas, you may need to increase num_ctx:

FROM llama3.2
PARAMETER num_ctx 8192

Applications like OpenCode and Open WebUI handle the tool-calling loop automatically — you only need to deal with it directly when building your own integrations.

Running Ollama as a Server

The Systemd Service

On Linux, the install script sets up Ollama as a systemd service that starts automatically:

sudo systemctl status ollama     # check status
sudo systemctl start ollama      # start
sudo systemctl stop ollama       # stop
sudo systemctl restart ollama    # restart

You can verify the server is running by hitting the root endpoint:

curl http://localhost:11434
# Ollama is running

Manual Server

If you are not using systemd, start the server yourself:

ollama serve

By default, Ollama binds to 127.0.0.1:11434 (localhost only).

Exposing Ollama to the Network

To let other machines or services on your LAN reach Ollama, change the bind address. For systemd:

sudo systemctl edit ollama.service

Add under [Service]:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Warning

Setting OLLAMA_HOST=0.0.0.0 exposes Ollama to your entire network. Only do this on trusted networks, or use a reverse proxy like Butler to add multi-user authentication (API keys, JWT, or OIDC), model-level authorization, and rate limiting.

For a manual server, just export the variable:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Note

Shell-exported environment variables do not apply to systemd services. You must use systemctl edit for persistent configuration.

Environment Variables

These control server behavior. Set them via systemctl edit on Linux, launchctl setenv on macOS, or system environment variables on Windows:

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_HOST` | `127.0.0.1:11434` | Bind address and port |
| `OLLAMA_MODELS` | OS-specific | Model storage directory |
| `OLLAMA_ORIGINS` | | Allowed CORS origins (e.g., `*` for dev) |
| `OLLAMA_KEEP_ALIVE` | `5m` | How long models stay loaded (-1 = forever, 0 = unload immediately) |
| `OLLAMA_NUM_PARALLEL` | 1 | Max parallel requests per model |
| `OLLAMA_MAX_LOADED_MODELS` | Auto | Max models loaded simultaneously |
| `OLLAMA_CONTEXT_LENGTH` | VRAM-dependent | Default context window (4K/32K/256K based on available VRAM) |
| `OLLAMA_FLASH_ATTENTION` | | Set to 1 to enable flash attention (saves VRAM) |
| `OLLAMA_KV_CACHE_TYPE` | `f16` | K/V cache quantization: `q8_0` (half memory) or `q4_0` (quarter) |
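On Linux, several of these are typically combined in one systemd drop-in. A sketch of what `sudo systemctl edit ollama.service` might produce — the values here are illustrative, not recommendations:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
```

Run sudo systemctl daemon-reload and sudo systemctl restart ollama afterwards for the changes to take effect.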

Serving Models for Other Local Services

If you have a project that expects an LLM API on a specific port, you can point it at Ollama. For example, a project at ../linkedin-copilot that expects an OpenAI-compatible API:

# Ensure Ollama is running
sudo systemctl start ollama

# Pull the model you want to serve
ollama pull llama3.2

# Point your application at Ollama's OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama   # any non-empty string works

Ollama’s /v1 endpoint is a drop-in replacement for the OpenAI API, so most tools that support a configurable OpenAI base URL will work without modification.

Multiple Services, One Port

Ollama listens on a single port, but every API request specifies which model to use in the request body ("model": "llama3.2"). This means multiple applications can share the same Ollama server — each one simply requests a different model (or the same one). Ollama handles the multiplexing:

  • It loads models into memory on demand and keeps them loaded for 5 minutes by default (configurable via OLLAMA_KEEP_ALIVE).
  • It can hold multiple models in memory simultaneously (configurable via OLLAMA_MAX_LOADED_MODELS, which defaults to 3 per GPU).
  • It can handle parallel requests to the same model (configurable via OLLAMA_NUM_PARALLEL).

Per-request parameters like temperature, num_ctx, and top_p are sent as part of each API call, so different services can use different settings against the same model without conflicting.

# Service A talks to mistral
curl http://localhost:11434/api/chat -d '{"model":"mistral","messages":[{"role":"user","content":"Summarize DNS in one sentence."}],"stream":false}'

# Service B talks to llama3.2 on the same port
curl http://localhost:11434/api/chat -d '{"model":"llama3.2","messages":[{"role":"user","content":"List three HTTP status codes and what they mean."}],"stream":false}'

# Both are served by the same Ollama process
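The same per-request isolation applies to sampling settings: the native API accepts an options object on each call, so one service can run the shared model cold and precise while another runs it creative. The parameter values here are illustrative:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Explain TCP in one sentence."}],
  "options": {"temperature": 0.2, "num_ctx": 8192},
  "stream": false
}'
```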

Note

Ollama has no built-in access control. Any client that can reach the port can use any model with any parameters. If you need to restrict which services can access which models, Butler is an access-control reverse proxy purpose-built for Ollama — it adds multi-user authentication (API keys, JWT, OIDC federation), per-user model policies with rate limiting, and structured audit logging. Alternatively, a general-purpose reverse proxy (nginx, Caddy) can add basic authentication. This is another good reason to keep OLLAMA_HOST bound to 127.0.0.1 unless you have a specific reason to expose it.

To keep models loaded in memory and avoid cold-start latency when switching between them:

sudo systemctl edit ollama.service

Add under [Service], then restart Ollama:

[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"

The REST API

Ollama exposes a REST API on port 11434. The two most important endpoints:

Generate a Completion

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain DNS in one paragraph.",
  "stream": false
}'

Chat (Multi-Turn Conversation)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a VPN?"}
  ],
  "stream": false
}'

Full Endpoint Reference

| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/generate` | Text completion |
| POST | `/api/chat` | Chat completion (conversation) |
| POST | `/api/embed` | Generate embeddings |
| GET | `/api/tags` | List local models |
| POST | `/api/show` | Show model info |
| POST | `/api/pull` | Download a model |
| DELETE | `/api/delete` | Delete a model |
| POST | `/api/create` | Create a custom model |
| POST | `/api/copy` | Duplicate a model |
| GET | `/api/ps` | List running models |

OpenAI-Compatible Endpoint

Ollama also serves an OpenAI-compatible API at /v1/chat/completions. This means you can use OpenAI client libraries with Ollama by changing the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but unused
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Tip

Streaming is on by default for the native API. Set "stream": false for simpler scripting. The /v1 OpenAI-compatible endpoint also supports streaming via SSE.
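When streaming is on, the native API emits one JSON object per line (NDJSON), each carrying a fragment of the reply; the final object has `"done": true`. A minimal sketch of reassembling a streamed `/api/chat` response — the sample lines below stand in for a live stream:

```python
import json

def collect_chat_stream(lines) -> str:
    """Concatenate message.content fragments from Ollama's NDJSON chat stream."""
    reply = []
    for line in lines:
        chunk = json.loads(line)
        reply.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(reply)

# Shape of the per-line objects in a streamed /api/chat response
stream = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_chat_stream(stream))  # → Hello!
```

With a live server, iterate over the response body line by line and pass the lines to the same function.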

Open WebUI

Open WebUI gives you a ChatGPT-like web interface for your local Ollama models. It runs as a Docker container and stores all data locally.

Quick Start with Docker

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. The first account you create gets admin privileges.

The --add-host flag lets the container reach Ollama running on your host machine. The -v flag persists your chat history and settings across container restarts.

With NVIDIA GPU Passthrough

docker run -d -p 3000:8080 \
  --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda

Single-User Mode (No Login)

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e WEBUI_AUTH=False \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Docker Compose

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - '3000:8080'
    volumes:
      - open-webui:/app/backend/data
    extra_hosts:
      - 'host.docker.internal:host-gateway'
    restart: always

volumes:
  open-webui:

Updating Open WebUI

docker rm -f open-webui
docker pull ghcr.io/open-webui/open-webui:main
# Re-run the docker run command — the volume preserves your data

Without Docker

pip install open-webui
open-webui serve

Using Ollama with OpenCode

OpenCode is an open-source terminal-based AI coding assistant. You can connect it to Ollama for a fully local, private coding workflow.

Manual Configuration

Create or edit ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen2.5-coder:32b": {
          "name": "Qwen 2.5 Coder 32B"
        }
      }
    }
  }
}

Warning

OpenCode needs a generous context window. Ollama’s default context length is VRAM-dependent, but OpenCode works best at 64K or higher. Create a Modelfile with a larger num_ctx, or set the OLLAMA_CONTEXT_LENGTH environment variable.
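A custom variant with a larger window only takes a two-line Modelfile. The base model and name here are illustrative:

```
FROM qwen2.5-coder:32b
PARAMETER num_ctx 65536
```

Build it with ollama create qwen2.5-coder-64k -f ./Modelfile, then reference qwen2.5-coder-64k in your opencode.json models block instead of the base tag.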

Zero-Config with ollama launch

Ollama v0.15+ includes a launch command that handles all configuration automatically:

ollama launch                                # interactive tool picker
ollama launch opencode                       # launch OpenCode directly
ollama launch opencode --model qwen3-coder   # specify model

This also works with other coding tools:

ollama launch claude --model qwen3
ollama launch codex --model llama3.2

Models that work well in these coding workflows:

| Model | Context | Notes |
|---|---|---|
| `qwen3-coder` | 256K | Purpose-built for agentic coding |
| `qwen2.5-coder:32b` | 32K | Strong code generation, needs ~20 GB RAM |
| `qwen2.5-coder:7b` | 32K | Good balance for 16 GB machines |
| `deepseek-coder:33b` | 16K | Strong at code completion |

Complete Removal

If you need to fully uninstall Ollama and clean up everything it left behind:

If Installed via Snap

sudo snap remove ollama

This removes the binary, the service, and any snap-managed data. You will still need to manually remove any models stored outside the snap (check ~/.ollama/ and /usr/share/ollama/).

If Installed via the Install Script

1. Stop and Remove the Service

sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo systemctl daemon-reload

2. Remove the Binary and Libraries

sudo rm $(which ollama)
sudo rm -rf /usr/lib/ollama

3. Remove Downloaded Models and Data

sudo rm -rf /usr/share/ollama

If you changed the model storage location via OLLAMA_MODELS, remove that directory instead.

4. Remove the Service User and Group

The install script creates a dedicated ollama user and group:

sudo userdel ollama
sudo groupdel ollama

5. Remove User-Level Configuration

rm -rf ~/.ollama

After these steps, Ollama is completely gone from your system.

Building on Ollama

Once Ollama is running, it becomes the AI backend for a growing ecosystem of local-first tools. Two open-source projects that build directly on top of it:

  • Solux — a workflow automation engine that chains inputs, transforms, and Ollama LLM steps into YAML-defined pipelines. Summarize webpages, transcribe podcasts, classify documents, and push results to Slack, Obsidian, or a vector store — all running locally. Install with pip install solux and run solux init to connect to your Ollama instance.

  • Butler — an access-control reverse proxy for Ollama. If you share an Ollama server across multiple services or a homelab, Butler adds multi-user authentication (API keys, JWT standalone, OIDC federation), per-user model authorization and rate limiting, input filtering, and Prometheus observability. One Go binary, one YAML config, zero changes to Ollama or your clients.

For a hands-on example of using Ollama’s embedding and chat APIs together, see Build a Local RAG Pipeline with Ollama and ChromaDB.

Quick Reference

Cheat Sheet

# Install / update (inspect the script first — see Installation section)
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh
sh install-ollama.sh

# Run a model
ollama run llama3.2

# Manage models
ollama pull mistral
ollama list
ollama show llama3.2
ollama ps
ollama rm mistral

# Custom model from Modelfile
ollama create my-model -f ./Modelfile

# Server management (Linux)
sudo systemctl status ollama
sudo systemctl restart ollama
sudo systemctl edit ollama.service    # set env vars

# API
curl http://localhost:11434/api/chat -d '{"model":"llama3.2","messages":[{"role":"user","content":"hi"}],"stream":false}'

# Open WebUI
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

# OpenCode
ollama launch opencode