> ## Documentation Index
> Fetch the complete documentation index at: https://docs.wolffi.sh/llms.txt
> Use this file to discover all available pages before exploring further.

# Ollama

> Run AI models locally with zero internet dependency using Ollama

# Ollama Integration

Wolffish has complete, first-class integration with [Ollama](https://ollama.com) — an open-source local model runtime that lets you run LLMs entirely on your hardware. No API keys, no cloud dependency, no data leaving your machine.

## Why Ollama?

The deep goal of Wolffish is to run **completely local with zero exposure to the internet**. Every piece of data — memory, conversations, skills, task logs — already lives on your machine. The only component that traditionally requires the cloud is the LLM itself. Ollama closes that gap.

For power users with capable hardware, this means:

* **Total privacy** — Your prompts, responses, and tool outputs never leave your machine
* **Zero recurring cost** — No API bills, no token counting, no rate limits
* **Offline capability** — Full agentic workflows on an airplane, in a bunker, or behind an air-gapped network
* **No vendor dependency** — Your agent works regardless of API outages, pricing changes, or service discontinuation

This is the vision: a fully autonomous personal AI agent that operates as a local process on your own hardware, answering to no one but you.

## How It Works

Wolffish communicates with Ollama via its local HTTP API:

```
POST http://localhost:11434/api/chat
```

When you install Ollama and pull a model, Wolffish can:

1. **Detect** Ollama automatically on first launch (or later in Settings)
2. **Browse** available models on your machine
3. **Pull** new models directly from the Wolffish UI — no terminal needed
4. **Stream** responses using NDJSON streaming
5. **Call tools** via structured JSON in the model's response

The integration is seamless — select a local model and start chatting. Wolffish handles the message formatting, tool-call parsing, and streaming normalization internally.

## Setting Up Ollama

<Steps>
  <Step title="Install Ollama">
    Download from [ollama.com](https://ollama.com) and install. On macOS it's a single `.dmg`, on Linux a one-line curl command, on Windows a standard installer.
  </Step>

  <Step title="Pull a model">
    Either use the terminal (`ollama pull qwen3:14b`) or let Wolffish pull it for you from Settings → Models → Ollama.
  </Step>

  <Step title="Select in Wolffish">
    Open Settings → Models, select the Ollama tab, and choose your model. That's it — you're running local.
  </Step>
</Steps>

<Note>
  Ollama is **optional**. You can skip it entirely during onboarding and use only cloud providers (Claude, OpenAI). Wolffish will prompt you to configure at least one provider before you can start chatting.
</Note>

## Model Requirements for Agentic Tasks

Not all local models are equal. Wolffish's agentic capabilities — tool calling, multi-step reasoning, code execution, file manipulation — place specific demands on the model. Here's what you need to know:

### The Parameter Threshold

| Model Size | Conversational  | Simple Tools    | Multi-Step Agentic | Reliable Autonomous |
| ---------- | --------------- | --------------- | ------------------ | ------------------- |
| 1B–3B      | Basic chat only | Unreliable      | No                 | No                  |
| 7B–8B      | Good            | Inconsistent    | Struggles          | No                  |
| 14B        | Good            | Mostly reliable | Basic chains       | Fragile             |
| 32B–35B    | Excellent       | Reliable        | Handles well       | Sometimes           |
| 70B+       | Excellent       | Reliable        | Reliable           | Yes                 |

**The minimum for reliable agentic tool use is \~14B parameters**, but even then, complex multi-step workflows (research → write → format → post) will hit failure modes. For truly autonomous execution — where the agent chains 10+ tool calls without human intervention — you need **32B+ parameters** at minimum, and **70B+** for production-level reliability.

### Why Small Models Fail at Agentic Tasks

Tool calling requires the model to:

1. **Understand the instruction** — Parse what the user wants accomplished
2. **Plan the sequence** — Decide which tools to call, in what order
3. **Format tool calls correctly** — Output valid JSON with correct parameter names and types
4. **Interpret tool results** — Read the output and decide the next action
5. **Maintain context across turns** — Remember what it's already done across a multi-step chain
6. **Handle errors gracefully** — Retry, adjust, or ask for help when a tool fails

Small models (7B and below) typically fail at steps 3–6. They hallucinate parameter names, lose track of multi-step plans, output malformed JSON that breaks the tool-calling pipeline, and can't recover from errors. The result is a frustrating loop of retries that never converges.

### Recommended Models by Hardware

| RAM / VRAM  | Model                                   | Parameters | Agentic Reliability      |
| ----------- | --------------------------------------- | ---------- | ------------------------ |
| 8GB         | Qwen 3 8B, Gemma 3 4B                   | 4B–8B      | Conversation only        |
| 16GB        | Qwen 3 14B, Gemma 3 12B                 | 12B–14B    | Simple single-tool calls |
| 32GB        | Qwen 3 32B, QwQ 32B                     | 32B        | Multi-step workflows     |
| 48GB+       | Llama 3 70B, Qwen 2.5 72B               | 70B+       | Full autonomous agentic  |
| 64GB+ / GPU | DeepSeek-V2, Llama 3.1 405B (quantized) | 70B+       | Production-grade         |

<Warning>
  Quantization matters. A 70B model quantized to Q4\_0 fits in less RAM but loses capability. For agentic tasks, prefer Q5\_K\_M or higher quantization levels — the precision directly affects tool-call reliability.
</Warning>

### The Honest Truth

If you have a standard laptop with 8–16GB RAM, local models will handle conversations, summarization, and simple Q\&A well. But for the kind of autonomous multi-step workflows Wolffish excels at — researching topics, writing reports, managing files, executing shell commands in sequence — you'll get dramatically better results with a cloud provider like Claude or GPT-4.

The sweet spot for local-only agentic use:

* **Mac Studio / Mac Pro with 64GB+ unified memory** — Run 70B models at acceptable speed
* **Desktop with 24GB+ VRAM GPU** — Full-speed 70B inference via CUDA
* **High-end workstation with 128GB RAM** — Run quantized 100B+ models

For everyone else, we recommend: **cloud providers for complex agentic tasks, Ollama for privacy-sensitive conversations and offline fallback.**

## Reasoning modes

The **brain icon** next to the message box controls whether this model reasons. Click it to toggle reasoning on or off for models that support it.

### Thinking — *whether* the model reasons

* **Off** — the model answers immediately. Fastest; ideal for simple, direct tasks.
* **On** — the model first works through the problem in a dedicated reasoning pass before replying. Slower and uses more tokens, but markedly more accurate on multi-step, logical, or ambiguous tasks.

### Button states

| State | Colour | Meaning                      |
| ----- | ------ | ---------------------------- |
| Off   | gray   | Thinking off — direct answer |
| On    | blue   | Thinking on                  |

Each model shows only the states it genuinely supports. If a model can't reason, the button locks where there's nothing to change. Wolffish remembers your choice per model.

**On Ollama:** Reasoning is detected per pulled model from Ollama's capabilities — models that advertise a thinking capability (e.g. qwen3, deepseek-r1, gpt-oss) reason, others don't. It's a simple on/off with no effort tiers, and no API key or cost since it runs locally.

## Local-Only Mode

Wolffish includes a "Local Only" toggle that restricts all inference to Ollama — no data ever touches a cloud API, no matter which Brain model is selected. Enable it in the chat sidebar when you need absolute privacy for a sensitive task.

In local-only mode:

* Inference is forced to the local Ollama model regardless of which Brain you've selected
* No network requests are made for LLM inference
* Memory consolidation uses the local model
* All other features (memory, tools, capabilities) work normally

## Limitations

* **Speed** — Local inference is slower than cloud APIs, especially on CPU-only machines. Expect 5–30 tokens/second depending on model size and hardware, versus 80–150 tokens/second from cloud providers.
* **Context window** — Most local models support 4K–32K context. Cloud models offer 128K–200K. Long conversations or large documents may exceed local model limits.
* **Tool-call formatting** — Smaller models sometimes output malformed tool calls. Wolffish has retry logic, but repeated failures will end the turn.
* **No computer-use** — Computer-use (screen interaction) requires vision capabilities that most local models lack. This capability currently requires Claude.

## The Vision

We built Ollama integration because we believe the future of personal AI is local. Today, the best models are cloud-hosted. But model sizes are shrinking while capabilities grow. The gap between a 70B local model and a cloud frontier model narrows with every release.

Wolffish is built for that future — where a single machine runs a fully capable AI agent with no internet connection, no subscription, no data leaving your control. Every architectural decision (stateless model, markdown-first, local memory) is designed so that the day a 14B model can reliably execute 20-tool agentic chains, Wolffish is ready. No code changes needed — just swap the model.

Until then, use cloud providers for the heavy lifting and Ollama for what it does best: private, offline, always-available local inference.
