Ollama Integration

Wolffish has complete, first-class integration with Ollama — an open-source local model runtime that lets you run LLMs entirely on your hardware. No API keys, no cloud dependency, no data leaving your machine.

Why Ollama?

The deep goal of Wolffish is to run completely local with zero exposure to the internet. Every piece of data — memory, conversations, skills, task logs — already lives on your machine. The only component that traditionally requires the cloud is the LLM itself. Ollama closes that gap. For power users with capable hardware, this means:

Total privacy — Your prompts, responses, and tool outputs never leave your machine
Zero recurring cost — No API bills, no token counting, no rate limits
Offline capability — Full agentic workflows on an airplane, in a bunker, or behind an air-gapped network
No vendor dependency — Your agent works regardless of API outages, pricing changes, or service discontinuation

This is the vision: a fully autonomous personal AI agent that operates as a local process on your own hardware, answering to no one but you.

How It Works

Wolffish communicates with Ollama via its local HTTP API:

POST http://localhost:11434/api/chat

When you install Ollama and pull a model, Wolffish can:

Detect Ollama automatically on first launch (or later in Settings)
Browse available models on your machine
Pull new models directly from the Wolffish UI — no terminal needed
Stream responses using NDJSON streaming
Call tools via structured JSON in the model’s response

The integration is seamless — select a local model and start chatting. Wolffish handles the message formatting, tool-call parsing, and streaming normalization internally.

Setting Up Ollama

Install Ollama

Download from ollama.com and install. On macOS it’s a single .dmg, on Linux a one-line curl command, on Windows a standard installer.

Pull a model

Either use the terminal (ollama pull qwen3:14b) or let Wolffish pull it for you from Settings → Models → Ollama.

Select in Wolffish

Open Settings → Models, select the Ollama tab, and choose your model. That’s it — you’re running local.

Ollama is optional. You can skip it entirely during onboarding and use only cloud providers (Claude, OpenAI). Wolffish will prompt you to configure at least one provider before you can start chatting.

Model Requirements for Agentic Tasks

Not all local models are equal. Wolffish’s agentic capabilities — tool calling, multi-step reasoning, code execution, file manipulation — place specific demands on the model. Here’s what you need to know:

The Parameter Threshold

Model Size	Conversational	Simple Tools	Multi-Step Agentic	Reliable Autonomous
1B–3B	Basic chat only	Unreliable	No	No
7B–8B	Good	Inconsistent	Struggles	No
14B	Good	Mostly reliable	Basic chains	Fragile
32B–35B	Excellent	Reliable	Handles well	Sometimes
70B+	Excellent	Reliable	Reliable	Yes

The minimum for reliable agentic tool use is ~14B parameters, but even then, complex multi-step workflows (research → write → format → post) will hit failure modes. For truly autonomous execution — where the agent chains 10+ tool calls without human intervention — you need 32B+ parameters at minimum, and 70B+ for production-level reliability.

Why Small Models Fail at Agentic Tasks

Tool calling requires the model to:

Understand the instruction — Parse what the user wants accomplished
Plan the sequence — Decide which tools to call, in what order
Format tool calls correctly — Output valid JSON with correct parameter names and types
Interpret tool results — Read the output and decide the next action
Maintain context across turns — Remember what it’s already done across a multi-step chain
Handle errors gracefully — Retry, adjust, or ask for help when a tool fails

Small models (7B and below) typically fail at steps 3–6. They hallucinate parameter names, lose track of multi-step plans, output malformed JSON that breaks the tool-calling pipeline, and can’t recover from errors. The result is a frustrating loop of retries that never converges.

Recommended Models by Hardware

RAM / VRAM	Model	Parameters	Agentic Reliability
8GB	Qwen 3 8B, Gemma 3 4B	4B–8B	Conversation only
16GB	Qwen 3 14B, Gemma 3 12B	12B–14B	Simple single-tool calls
32GB	Qwen 3 32B, QwQ 32B	32B	Multi-step workflows
48GB+	Llama 3 70B, Qwen 2.5 72B	70B+	Full autonomous agentic
64GB+ / GPU	DeepSeek-V2, Llama 3.1 405B (quantized)	70B+	Production-grade

Quantization matters. A 70B model quantized to Q4_0 fits in less RAM but loses capability. For agentic tasks, prefer Q5_K_M or higher quantization levels — the precision directly affects tool-call reliability.

The Honest Truth

If you have a standard laptop with 8–16GB RAM, local models will handle conversations, summarization, and simple Q&A well. But for the kind of autonomous multi-step workflows Wolffish excels at — researching topics, writing reports, managing files, executing shell commands in sequence — you’ll get dramatically better results with a cloud provider like Claude or GPT-4. The sweet spot for local-only agentic use:

Mac Studio / Mac Pro with 64GB+ unified memory — Run 70B models at acceptable speed
Desktop with 24GB+ VRAM GPU — Full-speed 70B inference via CUDA
High-end workstation with 128GB RAM — Run quantized 100B+ models

For everyone else, we recommend: cloud providers for complex agentic tasks, Ollama for privacy-sensitive conversations and offline fallback.

Reasoning modes

The brain icon next to the message box controls whether this model reasons. Click it to toggle reasoning on or off for models that support it.

Thinking — whether the model reasons

Off — the model answers immediately. Fastest; ideal for simple, direct tasks.
On — the model first works through the problem in a dedicated reasoning pass before replying. Slower and uses more tokens, but markedly more accurate on multi-step, logical, or ambiguous tasks.

Button states

State	Colour	Meaning
Off	gray	Thinking off — direct answer
On	blue	Thinking on

Each model shows only the states it genuinely supports. If a model can’t reason, the button locks where there’s nothing to change. Wolffish remembers your choice per model. On Ollama: Reasoning is detected per pulled model from Ollama’s capabilities — models that advertise a thinking capability (e.g. qwen3, deepseek-r1, gpt-oss) reason, others don’t. It’s a simple on/off with no effort tiers, and no API key or cost since it runs locally.

Local-Only Mode

Wolffish includes a “Local Only” toggle that restricts all inference to Ollama — no data ever touches a cloud API, no matter which Brain model is selected. Enable it in the chat sidebar when you need absolute privacy for a sensitive task. In local-only mode:

Inference is forced to the local Ollama model regardless of which Brain you’ve selected
No network requests are made for LLM inference
Memory consolidation uses the local model
All other features (memory, tools, capabilities) work normally

Limitations

Speed — Local inference is slower than cloud APIs, especially on CPU-only machines. Expect 5–30 tokens/second depending on model size and hardware, versus 80–150 tokens/second from cloud providers.
Context window — Most local models support 4K–32K context. Cloud models offer 128K–200K. Long conversations or large documents may exceed local model limits.
Tool-call formatting — Smaller models sometimes output malformed tool calls. Wolffish has retry logic, but repeated failures will end the turn.
No computer-use — Computer-use (screen interaction) requires vision capabilities that most local models lack. This capability currently requires Claude.

The Vision

We built Ollama integration because we believe the future of personal AI is local. Today, the best models are cloud-hosted. But model sizes are shrinking while capabilities grow. The gap between a 70B local model and a cloud frontier model narrows with every release. Wolffish is built for that future — where a single machine runs a fully capable AI agent with no internet connection, no subscription, no data leaving your control. Every architectural decision (stateless model, markdown-first, local memory) is designed so that the day a 14B model can reliably execute 20-tool agentic chains, Wolffish is ready. No code changes needed — just swap the model. Until then, use cloud providers for the heavy lifting and Ollama for what it does best: private, offline, always-available local inference.

​Ollama Integration

​Why Ollama?

​How It Works

​Setting Up Ollama

​Model Requirements for Agentic Tasks

​The Parameter Threshold

​Why Small Models Fail at Agentic Tasks

​Recommended Models by Hardware

​The Honest Truth

​Reasoning modes

​Thinking — whether the model reasons

​Button states

​Local-Only Mode

​Limitations

​The Vision

Ollama Integration

Why Ollama?

How It Works

Setting Up Ollama

Model Requirements for Agentic Tasks

The Parameter Threshold

Why Small Models Fail at Agentic Tasks

Recommended Models by Hardware

The Honest Truth

Reasoning modes

Thinking — whether the model reasons

Button states

Local-Only Mode

Limitations

The Vision