Skip to main content

Ollama Integration

Wolffish has complete, first-class integration with Ollama — an open-source local model runtime that lets you run LLMs entirely on your hardware. No API keys, no cloud dependency, no data leaving your machine.

Why Ollama?

The deep goal of Wolffish is to run completely local with zero exposure to the internet. Every piece of data — memory, conversations, skills, task logs — already lives on your machine. The only component that traditionally requires the cloud is the LLM itself. Ollama closes that gap. For power users with capable hardware, this means:
  • Total privacy — Your prompts, responses, and tool outputs never leave your machine
  • Zero recurring cost — No API bills, no token counting, no rate limits
  • Offline capability — Full agentic workflows on an airplane, in a bunker, or behind an air-gapped network
  • No vendor dependency — Your agent works regardless of API outages, pricing changes, or service discontinuation
This is the vision: a fully autonomous personal AI agent that operates as a local process on your own hardware, answering to no one but you.

How It Works

Wolffish communicates with Ollama via its local HTTP API:
POST http://localhost:11434/api/chat
When you install Ollama and pull a model, Wolffish can:
  1. Detect Ollama automatically on first launch (or later in Settings)
  2. Browse available models on your machine
  3. Pull new models directly from the Wolffish UI — no terminal needed
  4. Stream responses using NDJSON streaming
  5. Call tools via structured JSON in the model’s response
The integration is seamless — select a local model and start chatting. Wolffish handles the message formatting, tool-call parsing, and streaming normalization internally.

Setting Up Ollama

1

Install Ollama

Download from ollama.com and install. On macOS it’s a single .dmg, on Linux a one-line curl command, on Windows a standard installer.
2

Pull a model

Either use the terminal (ollama pull qwen3:14b) or let Wolffish pull it for you from Settings → Models → Ollama.
3

Select in Wolffish

Open Settings → Models, select the Ollama tab, and choose your model. That’s it — you’re running local.
Ollama is optional. You can skip it entirely during onboarding and use only cloud providers (Claude, OpenAI). Wolffish will prompt you to configure at least one provider before you can start chatting.

Model Requirements for Agentic Tasks

Not all local models are equal. Wolffish’s agentic capabilities — tool calling, multi-step reasoning, code execution, file manipulation — place specific demands on the model. Here’s what you need to know:

The Parameter Threshold

Model SizeConversationalSimple ToolsMulti-Step AgenticReliable Autonomous
1B–3BBasic chat onlyUnreliableNoNo
7B–8BGoodInconsistentStrugglesNo
14BGoodMostly reliableBasic chainsFragile
32B–35BExcellentReliableHandles wellSometimes
70B+ExcellentReliableReliableYes
The minimum for reliable agentic tool use is ~14B parameters, but even then, complex multi-step workflows (research → write → format → post) will hit failure modes. For truly autonomous execution — where the agent chains 10+ tool calls without human intervention — you need 32B+ parameters at minimum, and 70B+ for production-level reliability.

Why Small Models Fail at Agentic Tasks

Tool calling requires the model to:
  1. Understand the instruction — Parse what the user wants accomplished
  2. Plan the sequence — Decide which tools to call, in what order
  3. Format tool calls correctly — Output valid JSON with correct parameter names and types
  4. Interpret tool results — Read the output and decide the next action
  5. Maintain context across turns — Remember what it’s already done across a multi-step chain
  6. Handle errors gracefully — Retry, adjust, or ask for help when a tool fails
Small models (7B and below) typically fail at steps 3–6. They hallucinate parameter names, lose track of multi-step plans, output malformed JSON that breaks the tool-calling pipeline, and can’t recover from errors. The result is a frustrating loop of retries that never converges.
RAM / VRAMModelParametersAgentic Reliability
8GBQwen 3 8B, Gemma 3 4B4B–8BConversation only
16GBQwen 3 14B, Gemma 3 12B12B–14BSimple single-tool calls
32GBQwen 3 32B, QwQ 32B32BMulti-step workflows
48GB+Llama 3 70B, Qwen 2.5 72B70B+Full autonomous agentic
64GB+ / GPUDeepSeek-V2, Llama 3.1 405B (quantized)70B+Production-grade
Quantization matters. A 70B model quantized to Q4_0 fits in less RAM but loses capability. For agentic tasks, prefer Q5_K_M or higher quantization levels — the precision directly affects tool-call reliability.

The Honest Truth

If you have a standard laptop with 8–16GB RAM, local models will handle conversations, summarization, and simple Q&A well. But for the kind of autonomous multi-step workflows Wolffish excels at — researching topics, writing reports, managing files, executing shell commands in sequence — you’ll get dramatically better results with a cloud provider like Claude or GPT-4. The sweet spot for local-only agentic use:
  • Mac Studio / Mac Pro with 64GB+ unified memory — Run 70B models at acceptable speed
  • Desktop with 24GB+ VRAM GPU — Full-speed 70B inference via CUDA
  • High-end workstation with 128GB RAM — Run quantized 100B+ models
For everyone else, we recommend: cloud providers for complex agentic tasks, Ollama for privacy-sensitive conversations and offline fallback.

Reasoning modes

The brain icon next to the message box controls whether this model reasons. Click it to toggle reasoning on or off for models that support it.

Thinking — whether the model reasons

  • Off — the model answers immediately. Fastest; ideal for simple, direct tasks.
  • On — the model first works through the problem in a dedicated reasoning pass before replying. Slower and uses more tokens, but markedly more accurate on multi-step, logical, or ambiguous tasks.

Button states

StateColourMeaning
OffgrayThinking off — direct answer
OnblueThinking on
Each model shows only the states it genuinely supports. If a model can’t reason, the button locks where there’s nothing to change. Wolffish remembers your choice per model. On Ollama: Reasoning is detected per pulled model from Ollama’s capabilities — models that advertise a thinking capability (e.g. qwen3, deepseek-r1, gpt-oss) reason, others don’t. It’s a simple on/off with no effort tiers, and no API key or cost since it runs locally.

Local-Only Mode

Wolffish includes a “Local Only” toggle that restricts all inference to Ollama — no data ever touches a cloud API, no matter which Brain model is selected. Enable it in the chat sidebar when you need absolute privacy for a sensitive task. In local-only mode:
  • Inference is forced to the local Ollama model regardless of which Brain you’ve selected
  • No network requests are made for LLM inference
  • Memory consolidation uses the local model
  • All other features (memory, tools, capabilities) work normally

Limitations

  • Speed — Local inference is slower than cloud APIs, especially on CPU-only machines. Expect 5–30 tokens/second depending on model size and hardware, versus 80–150 tokens/second from cloud providers.
  • Context window — Most local models support 4K–32K context. Cloud models offer 128K–200K. Long conversations or large documents may exceed local model limits.
  • Tool-call formatting — Smaller models sometimes output malformed tool calls. Wolffish has retry logic, but repeated failures will end the turn.
  • No computer-use — Computer-use (screen interaction) requires vision capabilities that most local models lack. This capability currently requires Claude.

The Vision

We built Ollama integration because we believe the future of personal AI is local. Today, the best models are cloud-hosted. But model sizes are shrinking while capabilities grow. The gap between a 70B local model and a cloud frontier model narrows with every release. Wolffish is built for that future — where a single machine runs a fully capable AI agent with no internet connection, no subscription, no data leaving your control. Every architectural decision (stateless model, markdown-first, local memory) is designed so that the day a 14B model can reliably execute 20-tool agentic chains, Wolffish is ready. No code changes needed — just swap the model. Until then, use cloud providers for the heavy lifting and Ollama for what it does best: private, offline, always-available local inference.