Ollama Integration
Wolffish has complete, first-class integration with Ollama — an open-source local model runtime that lets you run LLMs entirely on your hardware. No API keys, no cloud dependency, no data leaving your machine.Why Ollama?
The deep goal of Wolffish is to run completely local with zero exposure to the internet. Every piece of data — memory, conversations, skills, task logs — already lives on your machine. The only component that traditionally requires the cloud is the LLM itself. Ollama closes that gap. For power users with capable hardware, this means:- Total privacy — Your prompts, responses, and tool outputs never leave your machine
- Zero recurring cost — No API bills, no token counting, no rate limits
- Offline capability — Full agentic workflows on an airplane, in a bunker, or behind an air-gapped network
- No vendor dependency — Your agent works regardless of API outages, pricing changes, or service discontinuation
How It Works
Wolffish communicates with Ollama via its local HTTP API:- Detect Ollama automatically on first launch (or later in Settings)
- Browse available models on your machine
- Pull new models directly from the Wolffish UI — no terminal needed
- Stream responses using NDJSON streaming
- Call tools via structured JSON in the model’s response
Setting Up Ollama
Install Ollama
Download from ollama.com and install. On macOS it’s a single
.dmg, on Linux a one-line curl command, on Windows a standard installer.Pull a model
Either use the terminal (
ollama pull qwen3:14b) or let Wolffish pull it for you from Settings → Models → Ollama.Ollama is optional. You can skip it entirely during onboarding and use only cloud providers (Claude, OpenAI). Wolffish will prompt you to configure at least one provider before you can start chatting.
Model Requirements for Agentic Tasks
Not all local models are equal. Wolffish’s agentic capabilities — tool calling, multi-step reasoning, code execution, file manipulation — place specific demands on the model. Here’s what you need to know:The Parameter Threshold
| Model Size | Conversational | Simple Tools | Multi-Step Agentic | Reliable Autonomous |
|---|---|---|---|---|
| 1B–3B | Basic chat only | Unreliable | No | No |
| 7B–8B | Good | Inconsistent | Struggles | No |
| 14B | Good | Mostly reliable | Basic chains | Fragile |
| 32B–35B | Excellent | Reliable | Handles well | Sometimes |
| 70B+ | Excellent | Reliable | Reliable | Yes |
Why Small Models Fail at Agentic Tasks
Tool calling requires the model to:- Understand the instruction — Parse what the user wants accomplished
- Plan the sequence — Decide which tools to call, in what order
- Format tool calls correctly — Output valid JSON with correct parameter names and types
- Interpret tool results — Read the output and decide the next action
- Maintain context across turns — Remember what it’s already done across a multi-step chain
- Handle errors gracefully — Retry, adjust, or ask for help when a tool fails
Recommended Models by Hardware
| RAM / VRAM | Model | Parameters | Agentic Reliability |
|---|---|---|---|
| 8GB | Qwen 3 8B, Gemma 3 4B | 4B–8B | Conversation only |
| 16GB | Qwen 3 14B, Gemma 3 12B | 12B–14B | Simple single-tool calls |
| 32GB | Qwen 3 32B, QwQ 32B | 32B | Multi-step workflows |
| 48GB+ | Llama 3 70B, Qwen 2.5 72B | 70B+ | Full autonomous agentic |
| 64GB+ / GPU | DeepSeek-V2, Llama 3.1 405B (quantized) | 70B+ | Production-grade |
The Honest Truth
If you have a standard laptop with 8–16GB RAM, local models will handle conversations, summarization, and simple Q&A well. But for the kind of autonomous multi-step workflows Wolffish excels at — researching topics, writing reports, managing files, executing shell commands in sequence — you’ll get dramatically better results with a cloud provider like Claude or GPT-4. The sweet spot for local-only agentic use:- Mac Studio / Mac Pro with 64GB+ unified memory — Run 70B models at acceptable speed
- Desktop with 24GB+ VRAM GPU — Full-speed 70B inference via CUDA
- High-end workstation with 128GB RAM — Run quantized 100B+ models
Reasoning modes
The brain icon next to the message box controls whether this model reasons. Click it to toggle reasoning on or off for models that support it.Thinking — whether the model reasons
- Off — the model answers immediately. Fastest; ideal for simple, direct tasks.
- On — the model first works through the problem in a dedicated reasoning pass before replying. Slower and uses more tokens, but markedly more accurate on multi-step, logical, or ambiguous tasks.
Button states
| State | Colour | Meaning |
|---|---|---|
| Off | gray | Thinking off — direct answer |
| On | blue | Thinking on |
Local-Only Mode
Wolffish includes a “Local Only” toggle that restricts all inference to Ollama — no data ever touches a cloud API, no matter which Brain model is selected. Enable it in the chat sidebar when you need absolute privacy for a sensitive task. In local-only mode:- Inference is forced to the local Ollama model regardless of which Brain you’ve selected
- No network requests are made for LLM inference
- Memory consolidation uses the local model
- All other features (memory, tools, capabilities) work normally
Limitations
- Speed — Local inference is slower than cloud APIs, especially on CPU-only machines. Expect 5–30 tokens/second depending on model size and hardware, versus 80–150 tokens/second from cloud providers.
- Context window — Most local models support 4K–32K context. Cloud models offer 128K–200K. Long conversations or large documents may exceed local model limits.
- Tool-call formatting — Smaller models sometimes output malformed tool calls. Wolffish has retry logic, but repeated failures will end the turn.
- No computer-use — Computer-use (screen interaction) requires vision capabilities that most local models lack. This capability currently requires Claude.