Full History, Every Request
Wolffish uses a stateless, full-history-per-request model with every cloud provider. There are no persistent sessions, no thread IDs, no server-side conversation state. Each API call is self-contained and independent. This isn’t a limitation — it’s the architecture’s most important property.How It Works
Every turn, the agent assembles the complete message array from scratch:- Prefrontal builds the system prompt (identity, agents, memories, device context, skills)
- RAS filters and budget-allocates tokens across context categories
- The agent loads the full conversation history and appends it to the request
- Thalamus routes to the active provider, which transforms messages to the provider’s native format and sends a single HTTP request
Within a Single Turn
The agent runs a tool-use loop: call the model, execute tools, append results, call the model again. Each iteration sends the growing message array back to the provider:Provider-Specific Request Format
Each provider receives the same logical content, transformed to its native API:| Provider | Endpoint | System Prompt | Tool Results |
|---|---|---|---|
| DeepSeek | /chat/completions | First message in the array | OpenAI-compatible role: "tool" messages |
| Anthropic | /v1/messages | Separate system field | Coalesced into user-role content blocks |
| OpenAI | /v1/chat/completions | First message in the array | role: "tool" messages with tool_call_id |
| Ollama | /api/chat | Flat message array | Documents converted to placeholder text |
Prompt Caching
Sending the full history every call sounds expensive. It isn’t — because of prompt caching.Anthropic
The Anthropic provider uses prompt caching with threecache_control breakpoints:
- System prompt — The prefrontal context is large and nearly identical across turns
- Tool definitions — Stable within a conversation
- Conversation history prefix — The second-to-last user turn, marking the boundary between stable history and the latest exchange
OpenAI
OpenAI applies its own automatic prefix caching transparently (50% input discount on cache hits). No opt-in is needed.DeepSeek
DeepSeek applies its own prefix caching with a 75% input discount on cache hits, making it the most cost-efficient provider for long conversations and multi-step agentic workflows.Ollama
Ollama runs locally and has no caching layer.Why Stateless
Three properties depend on the stateless design, and losing any of them would compromise the architecture.Model Switching
Thalamus calls whichever Brain model you selected — DeepSeek, Anthropic, OpenAI, or Ollama — and there’s no automatic cascade between them. But because the app owns the full message array, switching your Brain model mid-conversation just works: the new model gets the complete history on its very next turn, with no thread to migrate. (Orchestrator mode leans on the same property — each worker is handed a complete, self-contained context.) Thread-based APIs would make that impossible.Privacy
No provider retains conversation state between calls. Each request is isolated. If a provider is swapped mid-conversation, the previous provider has nothing. Your conversation history lives on your machine, not on someone else’s server.Context Control
Because the app rebuilds context each turn through prefrontal and RAS, it controls exactly what the model sees. Token budgets, relevance filtering, memory selection — all happen before the request leaves your machine. Thread-based APIs accumulate everything the model has ever seen, and that granularity is lost.Persistence Is App-Side Only
Conversations are persisted locally at~/.wolffish/workspace/brain/conversations/ as JSON files containing the full message history, streaming segments, tool timings, and attachments. This is purely local — providers never see a conversation ID or resume from stored state.
The app is the source of truth. The providers are stateless compute.