Skip to content

Two-Model Architecture

Simple Explanation

Use the big model to decide, and the small models/tools to do. One smart plan, many cheap steps.

Technical Explanation

Planner (GPT-4o/Claude) generates a DAG once. Executors (Phi-3/Llama, MCP/tools/sandboxed code) run steps with retries/fallbacks and caching. Hybrid search narrows tools; guardrails keep cost, time, and failures controlled.

ToolWeaver separates planning (large models) from execution (small models) to cut cost and latency.

  • Planner: GPT-4o/Claude turn user intent into structured plans (1 call/request).
  • Executors: Phi-3/Llama or deterministic tools perform parsing, routing, and enrichment (many cheap calls).
  • Cost impact: Up to 80–90% savings possible vs all-large-model by pushing repetitive work to small models + caching (varies by task complexity mix).
  • Safety: planners see only necessary context; executors run in sandbox/process isolation.

Execution flow 1) Plan: large model generates DAG with tool calls. 2) Search: registry search narrows tools (BM25 + embeddings). 3) Execute: small models/tools run steps; retries/fallbacks applied. 4) Aggregate: results merged; planner re-engaged only if needed.

When to use - Workloads with many atomic steps (extraction, classification, routing). - Cost-sensitive pipelines needing predictable spend.

Pitfalls / tips - Keep tool schemas tight to reduce planner token use. - Use caching (prompt/query/embedding) to avoid re-planning on similar inputs. - Prefer deterministic tools where possible; use small models for light reasoning, not heavy synthesis.