Cost Optimization Tutorial¶
Learn how to build cost-effective AI workflows by selecting the right tools based on budget constraints and performance tradeoffs.
What You'll Learn¶
By the end of this tutorial, you'll understand:
- When to optimize for cost vs speed vs reliability
- How to configure budget constraints for tool selection
- Tradeoffs between expensive premium tools and cheap alternatives
- Real-world cost optimization strategies
Prerequisites¶
- Basic understanding of ToolWeaver tool discovery
- Familiarity with tool registration
The Cost Problem¶
Imagine you're processing 1,000 receipts per day:
| Tool | Cost per Call | Total Daily Cost |
|---|---|---|
| GPT-4 Vision | $0.10 | $100/day ($3,000/month) |
| Claude Vision | $0.05 | $50/day ($1,500/month) |
| Local OCR | $0.01 | $10/day ($300/month) |
Question: Which tool should you use?
Answer: It depends on your constraints:

- Tight budget? Use local OCR (10x cheaper)
- Need accuracy? Use GPT-4 (but expensive)
- Balance both? Use Claude (2x cheaper than GPT-4, still accurate)
ToolWeaver's CostOptimizer helps you make this decision automatically based on configurable weights.
Core Concepts¶
1. Cost-Quality Tradeoff¶
Every tool has three dimensions:
- Cost - How much per API call ($0.01 to $0.50+)
- Latency - How long it takes (100ms to 30s)
- Reliability - Success rate (70% to 99.9%)
You can't optimize all three simultaneously—you must choose priorities.
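To make this concrete, imagine collapsing the three dimensions into a single comparable score per tool. A minimal sketch, assuming a simple linear normalization (the function and bounds here are illustrative, not ToolWeaver's actual internals):

```python
# Illustrative only: ToolWeaver's real scoring may differ.
def score_tool(meta: dict, cost_w: float, latency_w: float, rel_w: float) -> float:
    max_cost, max_latency_ms = 0.10, 3000  # bounds taken from the tool table above
    cost_score = 1 - meta["cost_per_call"] / max_cost                 # cheaper → higher
    latency_score = 1 - meta["expected_latency_ms"] / max_latency_ms  # faster → higher
    return cost_w * cost_score + latency_w * latency_score + rel_w * meta["success_rate"]

local = {"cost_per_call": 0.01, "expected_latency_ms": 3000, "success_rate": 0.85}
gpt4 = {"cost_per_call": 0.10, "expected_latency_ms": 500, "success_rate": 0.99}

print(score_tool(local, 0.8, 0.1, 0.1))  # 0.805 — cost-heavy weights favor local OCR
print(score_tool(gpt4, 0.8, 0.1, 0.1))   # ≈0.18
print(score_tool(local, 0.1, 0.2, 0.7))  # ≈0.69 — reliability-heavy weights flip it
print(score_tool(gpt4, 0.1, 0.2, 0.7))   # ≈0.86
```

Changing the weights is enough to flip the winner, which is exactly what the selection weights below control.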
2. Selection Weights¶
Configure how much each dimension matters:
```python
SelectionConfig(
    cost_weight=0.7,         # 70% priority on cost
    latency_weight=0.2,      # 20% priority on speed
    reliability_weight=0.1,  # 10% priority on accuracy
)
```
The three weights must sum to 1.0.
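A quick sanity check you might add to your own configuration code (a hedged sketch; SelectionConfig may or may not validate this itself):

```python
import math

weights = {"cost": 0.7, "latency": 0.2, "reliability": 0.1}

# Guard against silently skewed selection: the weights are relative
# priorities, so they should sum to exactly 1.0.
assert math.isclose(sum(weights.values()), 1.0), "selection weights must sum to 1.0"
```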
3. Budget Constraints¶
Set hard limits:
```python
SelectionConfig(
    cost_budget=0.02,           # Never spend more than $0.02/call
    latency_budget=2000,        # Never wait more than 2s
    reliability_threshold=0.9,  # Require 90%+ success rate
)
```
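Budgets act as hard filters rather than soft preferences: a tool that violates any budget is excluded before scoring. A minimal sketch of that filtering step (illustrative only; the keys mirror the tool metadata used later in this tutorial, not ToolWeaver's internals):

```python
def within_budget(meta: dict, cost_budget: float | None = None,
                  latency_budget: float | None = None,
                  reliability_threshold: float | None = None) -> bool:
    """Return True only if the tool satisfies every hard constraint that is set."""
    if cost_budget is not None and meta["cost_per_call"] > cost_budget:
        return False
    if latency_budget is not None and meta["expected_latency_ms"] > latency_budget:
        return False
    if reliability_threshold is not None and meta["success_rate"] < reliability_threshold:
        return False
    return True

# With cost_budget=0.02, only local OCR ($0.01/call) survives the filter.
```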
When to Use Cost Optimization¶
✅ Use Cost Optimization When:¶
- High-volume processing (100+ operations/day; see the break-even sketch after these lists)
  - Batch document processing
  - Receipt scanning at scale
  - Log analysis pipelines
- Predictable workloads (similar tasks repeated)
  - Daily report generation
  - Scheduled data extraction
  - Recurring classification
- Budget constraints (fixed monthly spend)
  - Startup prototyping
  - Free-tier development
  - Cost-sensitive production
❌ Don't Optimize Cost When:¶
- Accuracy is critical (medical, financial, legal)
  - Always use the best model available
  - Cost is secondary to correctness
- One-off tasks (manual analysis)
  - Optimization overhead not worth it
  - Just use your default model
- Real-time requirements (latency < 500ms)
  - Speed matters more than cost
  - Use the fastest available tool
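To see why volume matters, run the numbers. A quick back-of-the-envelope calculation (per-call prices reuse the table above; the break-even framing is illustrative):

```python
# Monthly savings from switching GPT-4 Vision ($0.10) to local OCR ($0.01),
# at different daily volumes (30-day month).
premium, budget = 0.10, 0.01
for per_day in (10, 100, 1_000):
    monthly_savings = (premium - budget) * per_day * 30
    print(f"{per_day:>5}/day → ${monthly_savings:,.0f}/month saved")

# 10/day → $27/month: probably not worth the engineering effort.
# 1,000/day → $2,700/month: clearly worth optimizing.
```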
Practical Example: Receipt Processing¶
Let's build a cost-optimized receipt processor that adapts to workload.
Step 1: Define Tool Options¶
Register three vision tools with different cost profiles:
```python
from orchestrator.selection.registry import ToolRegistry
from orchestrator.shared.models import ToolDefinition

registry = ToolRegistry()

# Premium: Fast and accurate, but expensive
gpt4_vision = ToolDefinition(
    name="gpt4_vision",
    description="Premium vision analysis with GPT-4",
    metadata={
        "cost_per_call": 0.10,
        "expected_latency_ms": 500,
        "success_rate": 0.99,
    },
)

# Balanced: Good performance at medium cost
claude_vision = ToolDefinition(
    name="claude_vision",
    description="Balanced vision analysis with Claude",
    metadata={
        "cost_per_call": 0.05,
        "expected_latency_ms": 1500,
        "success_rate": 0.95,
    },
)

# Budget: Slow and less accurate, but cheap
local_ocr = ToolDefinition(
    name="local_ocr",
    description="Local OCR with Tesseract",
    metadata={
        "cost_per_call": 0.01,
        "expected_latency_ms": 3000,
        "success_rate": 0.85,
    },
)

registry.register(gpt4_vision)
registry.register(claude_vision)
registry.register(local_ocr)
```
Step 2: Configure Selection Criteria¶
Scenario A: Tight Budget (Startup)
```python
from orchestrator.selection.registry import SelectionConfig

# Minimize cost, tolerate slower/less accurate results
startup_config = SelectionConfig(
    cost_weight=0.8,
    latency_weight=0.1,
    reliability_weight=0.1,
    cost_budget=0.02,  # Never exceed $0.02/call
)

tool = registry.get_best_tool(startup_config)
print(f"Selected: {tool.name}")  # → local_ocr
print(f"Cost: ${tool.metadata['cost_per_call']}")  # → $0.01
```
Scenario B: Production (Balanced)
```python
# Balance cost and accuracy
production_config = SelectionConfig(
    cost_weight=0.4,
    latency_weight=0.2,
    reliability_weight=0.4,
    cost_budget=0.08,  # Can spend up to $0.08/call
)

tool = registry.get_best_tool(production_config)
print(f"Selected: {tool.name}")  # → claude_vision
print(f"Cost: ${tool.metadata['cost_per_call']}")  # → $0.05
```
Scenario C: Critical Accuracy (Enterprise)
```python
# Prioritize accuracy over cost
enterprise_config = SelectionConfig(
    cost_weight=0.1,
    latency_weight=0.2,
    reliability_weight=0.7,
    reliability_threshold=0.95,  # Require 95%+ success rate
)

tool = registry.get_best_tool(enterprise_config)
print(f"Selected: {tool.name}")  # → gpt4_vision
print(f"Accuracy: {tool.metadata['success_rate']}")  # → 0.99
```
Step 3: Adaptive Selection¶
Change selection dynamically based on workload:
```python
def select_tool_for_batch(batch_size: int) -> ToolDefinition:
    """Select tool based on batch size."""
    if batch_size > 1000:
        # High volume → use cheapest
        config = SelectionConfig(
            cost_weight=0.9, latency_weight=0.05, reliability_weight=0.05,
            cost_budget=0.02,
        )
    elif batch_size > 100:
        # Medium volume → balance cost and accuracy
        config = SelectionConfig(cost_weight=0.5, reliability_weight=0.5)
    else:
        # Low volume → prioritize accuracy
        config = SelectionConfig(
            reliability_weight=0.8, latency_weight=0.1, cost_weight=0.1,
        )
    return registry.get_best_tool(config)

# Process 500 receipts
tool = select_tool_for_batch(500)
print(f"Processing 500 items with: {tool.name}")  # → claude_vision
```
Cost-Saving Strategies¶
Strategy 1: Tiered Fallback Chain¶
Use expensive tools only when cheap ones fail:
```python
# Try the cheap tool first; fall back to the expensive one on low confidence
async def process_with_fallback(receipt):
    cheap_result = await execute_tool(local_ocr, receipt)
    if cheap_result.confidence < 0.8:
        # Low confidence → retry with the premium tool
        return await execute_tool(gpt4_vision, receipt)
    return cheap_result  # Saved $0.09!
```
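Whether this saves money depends on how often the cheap tool is good enough. A quick expected-cost check, assuming (hypothetically) that local OCR clears the 0.8 confidence bar on 80% of receipts:

```python
p_cheap_ok = 0.80      # assumed fraction local OCR handles confidently (not measured)
cheap, premium = 0.01, 0.10

# Failed cheap attempts still cost money: the fallback path pays for both calls.
expected_cost = p_cheap_ok * cheap + (1 - p_cheap_ok) * (cheap + premium)
print(f"${expected_cost:.3f} per receipt")  # $0.030 vs $0.10 premium-only (70% savings)
```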
Strategy 2: Batch Processing¶
Aggregate multiple requests to reduce per-call overhead:
```python
# Instead of 100 individual calls ($10):
for receipt in receipts:
    process(receipt)  # $0.10 × 100 = $10

# Use batch processing ($3):
batch_results = await process_batch(receipts)  # $0.03 × 100 = $3
```
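If your provider exposes a batch endpoint, the aggregation itself is simple. A sketch of one way `process_batch` might be implemented; `call_batch_api` and the chunk size are hypothetical placeholders:

```python
import asyncio

CHUNK_SIZE = 50  # whatever batch size the provider's endpoint accepts (assumption)

async def process_batch(receipts):
    """Split receipts into chunks and submit each chunk as a single batched call."""
    chunks = [receipts[i:i + CHUNK_SIZE] for i in range(0, len(receipts), CHUNK_SIZE)]
    # One API call per chunk instead of one per receipt
    chunk_results = await asyncio.gather(*(call_batch_api(c) for c in chunks))
    return [item for result in chunk_results for item in result]
```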
Strategy 3: Caching¶
Cache expensive operations:
```python
from orchestrator.infra.redis_cache import RedisCache

cache = RedisCache(ttl=3600)  # 1 hour

@cache.cached("receipt-{receipt_id}")
async def process_receipt(receipt_id: str):
    # First call: $0.10
    # Subsequent calls within 1 hour: $0 (cached)
    return await gpt4_vision.execute(receipt_id)
```
Savings: with 10 cache hits per unique receipt, you pay for 1 call in 11, roughly 90%.
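More generally, the effective cost scales with the cache miss rate (a one-line model that ignores cache hosting costs):

```python
def effective_cost(cost_per_call: float, hit_rate: float) -> float:
    """Average cost per request when hit_rate of requests are served from cache."""
    return cost_per_call * (1 - hit_rate)

print(f"${effective_cost(0.10, 10 / 11):.3f}")  # ≈ $0.009/request, ~91% savings
```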
Measuring Cost Savings¶
Track actual vs potential costs:
```python
from orchestrator.monitoring.cost_tracker import CostTracker

tracker = CostTracker()

# Track actual selection
actual_tool = registry.get_best_tool(startup_config)
actual_cost = actual_tool.metadata["cost_per_call"]
tracker.log_selection(actual_tool, actual_cost)

# Compare to baseline (if we always used premium)
baseline_tool = registry.get_tool("gpt4_vision")
baseline_cost = baseline_tool.metadata["cost_per_call"]

savings = baseline_cost - actual_cost
savings_pct = (savings / baseline_cost) * 100

print(f"Actual cost: ${actual_cost:.2f}")
print(f"Baseline cost: ${baseline_cost:.2f}")
print(f"Savings: ${savings:.2f} ({savings_pct:.0f}%)")
```
Example Output:

```text
Actual cost: $0.01
Baseline cost: $0.10
Savings: $0.09 (90%)
```
Real-World Scenarios¶
Scenario 1: Startup with $100/month Budget¶
Goal: Process as many receipts as possible within budget.
```python
monthly_budget = 100.00
receipts_per_month = 5000
max_cost_per_call = monthly_budget / receipts_per_month  # $0.02

config = SelectionConfig(
    cost_weight=0.95,
    latency_weight=0.025,
    reliability_weight=0.025,
    cost_budget=max_cost_per_call,
)
tool = registry.get_best_tool(config)  # → local_ocr ($0.01)

# Can process 10,000 receipts instead of 5,000!
actual_capacity = monthly_budget / tool.metadata["cost_per_call"]
print(f"Monthly capacity: {actual_capacity:.0f} receipts")  # 10,000
```
Scenario 2: Enterprise with SLA Requirements¶
Goal: 99% accuracy, process within 2 seconds.
```python
config = SelectionConfig(
    reliability_weight=0.7,
    latency_weight=0.3,
    reliability_threshold=0.99,
    latency_budget=2000,  # 2 seconds
)
tool = registry.get_best_tool(config)  # → gpt4_vision
# Costs more, but meets SLA
```
Scenario 3: Hybrid Workload¶
Goal: Mix of simple (cheap) and complex (expensive) receipts.
```python
def classify_complexity(receipt) -> str:
    # Simple: only a few line items
    if len(receipt.items) < 5:
        return "simple"
    # Complex: many items or poor image quality
    return "complex"

async def process_adaptive(receipt):
    complexity = classify_complexity(receipt)
    if complexity == "simple":
        # Use the cheap tool for simple cases
        config = SelectionConfig(
            cost_weight=0.9, latency_weight=0.05, reliability_weight=0.05,
        )
    else:
        # Use the accurate tool for complex cases
        config = SelectionConfig(
            reliability_weight=0.9, latency_weight=0.05, cost_weight=0.05,
        )
    tool = registry.get_best_tool(config)
    return await tool.execute(receipt)

# Average cost: ~$0.03 instead of $0.10 (≈70% savings)
```
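That average depends entirely on the simple/complex mix. For example, with a (hypothetical) 78% simple share:

```python
p_simple = 0.78  # assumed share of simple receipts (measure your own mix)
avg_cost = p_simple * 0.01 + (1 - p_simple) * 0.10  # local OCR vs GPT-4 Vision pricing
print(f"${avg_cost:.3f} per receipt")  # ≈ $0.030, ~70% below the $0.10 baseline
```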
Best Practices¶
✅ Do's¶
- Profile your tools - Measure actual cost/latency/accuracy
- Set realistic budgets - Don't optimize to unusable quality
- Cache aggressively - 80%+ cache hit rate = 80% cost savings
- Use fallbacks - Try cheap first, fallback to expensive
- Monitor actual costs - Track daily/monthly spend
❌ Don'ts¶
- Don't sacrifice accuracy for pennies - Incorrect results cost more to fix
- Don't over-optimize - Diminishing returns below $0.01/call
- Don't ignore latency - Slow tools frustrate users
- Don't hardcode tool names - Use selection config for flexibility
- Don't skip error handling - Failed calls still cost money
Next Steps¶
- How-To Guide: Optimize Tool Costs - Step-by-step implementation
- Deep Dive: Hybrid Model Routing - Two-model architecture
- Sample: 27-cost-optimization - Working example
Related Topics¶
- Error Recovery - Handle failures without wasting money
- Caching Deep Dive - Reduce costs with caching
- Two-Model Architecture - GPT-4 for planning, Phi-3 for execution