# Production Deployment Guide
Covers security hardening, performance, monitoring, scaling, and troubleshooting for production rollouts.
## Prerequisites
- Python: 3.11+
- RAM: 2 GB minimum (4 GB recommended if you run local embeddings)
- CPU: 2+ cores recommended for parallel tool execution
- Storage: ~1 GB for models and cache
### Dependencies (baseline)

Quote the version specifiers so the shell doesn't interpret `>=` as a redirection:

```bash
pip install "pydantic>=2.0"
pip install "aiohttp>=3.13"

# Azure OpenAI (required)
pip install "azure-identity>=1.12"
pip install "openai>=1.0"

# Semantic search (optional but recommended)
pip install "sentence-transformers>=5.0"
pip install "rank-bm25>=0.2"
pip install "torch>=2.0"
pip install "scikit-learn>=1.8"

# Testing (dev only)
pip install "pytest>=9.0"
pip install "pytest-asyncio>=1.3"
pip install "pytest-mock>=3.15"
```
## Security Hardening

### Code execution sandbox

Sandboxing is already implemented via `ProgrammaticToolExecutor`; make sure timeouts and call limits are set explicitly.

```python
from orchestrator.programmatic_executor import ProgrammaticToolExecutor

# Cap both runtime and tool-call volume per execution.
executor = ProgrammaticToolExecutor(
    tool_catalog=catalog, mcp_client=mcp_client, timeout=30, max_tool_calls=100
)
```

The sandbox blocks file-system access, network access, process spawning, `eval`/`exec`, and dangerous builtins.
### Azure AD authentication

Prefer Managed Identity/Workload Identity over API keys.

```python
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
```

Use RBAC and token auto-renewal; avoid embedding keys.
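As an illustration, the credential can be wired to the Azure OpenAI client through a bearer-token provider (available in recent `azure-identity` releases), which handles token renewal automatically. The endpoint and API version below are placeholders to adapt to your resource:

```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# The provider fetches and refreshes Azure AD tokens; no API key is stored.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    azure_ad_token_provider=token_provider,
    api_version="2024-06-01",  # pick the version your deployment supports
)
```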
### Input validation

Use Pydantic models with length limits and pattern checks where needed.
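For example, a request model along these lines (field names are illustrative, not part of the project's API) rejects oversized or malformed input before it reaches a tool:

```python
from pydantic import BaseModel, Field, field_validator

class ToolRequest(BaseModel):
    # Illustrative fields; adjust to your actual request schema.
    tool_name: str = Field(min_length=1, max_length=64, pattern=r"^[a-zA-Z0-9_\-]+$")
    query: str = Field(max_length=4096)

    @field_validator("query")
    @classmethod
    def no_control_chars(cls, v: str) -> str:
        # Reject control characters that have no business in a query string.
        if any(ord(c) < 32 and c not in "\t\n\r" for c in v):
            raise ValueError("control characters not allowed")
        return v
```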
### Resource limits

Set per-call timeouts, max concurrency, and rate limits (Redis or in-memory). Example: `TOOL_TIMEOUT=30`, `MAX_CONCURRENT_TOOLS=10`, `REQUESTS_PER_MINUTE=60`. A minimal sketch of the timeout and concurrency cap follows below.
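This sketch enforces the cap with an asyncio semaphore; `executor.execute(call)` is a stand-in for your actual tool invocation, and distributed rate limiting would sit in Redis rather than here:

```python
import asyncio

MAX_CONCURRENT_TOOLS = 10
_tool_slots = asyncio.Semaphore(MAX_CONCURRENT_TOOLS)

async def run_tool_limited(executor, call, timeout=30):
    # At most MAX_CONCURRENT_TOOLS coroutines enter this block at once.
    async with _tool_slots:
        # Enforce the per-call timeout regardless of what the tool does.
        return await asyncio.wait_for(executor.execute(call), timeout=timeout)
```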
## Performance

### Lazy loading

Defer heavy inits (e.g., search engine) until first use.
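One common pattern is a cached property, so the expensive object is built only on first access. A sketch, where `SearchEngine` stands in for whichever heavy component you defer:

```python
from functools import cached_property

class Orchestrator:
    @cached_property
    def search_engine(self):
        # Constructed on first access, then reused; loading embedding
        # models here keeps process startup fast.
        from my_search import SearchEngine  # placeholder import
        return SearchEngine()
```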
### Caching

Use multi-level caching (discovery, embeddings, queries, prompt cache). Configure TTLs to balance freshness against recomputation cost.
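For illustration, a minimal in-memory TTL cache; production deployments would typically back embeddings and discovery results with Redis instead:

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data: dict = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._data[key]  # expired; evict lazily on read
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)

# Example TTLs: embeddings change rarely, discovery results more often.
embedding_cache = TTLCache(ttl_seconds=3600)
discovery_cache = TTLCache(ttl_seconds=300)
```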
### Connection pooling

Reuse aiohttp sessions with sensible connection limits when calling external services.
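A sketch of a shared session with pooling limits; the numbers are illustrative starting points, not tuned values:

```python
import aiohttp

def make_session() -> aiohttp.ClientSession:
    connector = aiohttp.TCPConnector(
        limit=100,          # total connections in the pool
        limit_per_host=20,  # per-host cap so one slow service can't hog the pool
        ttl_dns_cache=300,  # cache DNS lookups for 5 minutes
    )
    timeout = aiohttp.ClientTimeout(total=30)
    return aiohttp.ClientSession(connector=connector, timeout=timeout)

# Create once at startup, reuse for every outbound call, close on shutdown.
```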
### Parallel execution

Use `asyncio.gather(..., return_exceptions=True)` for independent tool calls, so one failure doesn't cancel the rest of the batch.
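For instance, with `run_tool` as a stub for the real per-tool coroutine:

```python
import asyncio

async def run_tool(call):
    # Stand-in for the real per-tool coroutine.
    await asyncio.sleep(0)
    return call

async def run_all(calls):
    results = await asyncio.gather(
        *(run_tool(c) for c in calls),
        return_exceptions=True,  # failures are returned, not raised
    )
    # Partition results so one failing tool doesn't discard the others.
    ok = [r for r in results if not isinstance(r, BaseException)]
    errors = [r for r in results if isinstance(r, BaseException)]
    return ok, errors
```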
## Analytics & Monitoring

Choose one backend via env vars:

- Prometheus (self-hosted scraping)
- OTLP (Grafana Cloud push)
- SQLite (local dev)

```bash
ANALYTICS_BACKEND=prometheus  # or otlp or sqlite
```

Create the client via `create_analytics_client()`; metrics include success/failure counts, latency, ratings, and health scores. Prefer Prometheus/OTLP over legacy execution logging.
Expose health/metrics endpoints for probes and scraping (FastAPI example):

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health():
    return {"status": "healthy"}
```

Add `/metrics` when using the Prometheus backend.
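A sketch of that endpoint using the standard `prometheus_client` library, assuming the Prometheus backend registers its metrics in the default registry (it extends the `app` from the example above):

```python
from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

@app.get("/metrics")
async def metrics():
    # Serialize the default registry in the Prometheus text exposition format.
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
```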
## Scaling
- Stateless app: keep request state out of process; share cache over NFS/Azure Files.
- Horizontal scale behind a load balancer; no sticky sessions required.
- Kubernetes resource requests: start around 1 CPU/2 GiB; allow headroom for caches.
- Shared cache PVC for embeddings/catalog if using file cache.
## Deployment patterns
- Containers + Kubernetes (preferred): add liveness/readiness probes; mount cache volume if needed; configure secrets via K8s secrets/Key Vault.
- Azure App Service/Container Apps: ensure Managed Identity enabled for Azure OpenAI; configure env vars for Redis/Qdrant.
## Troubleshooting quick checks

- Health: hit `/health`
- Metrics: check `/metrics` when the Prometheus backend is enabled
- Cache status: are Redis/Qdrant reachable? Are fallback behaviors enabled?
- Timeouts: confirm per-tool timeouts and the global request timeout
- Credentials: prefer Managed Identity; if using keys, verify env vars are loaded