Production Deployment
=====================

Deploy LRS-Agents to production with monitoring, logging, and high availability.

Overview
--------

This guide covers:

* Production-ready architecture
* Monitoring and alerting
* Structured logging
* Scaling strategies
* High availability
* Security best practices

Architecture
------------

Recommended Stack
^^^^^^^^^^^^^^^^^

.. code-block:: text

   ┌─────────────────────────────────────────┐
   │          Load Balancer (Nginx)          │
   └─────────────────┬───────────────────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
   ┌─────▼──────┐          ┌─────▼──────┐
   │  LRS API   │          │  LRS API   │  (Multiple instances)
   │  Instance  │          │  Instance  │
   └─────┬──────┘          └─────┬──────┘
         │                       │
         └───────────┬───────────┘
                     │
         ┌───────────┴───────────┐
         │                       │
   ┌─────▼──────┐          ┌─────▼──────┐
   │ PostgreSQL │          │   Redis    │
   │  Database  │          │   Cache    │
   └────────────┘          └────────────┘

Components:

* **Load Balancer**: Distributes traffic across instances
* **LRS API Instances**: Stateless agent execution servers
* **PostgreSQL**: Persistent storage for execution history
* **Redis**: Caching and job queue

Docker Deployment
-----------------

Basic Docker Setup
^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Build image
   docker build -t lrs-agents:latest -f docker/Dockerfile .

   # Run single container
   docker run -d \
     -p 8000:8000 \
     -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
     -e DATABASE_URL="$DATABASE_URL" \
     lrs-agents:latest

Docker Compose
^^^^^^^^^^^^^^

Use Docker Compose for local development and testing:

.. code-block:: bash

   cd docker
   docker-compose up -d

   # Services available:
   # - API: http://localhost:8000
   # - Dashboard: http://localhost:8501
   # - Database: localhost:5432

Production Docker Compose:

.. code-block:: yaml

   # docker-compose.prod.yml
   version: '3.8'

   services:
     lrs-api:
       image: lrsagents/lrs-agents:latest
       deploy:
         replicas: 3
         resources:
           limits:
             cpus: '2'
             memory: 4G
           reservations:
             cpus: '1'
             memory: 2G
       environment:
         # Credentials must match the postgres service below
         - DATABASE_URL=postgresql://lrs:${DB_PASSWORD}@postgres:5432/lrs
         - REDIS_URL=redis://redis:6379/0
         - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
         - LOG_LEVEL=INFO
       depends_on:
         - postgres
         - redis

     postgres:
       image: postgres:15-alpine
       volumes:
         - postgres_data:/var/lib/postgresql/data
       environment:
         POSTGRES_USER: lrs
         POSTGRES_DB: lrs
         POSTGRES_PASSWORD: ${DB_PASSWORD}

     redis:
       image: redis:7-alpine
       volumes:
         # The official Redis image persists data under /data
         - redis_data:/data

     nginx:
       image: nginx:alpine
       ports:
         - "80:80"
         - "443:443"
       volumes:
         - ./nginx.conf:/etc/nginx/nginx.conf
         - ./ssl:/etc/nginx/ssl
       depends_on:
         - lrs-api

   volumes:
     postgres_data:
     redis_data:

Kubernetes Deployment
---------------------

Basic Deployment
^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Create namespace
   kubectl create namespace lrs-agents

   # Apply configurations
   kubectl apply -f k8s/configmap.yaml
   kubectl apply -f k8s/secrets.yaml
   kubectl apply -f k8s/deployment.yaml
   kubectl apply -f k8s/service.yaml
   kubectl apply -f k8s/hpa.yaml

Verify deployment:

.. code-block:: bash

   # Check pods
   kubectl get pods -n lrs-agents

   # Check services
   kubectl get svc -n lrs-agents

   # View logs
   kubectl logs -f deployment/lrs-agents -n lrs-agents

Production Configuration
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

   # k8s/deployment-prod.yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: lrs-agents
     namespace: lrs-agents
   spec:
     replicas: 5  # Start with 5 replicas
     selector:
       matchLabels:
         app: lrs-agents
     template:
       metadata:
         labels:
           app: lrs-agents
       spec:
         containers:
         - name: lrs-api
           image: lrsagents/lrs-agents:v0.2.0  # Pin version
           resources:
             requests:
               memory: "2Gi"
               cpu: "500m"
             limits:
               memory: "4Gi"
               cpu: "2000m"
           env:
           - name: DATABASE_URL
             valueFrom:
               secretKeyRef:
                 name: lrs-secrets
                 key: database-url
           livenessProbe:
             httpGet:
               path: /health
               port: 8000
             initialDelaySeconds: 30
             periodSeconds: 10
           readinessProbe:
             httpGet:
               path: /health
               port: 8000
             initialDelaySeconds: 10
             periodSeconds: 5

Auto-scaling:

.. code-block:: yaml

   # k8s/hpa-prod.yaml
   apiVersion: autoscaling/v2
   kind: HorizontalPodAutoscaler
   metadata:
     name: lrs-agents-hpa
     namespace: lrs-agents
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: Deployment
       name: lrs-agents
     minReplicas: 5
     maxReplicas: 20
     metrics:
     - type: Resource
       resource:
         name: cpu
         target:
           type: Utilization
           averageUtilization: 70
     - type: Resource
       resource:
         name: memory
         target:
           type: Utilization
           averageUtilization: 80

Monitoring
----------

Structured Logging
^^^^^^^^^^^^^^^^^^

Set up structured logging for production:

.. code-block:: python

   from lrs.monitoring.structured_logging import create_logger_for_agent
   import logging

   # Create logger
   logger = create_logger_for_agent(
       agent_id="production_agent",
       log_file="/var/log/lrs/agent.jsonl",
       console=False,  # Disable console in production
       level=logging.INFO
   )

   # Log events
   logger.log_tool_execution(
       tool_name="fetch_api",
       success=True,
       execution_time=150.5,
       prediction_error=0.1
   )

   logger.log_adaptation_event(
       trigger="High prediction error",
       old_precision=0.6,
       new_precision=0.4,
       action="Explore alternatives"
   )

Log Aggregation
^^^^^^^^^^^^^^^

Send logs to a centralized system:

**ELK Stack (Elasticsearch, Logstash, Kibana):**

.. code-block:: yaml

   # filebeat.yml
   filebeat.inputs:
   - type: log
     enabled: true
     paths:
       - /var/log/lrs/*.jsonl
     json.keys_under_root: true
     json.add_error_key: true

   output.elasticsearch:
     hosts: ["elasticsearch:9200"]

**Datadog:**

.. code-block:: python

   import os

   from datadog import initialize, statsd

   # Initialize Datadog
   initialize(api_key=os.getenv('DATADOG_API_KEY'))

   # Send metrics (precision_value and success_rate are computed by your application)
   statsd.increment('lrs.agent.execution')
   statsd.histogram('lrs.precision', precision_value)
   statsd.gauge('lrs.tool.success_rate', success_rate)

Metrics and Alerting
^^^^^^^^^^^^^^^^^^^^

Expose Prometheus metrics:

.. code-block:: python

   from prometheus_client import Counter, Histogram, Gauge, start_http_server

   # Define metrics
   agent_runs = Counter('lrs_agent_runs_total', 'Total agent runs')
   tool_executions = Counter('lrs_tool_executions_total', 'Total tool executions', ['tool', 'status'])
   # Queried by the "Adaptation Events" Grafana panel in the Dashboard section
   adaptation_events = Counter('lrs_adaptation_events_total', 'Total adaptation events')
   precision_value = Gauge('lrs_precision_value', 'Current precision', ['level'])
   execution_time = Histogram('lrs_execution_time_seconds', 'Execution time')

   # Start metrics server before doing any work so every sample can be scraped
   start_http_server(9090)

   # Record metrics
   agent_runs.inc()
   tool_executions.labels(tool='fetch_api', status='success').inc()
   precision_value.labels(level='execution').set(0.75)

   with execution_time.time():
       result = agent.run(task)

Prometheus configuration:

.. code-block:: yaml

   # prometheus.yml
   scrape_configs:
     - job_name: 'lrs-agents'
       scrape_interval: 15s
       static_configs:
         - targets: ['lrs-api:9090']

Alerting rules:

.. code-block:: yaml

   # alerts.yml
   groups:
     - name: lrs_agents
       rules:
         # Fires when any tool records more than 0.5 failures per second
         - alert: HighFailureRate
           expr: rate(lrs_tool_executions_total{status="failure"}[5m]) > 0.5
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "High tool failure rate"

         - alert: LowPrecision
           expr: lrs_precision_value{level="execution"} < 0.3
           for: 10m
           labels:
             severity: warning
           annotations:
             summary: "Agent precision consistently low"

         - alert: ServiceDown
           expr: up{job="lrs-agents"} == 0
           for: 2m
           labels:
             severity: critical
           annotations:
             summary: "LRS-Agents service is down"

Dashboard
^^^^^^^^^

Run the Streamlit dashboard for real-time monitoring:

.. code-block:: bash

   # In separate container/pod
   streamlit run lrs/monitoring/dashboard.py --server.port=8501

Grafana dashboards:

.. code-block:: json

   {
     "dashboard": {
       "title": "LRS-Agents Monitoring",
       "panels": [
         {
           "title": "Precision Over Time",
           "targets": [
             {
               "expr": "lrs_precision_value{level=\"execution\"}"
             }
           ]
         },
         {
           "title": "Tool Success Rate",
           "targets": [
             {
               "expr": "sum(rate(lrs_tool_executions_total{status=\"success\"}[5m])) / sum(rate(lrs_tool_executions_total[5m]))"
             }
           ]
         },
         {
           "title": "Adaptation Events",
           "targets": [
             {
               "expr": "rate(lrs_adaptation_events_total[5m])"
             }
           ]
         }
       ]
     }
   }

Database Management
-------------------

Schema Setup
^^^^^^^^^^^^

Initialize the production database:

.. code-block:: bash

   # Run migrations
   psql "$DATABASE_URL" < docker/init.sql

   # Or use migration tool
   alembic upgrade head

Connection Pooling
^^^^^^^^^^^^^^^^^^

Configure connection pooling:

.. code-block:: python

   import os

   from sqlalchemy import create_engine
   from sqlalchemy.pool import QueuePool

   DATABASE_URL = os.environ["DATABASE_URL"]

   engine = create_engine(
       DATABASE_URL,
       poolclass=QueuePool,
       pool_size=20,        # Connections per instance
       max_overflow=10,     # Additional connections
       pool_timeout=30,     # Wait timeout
       pool_recycle=3600,   # Recycle connections after 1 hour
       pool_pre_ping=True   # Verify connections before use
   )

Backup Strategy
^^^^^^^^^^^^^^^

Automated backups:

.. code-block:: bash

   #!/bin/bash
   # backup.sh
   # Stop on the first error so a failed dump is never uploaded
   set -euo pipefail

   DATE=$(date +%Y%m%d_%H%M%S)
   BACKUP_FILE="lrs_backup_$DATE.sql"

   # Create backup
   pg_dump "$DATABASE_URL" > "$BACKUP_FILE"

   # Compress
   gzip "$BACKUP_FILE"

   # Upload to S3
   aws s3 cp "$BACKUP_FILE.gz" s3://lrs-backups/

   # Clean up old local backups (keep last 30 days)
   find . -name "lrs_backup_*.sql.gz" -mtime +30 -delete

Schedule with cron:

.. code-block:: text

   # Daily backups at 2 AM
   0 2 * * * /path/to/backup.sh

Security
--------

API Authentication
^^^^^^^^^^^^^^^^^^

Implement JWT authentication:

.. code-block:: python

   import os

   from fastapi import Depends, HTTPException
   from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
   import jwt

   # `app` is the FastAPI application created elsewhere.
   # Load the signing key from the environment; the variable name is illustrative.
   SECRET_KEY = os.environ["JWT_SECRET_KEY"]

   security = HTTPBearer()

   def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
       try:
           payload = jwt.decode(
               credentials.credentials,
               SECRET_KEY,
               algorithms=["HS256"]
           )
           return payload
       except jwt.InvalidTokenError:
           raise HTTPException(status_code=401, detail="Invalid token")

   @app.post("/api/agent/run")
   async def run_agent(task: str, token: dict = Depends(verify_token)):
       # Execute agent
       pass

Environment Variables
^^^^^^^^^^^^^^^^^^^^^

Securely manage secrets:

.. code-block:: bash

   # Never commit secrets to version control
   # Use environment variables or secret management

   # Development
   export ANTHROPIC_API_KEY="sk-ant-..."

   # Production - Use secret management
   # AWS Secrets Manager
   aws secretsmanager get-secret-value --secret-id lrs/api-keys

   # Kubernetes Secrets (same namespace as the deployment)
   kubectl create secret generic lrs-secrets \
     --namespace lrs-agents \
     --from-literal=anthropic-api-key=sk-ant-...

Rate Limiting
^^^^^^^^^^^^^

Implement rate limiting:

.. code-block:: python

   from fastapi import Request
   from slowapi import Limiter, _rate_limit_exceeded_handler
   from slowapi.errors import RateLimitExceeded
   from slowapi.util import get_remote_address

   # Behind a proxy such as Nginx, get_remote_address sees the proxy's IP
   # unless the client address is forwarded to the application.
   limiter = Limiter(key_func=get_remote_address)

   # Register the limiter and its 429 handler on the existing FastAPI `app`
   app.state.limiter = limiter
   app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

   @app.post("/api/agent/run")
   @limiter.limit("10/minute")  # 10 requests per minute
   async def run_agent(request: Request, task: str):
       # Execute agent
       pass

Performance Optimization
------------------------

Caching
^^^^^^^

Implement Redis caching:

.. code-block:: python

   import redis
   import hashlib
   import json

   redis_client = redis.Redis(host='redis', port=6379, db=0)

   def cache_agent_result(task: str, result: dict, ttl: int = 3600):
       """Cache agent execution result"""
       cache_key = hashlib.md5(task.encode()).hexdigest()
       redis_client.setex(cache_key, ttl, json.dumps(result))

   def get_cached_result(task: str):
       """Get cached result if available"""
       cache_key = hashlib.md5(task.encode()).hexdigest()
       cached = redis_client.get(cache_key)
       return json.loads(cached) if cached else None

   # Usage
   result = get_cached_result(task)
   if result is None:  # Only a cache miss triggers a run; empty results stay cached
       result = agent.run(task)
       cache_agent_result(task, result)

Async Execution
^^^^^^^^^^^^^^^

Use async for better throughput:

.. code-block:: python

   import asyncio
   from concurrent.futures import ThreadPoolExecutor

   executor = ThreadPoolExecutor(max_workers=10)

   async def run_agent_async(task: str):
       """Run agent in thread pool"""
       loop = asyncio.get_running_loop()
       result = await loop.run_in_executor(
           executor,
           agent.run,
           task
       )
       return result

   async def run_all(task_list):
       """Handle multiple requests concurrently"""
       tasks = [run_agent_async(t) for t in task_list]
       return await asyncio.gather(*tasks)

Resource Limits
^^^^^^^^^^^^^^^

Set resource limits:

.. code-block:: python

   import signal

   # Limit maximum iterations
   result = agent.run(task, max_iterations=50)

   # Timeout protection
   # Note: SIGALRM is Unix-only and can only be used from the main thread,
   # so this pattern does not apply to agents running in a thread pool.
   def timeout_handler(signum, frame):
       raise TimeoutError("Agent execution timeout")

   signal.signal(signal.SIGALRM, timeout_handler)
   signal.alarm(300)  # 5 minute timeout

   try:
       result = agent.run(task)
   except TimeoutError:
       logger.error("Agent execution timed out")
   finally:
       signal.alarm(0)

Health Checks
-------------

Implement health check endpoint:

.. code-block:: python

   import os

   from fastapi import FastAPI, status
   from fastapi.responses import JSONResponse
   from sqlalchemy import text

   app = FastAPI()

   # `engine` and `redis_client` are the objects created in the
   # Connection Pooling and Caching sections.

   @app.get("/health")
   async def health_check():
       """Health check endpoint"""
       health = {
           "status": "healthy",
           "version": "0.2.0",
           "checks": {}
       }

       # Check database
       try:
           with engine.connect() as conn:
               conn.execute(text("SELECT 1"))
           health["checks"]["database"] = "ok"
       except Exception as e:
           health["status"] = "unhealthy"
           health["checks"]["database"] = f"error: {str(e)}"

       # Check Redis
       try:
           redis_client.ping()
           health["checks"]["redis"] = "ok"
       except Exception as e:
           health["status"] = "unhealthy"
           health["checks"]["redis"] = f"error: {str(e)}"

       # Check API keys
       if not os.getenv("ANTHROPIC_API_KEY"):
           health["status"] = "unhealthy"
           health["checks"]["api_keys"] = "missing"
       else:
           health["checks"]["api_keys"] = "ok"

       status_code = (
           status.HTTP_200_OK
           if health["status"] == "healthy"
           else status.HTTP_503_SERVICE_UNAVAILABLE
       )

       # Return an explicit response so the status code is actually applied;
       # returning a (body, status) tuple would be serialized as a JSON list with HTTP 200.
       return JSONResponse(content=health, status_code=status_code)

Troubleshooting
---------------

Common Issues
^^^^^^^^^^^^^

**High Memory Usage:**

.. code-block:: bash

   # Check memory usage
   kubectl top pods -n lrs-agents

   # Increase memory limits
   # Update deployment.yaml and apply

**Database Connection Errors:**

.. code-block:: python

   # Enable connection pooling
   # Add pool_pre_ping=True
   # Increase pool_size

**Slow Response Times:**

.. code-block:: bash

   # Check logs for slow operations
   kubectl logs -f deployment/lrs-agents -n lrs-agents | grep "execution_time"

   # Enable caching
   # Scale horizontally

Debug Mode
^^^^^^^^^^

Enable debug logging:

.. code-block:: bash

   # Set environment variable
   export LOG_LEVEL=DEBUG

   # Or in Kubernetes
   kubectl set env deployment/lrs-agents LOG_LEVEL=DEBUG -n lrs-agents

Checklist
---------

Pre-deployment:

* [ ] API keys configured
* [ ] Database initialized
* [ ] Secrets properly managed
* [ ] Resource limits set
* [ ] Health checks implemented
* [ ] Monitoring configured
* [ ] Logging set up
* [ ] Backups automated
* [ ] Rate limiting enabled
* [ ] Load balancing configured

Post-deployment:

* [ ] Health checks passing
* [ ] Metrics being collected
* [ ] Logs aggregating correctly
* [ ] Alerts configured
* [ ] Dashboard accessible
* [ ] Performance acceptable
* [ ] Error rate within limits

Next Steps
----------

* Set up monitoring with Prometheus/Grafana
* Configure log aggregation (ELK, Datadog)
* Implement CI/CD pipeline
* Load test your deployment
* Document runbooks for common issues
* Set up on-call rotation

Further Reading
---------------

* :doc:`../api/monitoring` - Monitoring API reference
* :doc:`../tutorials/07_production_deployment` - Production tutorial
* Kubernetes documentation
* Docker best practices
* Prometheus operator guide