# instrumentation-analysis

Design strategic logging and monitoring points for production debugging.

## Context

This task analyzes code to identify optimal locations for instrumentation (logging, metrics, tracing) that will aid in debugging production issues without impacting performance. It creates a comprehensive observability strategy.

## Task Execution

### Phase 1: Critical Path Analysis

#### Identify Key Flows

1. **User-Facing Paths**: Request → Response chains
2. **Business-Critical Paths**: Payment, authentication, data processing
3. **Performance-Sensitive Paths**: High-frequency operations
4. **Error-Prone Paths**: Historical problem areas
5. **Integration Points**: External service calls

#### Map Decision Points

- Conditional branches with business logic
- State transitions
- Error handling blocks
- Retry mechanisms
- Circuit breakers
- Cache hits/misses

### Phase 2: Instrumentation Strategy

#### Level 1: Essential Instrumentation

**Entry/Exit Points:**
```
- Service boundaries (API endpoints)
- Function entry/exit for critical operations
- Database transaction boundaries
- External service calls
- Message queue operations
```

**Error Conditions:**
```
- Exception catches
- Validation failures
- Timeout occurrences
- Retry attempts
- Fallback activations
```

**Performance Markers:**
```
- Operation start/end times
- Queue depths
- Resource utilization
- Batch sizes
- Cache effectiveness
```

#### Level 2: Diagnostic Instrumentation

**State Changes:**
```
- User state transitions
- Order/payment status changes
- Configuration updates
- Feature flag toggles
- Circuit breaker state changes
```

**Business Events:**
```
- User actions (login, purchase, etc.)
- System events (startup, shutdown)
- Scheduled job execution
- Data pipeline stages
- Workflow transitions
```

#### Level 3: Deep Debugging

**Detailed Tracing:**
```
- Parameter values for complex functions
- Intermediate calculation results
- Loop iteration counts
- Branch decisions
- SQL query parameters
```

### Phase 3: Implementation Patterns

#### Structured Logging Format

**Standard Fields:**
```json
{
  "timestamp": "ISO-8601",
  "level": "INFO|WARN|ERROR",
  "service": "service-name",
  "trace_id": "correlation-id",
  "span_id": "operation-id",
  "user_id": "if-applicable",
  "operation": "what-is-happening",
  "duration_ms": "for-completed-ops",
  "status": "success|failure",
  "error": "error-details-if-any",
  "metadata": {
    "custom": "fields"
  }
}
```

#### Performance-Conscious Patterns

**Sampling Strategy:**
```
- 100% for errors
- 10% for normal operations
- 1% for high-frequency paths
- Dynamic adjustment based on load
```

**Async Logging:**
```
- Buffer non-critical logs
- Batch write to reduce I/O
- Use separate thread/process
- Implement backpressure handling
```

**Conditional Logging:**
```
- Debug level only in development
- Info level in staging
- Warn/Error in production
- Dynamic level adjustment via config
```
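The sketch below ties these patterns together: it emits the standard structured fields, applies per-level sampling, and buffers writes for batched asynchronous I/O. It is a minimal illustration, not a prescribed API — `BufferedLogger`, `SAMPLE_RATES`, and the `transport` sink are hypothetical names standing in for whatever logging library and destination the codebase actually uses.

```javascript
// Hypothetical sink: stdout here, but could be a file or a log shipper.
const transport = {
  write: (lines) => process.stdout.write(lines.join('\n') + '\n'),
};

// Per-level sampling: keep all errors, thin out chattier levels.
const SAMPLE_RATES = { ERROR: 1.0, WARN: 1.0, INFO: 0.1, DEBUG: 0.01 };

class BufferedLogger {
  constructor({ service, flushIntervalMs = 1000, maxBuffer = 500 }) {
    this.service = service;
    this.buffer = [];
    this.maxBuffer = maxBuffer;
    // Flush on a timer; unref() so the timer never keeps the process alive.
    setInterval(() => this.flush(), flushIntervalMs).unref();
  }

  log(level, operation, fields = {}) {
    // Sampling decision: drop the entry with probability (1 - rate).
    if (Math.random() >= (SAMPLE_RATES[level] ?? 1.0)) return;
    this.buffer.push(JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      service: this.service,
      operation,
      ...fields, // trace_id, user_id, duration_ms, status, etc.
    }));
    // Crude backpressure: drop the oldest entries rather than grow unbounded.
    if (this.buffer.length > this.maxBuffer) {
      this.buffer.splice(0, this.buffer.length - this.maxBuffer);
    }
  }

  flush() {
    if (this.buffer.length === 0) return;
    transport.write(this.buffer); // one batched write instead of N small ones
    this.buffer = [];
  }
}

// Usage
const logger = new BufferedLogger({ service: 'checkout' });
logger.log('INFO', 'order_created', {
  trace_id: 'abc123',
  duration_ms: 42,
  status: 'success',
});
```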
### Phase 4: Metrics Design

#### Key Metrics to Track

**RED Metrics:**

- **Rate**: Requests per second
- **Errors**: Error rate/count
- **Duration**: Response time distribution

**USE Metrics:**

- **Utilization**: Resource usage percentage
- **Saturation**: Queue depth, wait time
- **Errors**: Resource allocation failures

**Business Metrics:**

- Transaction success rate
- Feature usage
- User journey completion
- Revenue impact

#### Metric Implementation

**Counter Examples:**
```
requests_total{method="GET", endpoint="/api/users", status="200"}
errors_total{type="database", operation="insert"}
```

**Histogram Examples:**
```
request_duration_seconds{method="GET", endpoint="/api/users"}
database_query_duration_ms{query_type="select", table="users"}
```

**Gauge Examples:**
```
active_connections{service="database"}
queue_depth{queue="email"}
```
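As one way to register and update these metric types in code, here is a sketch assuming the Node `prom-client` library; the metric names mirror the examples above, and `handleRequest` is a hypothetical handler used only for illustration.

```javascript
const client = require('prom-client'); // assumed dependency

// Counter: monotonically increasing event count.
const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
});

// Histogram: latency distribution, enabling percentile queries.
const requestDuration = new client.Histogram({
  name: 'request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'endpoint'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});

// Gauge: point-in-time value that can go up or down.
const queueDepth = new client.Gauge({
  name: 'queue_depth',
  help: 'Current queue depth',
  labelNames: ['queue'],
});

// Usage in a hypothetical request handler
async function handleRequest(req, res) {
  const end = requestDuration.startTimer({ method: req.method, endpoint: req.path });
  try {
    // ... handle the request ...
    httpRequests.inc({ method: req.method, endpoint: req.path, status: '200' });
  } finally {
    end(); // observes elapsed seconds into the histogram
  }
}

queueDepth.set({ queue: 'email' }, 42);
```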
### Phase 5: Tracing Strategy

#### Distributed Tracing Points

**Span Creation:**
```
- HTTP request handling
- Database operations
- Cache operations
- External API calls
- Message publishing/consuming
- Background job execution
```

**Context Propagation:**
```
- HTTP headers (X-Trace-Id)
- Message metadata
- Database comments
- Log correlation
```

## Output Format

````markdown
# Instrumentation Analysis Report

## Executive Summary

**Components Analyzed:** [count]
**Current Coverage:** [percentage]
**Recommended Additions:** [count]
**Performance Impact:** [minimal/low/moderate]
**Implementation Effort:** [hours/days]

## Critical Instrumentation Points

### Priority 1: Immediate Implementation

#### Service: [ServiceName]

**Entry Points:**
```[language]
// Location: [file:line]
// Current: No logging
// Recommended:
logger.info("Request received", {
  method: req.method,
  path: req.path,
  user_id: req.user?.id,
  trace_id: req.traceId
});
```

**Error Handling:**
```[language]
// Location: [file:line]
// Current: Silent failure
// Recommended:
logger.error("Database operation failed", {
  operation: "user_update",
  user_id: userId,
  error: err.message,
  stack: err.stack,
  retry_count: retries
});
```

**Performance Tracking:**
```[language]
// Location: [file:line]
// Recommended:
const startTime = Date.now();
try {
  const result = await expensiveOperation();
  metrics.histogram('operation_duration_ms', Date.now() - startTime, {
    operation: 'expensive_operation',
    status: 'success'
  });
  return result;
} catch (error) {
  metrics.histogram('operation_duration_ms', Date.now() - startTime, {
    operation: 'expensive_operation',
    status: 'failure'
  });
  throw error;
}
```

### Priority 2: Enhanced Observability

[Similar format for medium-priority points]

### Priority 3: Deep Debugging

[Similar format for low-priority points]

## Logging Strategy

### Log Levels by Environment

| Level | Development | Staging | Production |
| ----- | ----------- | ------- | ---------- |
| DEBUG | ✓           | ✓       | ✗          |
| INFO  | ✓           | ✓       | Sampled    |
| WARN  | ✓           | ✓       | ✓          |
| ERROR | ✓           | ✓       | ✓          |

### Sampling Configuration

```yaml
sampling:
  default: 0.01 # 1% sampling
  rules:
    - path: '/health'
      sample_rate: 0.001 # 0.1% for health checks
    - path: '/api/critical/*'
      sample_rate: 0.1 # 10% for critical APIs
    - level: 'ERROR'
      sample_rate: 1.0 # 100% for errors
```

## Metrics Implementation

### Application Metrics

```[language]
// Metric definitions
const metrics = {
  // Counters
  requests: new Counter('http_requests_total', ['method', 'path', 'status']),
  errors: new Counter('errors_total', ['type', 'operation']),

  // Histograms
  duration: new Histogram('request_duration_ms', ['method', 'path']),
  dbDuration: new Histogram('db_query_duration_ms', ['operation', 'table']),

  // Gauges
  connections: new Gauge('active_connections', ['type']),
  queueSize: new Gauge('queue_size', ['queue_name'])
};
```

### Dashboard Queries

```sql
-- Error rate by endpoint
SELECT endpoint, sum(errors) / sum(requests) as error_rate
FROM metrics
WHERE time > now() - 1h
GROUP BY endpoint

-- P95 latency
SELECT endpoint, percentile(duration, 0.95) as p95_latency
FROM metrics
WHERE time > now() - 1h
GROUP BY endpoint
```

## Tracing Implementation

### Trace Context

```[language]
// Trace context propagation
class TraceContext {
  constructor(traceId, spanId, parentSpanId) {
    this.traceId = traceId || generateId();
    this.spanId = spanId || generateId();
    this.parentSpanId = parentSpanId;
  }

  createChild() {
    return new TraceContext(this.traceId, generateId(), this.spanId);
  }
}

// Usage
middleware.use((req, res, next) => {
  req.trace = new TraceContext(
    req.headers['x-trace-id'],
    req.headers['x-span-id'],
    req.headers['x-parent-span-id']
  );
  next();
});
```

## Performance Considerations

### Impact Analysis

| Instrumentation Type | CPU Impact | Memory Impact | I/O Impact     |
| -------------------- | ---------- | ------------- | -------------- |
| Structured Logging   | < 1%       | < 10MB        | Async buffered |
| Metrics Collection   | < 0.5%     | < 5MB         | Batched        |
| Distributed Tracing  | < 2%       | < 20MB        | Sampled        |

### Optimization Techniques

1. Use async logging with buffers
2. Implement sampling for high-frequency paths
3. Batch metric submissions
4. Use conditional compilation for debug logs
5. Implement circuit breakers for logging systems

## Implementation Plan

### Phase 1: Week 1

- [ ] Implement critical error logging
- [ ] Add service boundary instrumentation
- [ ] Set up basic metrics

### Phase 2: Week 2

- [ ] Add performance tracking
- [ ] Implement distributed tracing
- [ ] Create initial dashboards

### Phase 3: Week 3

- [ ] Add business event tracking
- [ ] Implement sampling strategies
- [ ] Performance optimization

## Monitoring & Alerts

### Critical Alerts

```yaml
- name: high_error_rate
  condition: error_rate > 0.01
  severity: critical
- name: high_latency
  condition: p95_latency > 1000ms
  severity: warning
- name: service_down
  condition: health_check_failures > 3
  severity: critical
```

## Validation Checklist

- [ ] No sensitive data in logs
- [ ] Trace IDs properly propagated
- [ ] Sampling rates appropriate
- [ ] Performance impact acceptable
- [ ] Dashboards created
- [ ] Alerts configured
- [ ] Documentation updated
````

## Completion Criteria

- [ ] Critical paths identified
- [ ] Instrumentation points mapped
- [ ] Logging strategy defined
- [ ] Metrics designed
- [ ] Tracing plan created
- [ ] Performance impact assessed
- [ ] Implementation plan created
- [ ] Monitoring strategy defined