# instrumentation-analysis

Design strategic logging and monitoring points for production debugging.
## Context

This task analyzes code to identify optimal locations for instrumentation (logging, metrics, tracing) that will aid in debugging production issues with minimal performance impact. The result is a comprehensive observability strategy.
## Task Execution

### Phase 1: Critical Path Analysis

#### Identify Key Flows
- User-Facing Paths: Request → Response chains
- Business-Critical Paths: Payment, authentication, data processing
- Performance-Sensitive Paths: High-frequency operations
- Error-Prone Paths: Historical problem areas
- Integration Points: External service calls
#### Map Decision Points
- Conditional branches with business logic
- State transitions
- Error handling blocks
- Retry mechanisms
- Circuit breakers
- Cache hits/misses (an instrumented example follows this list)
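
For example, a cache lookup is a decision point worth one metric per branch. A minimal sketch, with `cache`, `db`, and `metrics` as injected placeholders rather than real dependencies:

```javascript
// Instrumenting a cache hit/miss decision point.
// `cache`, `db`, and `metrics` are placeholders injected for this sketch.
async function getUser(id, { cache, db, metrics }) {
  const cached = await cache.get(`user:${id}`);
  // One counter per branch makes cache effectiveness visible on a dashboard.
  metrics.increment('cache_requests_total', {
    cache: 'user',
    result: cached ? 'hit' : 'miss',
  });
  if (cached) return cached;
  const user = await db.findUserById(id); // the miss path is the slow path worth tracing
  await cache.set(`user:${id}`, user, { ttlSeconds: 300 });
  return user;
}
```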
### Phase 2: Instrumentation Strategy

#### Level 1: Essential Instrumentation

**Entry/Exit Points:**
- Service boundaries (API endpoints)
- Function entry/exit for critical operations
- Database transaction boundaries
- External service calls
- Message queue operations
**Error Conditions:**
- Exception catches
- Validation failures
- Timeout occurrences
- Retry attempts
- Fallback activations
**Performance Markers:**
- Operation start/end times
- Queue depths
- Resource utilization
- Batch sizes
- Cache effectiveness
#### Level 2: Diagnostic Instrumentation

**State Changes:**
- User state transitions
- Order/payment status changes
- Configuration updates
- Feature flag toggles
- Circuit breaker state changes
**Business Events:**
- User actions (login, purchase, etc.)
- System events (startup, shutdown)
- Scheduled job execution
- Data pipeline stages
- Workflow transitions
#### Level 3: Deep Debugging

**Detailed Tracing:**
- Parameter values for complex functions
- Intermediate calculation results
- Loop iteration counts
- Branch decisions
- SQL query parameters
### Phase 3: Implementation Patterns

#### Structured Logging Format

**Standard Fields:**
```json
{
  "timestamp": "ISO-8601",
  "level": "INFO|WARN|ERROR",
  "service": "service-name",
  "trace_id": "correlation-id",
  "span_id": "operation-id",
  "user_id": "if-applicable",
  "operation": "what-is-happening",
  "duration_ms": "for-completed-ops",
  "status": "success|failure",
  "error": "error-details-if-any",
  "metadata": {
    "custom": "fields"
  }
}
```
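
A minimal emitter for this shape, as a sketch (the `SERVICE_NAME` environment variable and the one-JSON-object-per-line-to-stdout transport are assumptions, not requirements):

```javascript
// Minimal structured logger sketch: one JSON object per line to stdout.
// Field names follow the standard-fields schema above.
function log(level, operation, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(), // ISO-8601
    level,
    service: process.env.SERVICE_NAME || 'unknown',
    operation,
    ...fields, // trace_id, span_id, user_id, duration_ms, status, error, metadata
  };
  process.stdout.write(JSON.stringify(entry) + '\n');
}

// Usage: a completed operation with timing and correlation IDs.
log('INFO', 'user_update', {
  trace_id: 'abc123',
  span_id: 'def456',
  user_id: '42',
  duration_ms: 18,
  status: 'success',
});
```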
#### Performance-Conscious Patterns

**Sampling Strategy:**
- 100% for errors
- 10% for normal operations
- 1% for high-frequency paths
- Dynamic adjustment based on load
**Async Logging:**
- Buffer non-critical logs
- Batch write to reduce I/O
- Use separate thread/process
- Implement backpressure handling (see the sketch after this list)
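
A minimal sketch of such a logger; buffer size, flush interval, and the drop-oldest backpressure policy are illustrative choices, and a production system would more likely use a library with async transports:

```javascript
// Async buffered logger sketch: batches entries and flushes on a timer.
class BufferedLogger {
  constructor({ maxBuffer = 10000, flushMs = 1000, sink = process.stdout } = {}) {
    this.buffer = [];
    this.maxBuffer = maxBuffer;
    this.sink = sink;
    this.timer = setInterval(() => this.flush(), flushMs);
    this.timer.unref(); // logging alone should not keep the process alive
  }

  log(entry) {
    if (this.buffer.length >= this.maxBuffer) {
      this.buffer.shift(); // backpressure: drop the oldest non-critical entry
    }
    this.buffer.push(entry);
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0, this.buffer.length);
    // One write per batch to reduce I/O syscalls.
    this.sink.write(batch.map((e) => JSON.stringify(e)).join('\n') + '\n');
  }
}
```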
**Conditional Logging:**
- Debug level only in development
- Info level in staging
- Warn/Error in production
- Dynamic level adjustment via config (sketched below)
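
A sketch of runtime-adjustable levels; the `LOG_LEVEL` environment variable and the `setLogLevel` hook (which a config watcher or admin endpoint would call) are assumptions:

```javascript
// Numeric severities let a single threshold gate what gets emitted.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };
let threshold = LEVELS[process.env.LOG_LEVEL] || LEVELS.INFO;

// Called by a config watcher or admin endpoint to adjust verbosity
// at runtime, without a restart.
function setLogLevel(level) {
  if (LEVELS[level]) threshold = LEVELS[level];
}

function shouldLog(level) {
  return LEVELS[level] >= threshold;
}
```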
### Phase 4: Metrics Design

#### Key Metrics to Track

**RED Metrics:**
- Rate: Requests per second
- Errors: Error rate/count
- Duration: Response time distribution
**USE Metrics:**
- Utilization: Resource usage percentage
- Saturation: Queue depth, wait time
- Errors: Resource allocation failures
**Business Metrics:**
- Transaction success rate
- Feature usage
- User journey completion
- Revenue impact
#### Metric Implementation

**Counter Examples:**
```text
requests_total{method="GET", endpoint="/api/users", status="200"}
errors_total{type="database", operation="insert"}
```

**Histogram Examples:**
```text
request_duration_seconds{method="GET", endpoint="/api/users"}
database_query_duration_ms{query_type="select", table="users"}
```

**Gauge Examples:**
```text
active_connections{service="database"}
queue_depth{queue="email"}
```
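In a Node service these metric types map naturally onto a metrics client. A sketch using prom-client (the library choice is an assumption; the metric names mirror the examples above):

```javascript
const client = require('prom-client');

const requestsTotal = new client.Counter({
  name: 'requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
});

const queueDepth = new client.Gauge({
  name: 'queue_depth',
  help: 'Current depth of a work queue',
  labelNames: ['queue'],
});

// Recording observations against labeled series:
requestsTotal.inc({ method: 'GET', endpoint: '/api/users', status: '200' });
queueDepth.set({ queue: 'email' }, 42);
```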
### Phase 5: Tracing Strategy

#### Distributed Tracing Points

**Span Creation:**
- HTTP request handling
- Database operations
- Cache operations
- External API calls
- Message publishing/consuming
- Background job execution
**Context Propagation:**
- HTTP headers (X-Trace-Id; propagation sketched after this list)
- Message metadata
- Database comments
- Log correlation
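
As one illustration, outgoing HTTP calls might forward the trace headers like this. A sketch assuming the TraceContext shape defined under Tracing Implementation below and Node's global `fetch`:

```javascript
// Propagating trace context on outgoing HTTP calls via headers.
async function tracedFetch(url, trace, options = {}) {
  const child = trace.createChild(); // new span for the downstream call
  return fetch(url, {
    ...options,
    headers: {
      ...options.headers,
      'X-Trace-Id': child.traceId,
      'X-Span-Id': child.spanId,
      'X-Parent-Span-Id': child.parentSpanId,
    },
  });
}
```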
## Output Format
# Instrumentation Analysis Report
## Executive Summary
**Components Analyzed:** [count]
**Current Coverage:** [percentage]
**Recommended Additions:** [count]
**Performance Impact:** [minimal/low/moderate]
**Implementation Effort:** [hours/days]
## Critical Instrumentation Points
### Priority 1: Immediate Implementation
#### Service: [ServiceName]
**Entry Points:**
```[language]
// Location: [file:line]
// Current: No logging
// Recommended:
logger.info("Request received", {
  method: req.method,
  path: req.path,
  user_id: req.user?.id,
  trace_id: req.traceId
});
```
**Error Handling:**
```[language]
// Location: [file:line]
// Current: Silent failure
// Recommended:
logger.error("Database operation failed", {
  operation: "user_update",
  user_id: userId,
  error: err.message,
  stack: err.stack,
  retry_count: retries
});
```
**Performance Tracking:**
```[language]
// Location: [file:line]
// Recommended:
const startTime = Date.now();
try {
  const result = await expensiveOperation();
  metrics.histogram('operation_duration_ms', Date.now() - startTime, {
    operation: 'expensive_operation',
    status: 'success'
  });
  return result;
} catch (error) {
  metrics.histogram('operation_duration_ms', Date.now() - startTime, {
    operation: 'expensive_operation',
    status: 'failure'
  });
  throw error;
}
```
### Priority 2: Enhanced Observability
[Similar format for medium priority points]
### Priority 3: Deep Debugging
[Similar format for low priority points]
## Logging Strategy

### Log Levels by Environment
| Level | Development | Staging | Production |
|---|---|---|---|
| DEBUG | ✓ | ✓ | ✗ |
| INFO | ✓ | ✓ | Sampled |
| WARN | ✓ | ✓ | ✓ |
| ERROR | ✓ | ✓ | ✓ |
### Sampling Configuration

```yaml
sampling:
  default: 0.01  # 1% sampling
  rules:
    - path: '/health'
      sample_rate: 0.001  # 0.1% for health checks
    - path: '/api/critical/*'
      sample_rate: 0.1  # 10% for critical APIs
    - level: 'ERROR'
      sample_rate: 1.0  # 100% for errors
```
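
A sampler honoring this configuration could look like the following sketch; first matching rule wins, and the trailing-`*` prefix match is a deliberate simplification of real glob matching:

```javascript
const config = {
  default: 0.01,
  rules: [
    { path: '/health', sampleRate: 0.001 },
    { path: '/api/critical/*', sampleRate: 0.1 },
    { level: 'ERROR', sampleRate: 1.0 },
  ],
};

// Decide whether to record this event, given its path and level.
function shouldSample({ path, level }) {
  for (const rule of config.rules) {
    const pathMatch =
      rule.path &&
      (rule.path.endsWith('*')
        ? path.startsWith(rule.path.slice(0, -1))
        : path === rule.path);
    if (pathMatch || rule.level === level) {
      return Math.random() < rule.sampleRate;
    }
  }
  return Math.random() < config.default;
}
```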
## Metrics Implementation

### Application Metrics

```javascript
// Metric definitions
const metrics = {
  // Counters
  requests: new Counter('http_requests_total', ['method', 'path', 'status']),
  errors: new Counter('errors_total', ['type', 'operation']),
  // Histograms
  duration: new Histogram('request_duration_ms', ['method', 'path']),
  dbDuration: new Histogram('db_query_duration_ms', ['operation', 'table']),
  // Gauges
  connections: new Gauge('active_connections', ['type']),
  queueSize: new Gauge('queue_size', ['queue_name'])
};
```
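
One way these definitions might be wired into request handling, as a sketch: Express-style middleware is an assumption, as are the `.inc()`/`.observe()` method names on the wrapper classes above.

```javascript
// Express-style middleware recording RED metrics per request.
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const labels = { method: req.method, path: req.route?.path || req.path };
    metrics.requests.inc({ ...labels, status: String(res.statusCode) });
    metrics.duration.observe(labels, Date.now() - start);
    if (res.statusCode >= 500) {
      metrics.errors.inc({ type: 'http', operation: labels.path });
    }
  });
  next();
});
```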
### Dashboard Queries

```sql
-- Error rate by endpoint
SELECT
  endpoint,
  sum(errors) / sum(requests) AS error_rate
FROM metrics
WHERE time > now() - 1h
GROUP BY endpoint;

-- P95 latency
SELECT
  endpoint,
  percentile(duration, 0.95) AS p95_latency
FROM metrics
WHERE time > now() - 1h
GROUP BY endpoint;
```
## Tracing Implementation

### Trace Context

```javascript
// Trace context propagation
class TraceContext {
  constructor(traceId, spanId, parentSpanId) {
    this.traceId = traceId || generateId();
    this.spanId = spanId || generateId();
    this.parentSpanId = parentSpanId;
  }

  createChild() {
    return new TraceContext(this.traceId, generateId(), this.spanId);
  }
}

// Usage
middleware.use((req, res, next) => {
  req.trace = new TraceContext(
    req.headers['x-trace-id'],
    req.headers['x-span-id'],
    req.headers['x-parent-span-id']
  );
  next();
});
```
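
`generateId` is referenced but not defined above; a minimal version might be:

```javascript
const crypto = require('crypto');

// 16-hex-character random id for trace and span identifiers.
function generateId() {
  return crypto.randomBytes(8).toString('hex');
}
```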
## Performance Considerations

### Impact Analysis
| Instrumentation Type | CPU Impact | Memory Impact | I/O Impact |
|---|---|---|---|
| Structured Logging | < 1% | < 10MB | Async buffered |
| Metrics Collection | < 0.5% | < 5MB | Batched |
| Distributed Tracing | < 2% | < 20MB | Sampled |
### Optimization Techniques
- Use async logging with buffers
- Implement sampling for high-frequency paths
- Batch metric submissions
- Use conditional compilation for debug logs
- Implement circuit breakers for logging systems (sketched below)
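
A sketch of that last point, with illustrative thresholds: after repeated sink failures, logging is shed for a cool-down window rather than blocking request handling.

```javascript
// Circuit breaker for the logging pipeline.
class LoggingBreaker {
  constructor({ maxFailures = 5, coolDownMs = 30000 } = {}) {
    this.failures = 0;
    this.maxFailures = maxFailures;
    this.coolDownMs = coolDownMs;
    this.openUntil = 0;
  }

  async send(entry, sink) {
    if (Date.now() < this.openUntil) return; // open: drop logs, protect the request path
    try {
      await sink(entry);
      this.failures = 0; // success closes the failure streak
    } catch (err) {
      if (++this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.coolDownMs; // trip the breaker
        this.failures = 0;
      }
    }
  }
}
```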
## Implementation Plan

### Phase 1: Week 1
- Implement critical error logging
- Add service boundary instrumentation
- Set up basic metrics
### Phase 2: Week 2
- Add performance tracking
- Implement distributed tracing
- Create initial dashboards
### Phase 3: Week 3
- Add business event tracking
- Implement sampling strategies
- Performance optimization
Monitoring & Alerts
Critical Alerts
- name: high_error_rate
condition: error_rate > 0.01
severity: critical
- name: high_latency
condition: p95_latency > 1000ms
severity: warning
- name: service_down
condition: health_check_failures > 3
severity: critical
## Validation Checklist

- [ ] No sensitive data in logs
- [ ] Trace IDs properly propagated
- [ ] Sampling rates appropriate
- [ ] Performance impact acceptable
- [ ] Dashboards created
- [ ] Alerts configured
- [ ] Documentation updated
## Completion Criteria
- [ ] Critical paths identified
- [ ] Instrumentation points mapped
- [ ] Logging strategy defined
- [ ] Metrics designed
- [ ] Tracing plan created
- [ ] Performance impact assessed
- [ ] Implementation plan created
- [ ] Monitoring strategy defined