
# instrumentation-analysis

Design strategic logging and monitoring points for production debugging.

## Context

This task analyzes code to identify optimal locations for instrumentation (logging, metrics, tracing) that will aid in debugging production issues without impacting performance. It creates a comprehensive observability strategy.

## Task Execution

### Phase 1: Critical Path Analysis

#### Identify Key Flows

1. User-Facing Paths: Request → Response chains
2. Business-Critical Paths: Payment, authentication, data processing
3. Performance-Sensitive Paths: High-frequency operations
4. Error-Prone Paths: Historical problem areas
5. Integration Points: External service calls

#### Map Decision Points

- Conditional branches with business logic
- State transitions
- Error handling blocks
- Retry mechanisms
- Circuit breakers
- Cache hits/misses

### Phase 2: Instrumentation Strategy

#### Level 1: Essential Instrumentation

**Entry/Exit Points:**

- Service boundaries (API endpoints)
- Function entry/exit for critical operations
- Database transaction boundaries
- External service calls
- Message queue operations

**Error Conditions:**

- Exception catches
- Validation failures
- Timeout occurrences
- Retry attempts
- Fallback activations

**Performance Markers** (a combined Level 1 sketch follows this list):

- Operation start/end times
- Queue depths
- Resource utilization
- Batch sizes
- Cache effectiveness
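
A minimal sketch of what Level 1 coverage can look like around a single critical operation. The `logger` and `metrics` objects here are stand-ins for whatever structured logging and metrics clients the codebase already uses, not a prescribed API:

```javascript
// Stand-ins for the project's real structured logger and metrics client.
const logger = {
  info: (msg, meta) => console.log(JSON.stringify({ level: 'INFO', msg, ...meta })),
  error: (msg, meta) => console.error(JSON.stringify({ level: 'ERROR', msg, ...meta })),
};
const metrics = {
  histogram: (name, value, tags) => { /* forward to StatsD, Prometheus, etc. */ },
};

// Wrap one critical operation with entry/exit logging, error capture, and a duration metric.
async function instrumented(operation, fn, context = {}) {
  const startTime = Date.now();
  logger.info(`${operation} started`, context);
  try {
    const result = await fn();
    const duration = Date.now() - startTime;
    logger.info(`${operation} completed`, { ...context, duration_ms: duration });
    metrics.histogram('operation_duration_ms', duration, { operation, status: 'success' });
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    logger.error(`${operation} failed`, { ...context, duration_ms: duration, error: error.message });
    metrics.histogram('operation_duration_ms', duration, { operation, status: 'failure' });
    throw error; // never swallow the original failure
  }
}

// Usage at a service boundary (names are illustrative):
// await instrumented('user_update', () => db.users.update(userId, changes), { user_id: userId });
```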

#### Level 2: Diagnostic Instrumentation

**State Changes:**

- User state transitions
- Order/payment status changes
- Configuration updates
- Feature flag toggles
- Circuit breaker state changes

**Business Events** (a combined Level 2 sketch follows this list):

- User actions (login, purchase, etc.)
- System events (startup, shutdown)
- Scheduled job execution
- Data pipeline stages
- Workflow transitions
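
State changes and business events can share one structured-event pattern. A sketch, reusing the `logger` stand-in from the Level 1 example above (entity names are illustrative):

```javascript
// Record a state transition or business event as a single structured log entry.
function logStateTransition(entity, entityId, fromState, toState, meta = {}) {
  logger.info('state_transition', {
    entity,            // e.g. "order", "user", "circuit_breaker"
    entity_id: entityId,
    from: fromState,
    to: toState,
    ...meta,
  });
}

// logStateTransition('order', order.id, 'pending', 'paid', { payment_provider: 'stripe' });
```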

#### Level 3: Deep Debugging

**Detailed Tracing** (sketch below):

- Parameter values for complex functions
- Intermediate calculation results
- Loop iteration counts
- Branch decisions
- SQL query parameters
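
Level 3 detail is expensive, so the usual pattern is to guard it behind a level check so the payload is never even built unless debug logging is enabled. A minimal sketch (the environment variable name is an assumption):

```javascript
// Deep-debug tracing that costs nothing on the normal path: the payload builder
// only runs when debug logging is explicitly switched on.
const debugEnabled = process.env.LOG_LEVEL === 'debug';

function traceDetail(operation, buildPayload) {
  if (!debugEnabled) return; // skip before any expensive serialization happens
  console.debug(JSON.stringify({ level: 'DEBUG', operation, ...buildPayload() }));
}

// traceDetail('price_calculation', () => ({ item_count: items.length, subtotal, applied_discounts }));
```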

### Phase 3: Implementation Patterns

#### Structured Logging Format

**Standard Fields:**

```json
{
  "timestamp": "ISO-8601",
  "level": "INFO|WARN|ERROR",
  "service": "service-name",
  "trace_id": "correlation-id",
  "span_id": "operation-id",
  "user_id": "if-applicable",
  "operation": "what-is-happening",
  "duration_ms": "for-completed-ops",
  "status": "success|failure",
  "error": "error-details-if-any",
  "metadata": {
    "custom": "fields"
  }
}
```
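
A minimal emitter for entries in this shape, shown as a sketch that writes one JSON line per event to stdout for a log shipper to pick up; the service-name source and any fields beyond the standard set are assumptions:

```javascript
// Emit one structured log entry per line in the standard format above.
function logEvent(level, operation, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service: process.env.SERVICE_NAME || 'unknown-service',
    operation,
    ...fields, // trace_id, span_id, duration_ms, status, error, metadata, ...
  };
  process.stdout.write(JSON.stringify(entry) + '\n');
}

// logEvent('ERROR', 'user_update', { trace_id: req.traceId, status: 'failure', error: err.message });
```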

#### Performance-Conscious Patterns

**Sampling Strategy** (sketch below):

- 100% for errors
- 10% for normal operations
- 1% for high-frequency paths
- Dynamic adjustment based on load
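
A sketch of the per-event sampling decision, using the rates above as illustrative defaults; a real implementation would also allow these rates to be adjusted at runtime based on load:

```javascript
// Errors are always kept; everything else is sampled at a category-specific rate.
const SAMPLE_RATES = { default: 0.1, high_frequency: 0.01 };

function shouldLog(event) {
  if (event.level === 'ERROR') return true; // 100% of errors
  const rate = SAMPLE_RATES[event.category] ?? SAMPLE_RATES.default;
  return Math.random() < rate;
}

// if (shouldLog({ level: 'INFO', category: 'high_frequency' })) logEvent('INFO', 'cache_lookup', { key });
```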

**Async Logging** (sketch below):

- Buffer non-critical logs
- Batch write to reduce I/O
- Use separate thread/process
- Implement backpressure handling
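
One way these points can fit together, as a sketch: entries are buffered in memory and written in batches, and when the buffer fills, the oldest non-error entries are shed rather than blocking the request path. A production implementation would hand batches to a separate worker or transport instead of writing to stdout directly:

```javascript
// Buffered, batched log writer with simple load-shedding backpressure.
class AsyncLogBuffer {
  constructor({ maxSize = 10000, flushIntervalMs = 1000 } = {}) {
    this.buffer = [];
    this.maxSize = maxSize;
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
    this.timer.unref(); // don't keep the process alive just for logging
  }

  write(entry) {
    if (this.buffer.length >= this.maxSize) {
      // Backpressure: drop the oldest non-error entry instead of blocking callers.
      const dropIndex = this.buffer.findIndex((e) => e.level !== 'ERROR');
      this.buffer.splice(dropIndex === -1 ? 0 : dropIndex, 1);
    }
    this.buffer.push(entry);
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0, this.buffer.length);
    process.stdout.write(batch.map((e) => JSON.stringify(e)).join('\n') + '\n'); // one write per batch
  }
}
```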

**Conditional Logging** (sketch below):

- Debug level only in development
- Info level in staging
- Warn/Error in production
- Dynamic level adjustment via config
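
A sketch of level handling that follows this scheme and can change at runtime; the environment names and defaults mirror the list above, and the config-reload hook is an assumption:

```javascript
// Environment-driven default log level with a runtime override hook.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };
const DEFAULT_LEVEL = { development: 'DEBUG', staging: 'INFO', production: 'WARN' };

let currentLevel = DEFAULT_LEVEL[process.env.NODE_ENV] || 'INFO';

function setLogLevel(level) {
  if (LEVELS[level]) currentLevel = level; // call from a config-reload or admin endpoint
}

function isEnabled(level) {
  return LEVELS[level] >= LEVELS[currentLevel];
}

// if (isEnabled('DEBUG')) logEvent('DEBUG', 'cache_lookup', { key });
```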

### Phase 4: Metrics Design

#### Key Metrics to Track

**RED Metrics:**

- Rate: Requests per second
- Errors: Error rate/count
- Duration: Response time distribution

**USE Metrics:**

- Utilization: Resource usage percentage
- Saturation: Queue depth, wait time
- Errors: Resource allocation failures

**Business Metrics:**

- Transaction success rate
- Feature usage
- User journey completion
- Revenue impact

#### Metric Implementation

**Counter Examples:**

```
requests_total{method="GET", endpoint="/api/users", status="200"}
errors_total{type="database", operation="insert"}
```

**Histogram Examples:**

```
request_duration_seconds{method="GET", endpoint="/api/users"}
database_query_duration_ms{query_type="select", table="users"}
```

**Gauge Examples:**

```
active_connections{service="database"}
queue_depth{queue="email"}
```

### Phase 5: Tracing Strategy

#### Distributed Tracing Points

**Span Creation** (sketch below):

- HTTP request handling
- Database operations
- Cache operations
- External API calls
- Message publishing/consuming
- Background job execution
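
A sketch of the wrap-an-operation pattern for span creation. The `tracer` object is a stand-in for whatever tracing client the project uses (OpenTelemetry, a vendor SDK, etc.); the method names are illustrative, not a specific library's API:

```javascript
// Wrap an operation in a span, tagging failures and always closing the span.
async function withSpan(tracer, name, attributes, fn) {
  const span = tracer.startSpan(name, { attributes });
  try {
    return await fn(span);
  } catch (error) {
    span.setAttribute('error', true);
    span.setAttribute('error.message', error.message);
    throw error;
  } finally {
    span.end(); // close on both success and failure
  }
}

// await withSpan(tracer, 'db.users.update', { 'db.table': 'users' }, () => db.users.update(id, changes));
```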

**Context Propagation** (sketch below):

- HTTP headers (X-Trace-Id)
- Message metadata
- Database comments
- Log correlation
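
A sketch of propagating context to a downstream HTTP call using the header convention above (the current request's span becomes the downstream call's parent). It assumes the global `fetch` available in modern Node; adjust the header names to whatever your tracing system expects, e.g. W3C `traceparent`:

```javascript
// Forward the active trace context to a downstream service via HTTP headers.
async function callDownstream(url, trace, options = {}) {
  return fetch(url, {
    ...options,
    headers: {
      ...(options.headers || {}),
      'X-Trace-Id': trace.traceId,
      'X-Parent-Span-Id': trace.spanId, // downstream spans attach under the current span
    },
  });
}
```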

## Output Format

# Instrumentation Analysis Report

## Executive Summary

**Components Analyzed:** [count]
**Current Coverage:** [percentage]
**Recommended Additions:** [count]
**Performance Impact:** [minimal/low/moderate]
**Implementation Effort:** [hours/days]

## Critical Instrumentation Points

### Priority 1: Immediate Implementation

#### Service: [ServiceName]

**Entry Points:**

```[language]
// Location: [file:line]
// Current: No logging
// Recommended:
logger.info("Request received", {
  method: req.method,
  path: req.path,
  user_id: req.user?.id,
  trace_id: req.traceId
});
```

**Error Handling:**

```[language]
// Location: [file:line]
// Current: Silent failure
// Recommended:
logger.error("Database operation failed", {
  operation: "user_update",
  user_id: userId,
  error: err.message,
  stack: err.stack,
  retry_count: retries
});
```

**Performance Tracking:**

```[language]
// Location: [file:line]
// Recommended:
const startTime = Date.now();
try {
  const result = await expensiveOperation();
  metrics.histogram('operation_duration_ms', Date.now() - startTime, {
    operation: 'expensive_operation',
    status: 'success'
  });
  return result;
} catch (error) {
  metrics.histogram('operation_duration_ms', Date.now() - startTime, {
    operation: 'expensive_operation',
    status: 'failure'
  });
  throw error;
}
```

### Priority 2: Enhanced Observability

[Similar format for medium priority points]

### Priority 3: Deep Debugging

[Similar format for low priority points]

## Logging Strategy

### Log Levels by Environment

| Level | Development | Staging | Production |
| ----- | ----------- | ------- | ---------- |
| DEBUG | ✓           |         |            |
| INFO  | ✓           | ✓       | Sampled    |
| WARN  | ✓           | ✓       | ✓          |
| ERROR | ✓           | ✓       | ✓          |

### Sampling Configuration

```yaml
sampling:
  default: 0.01 # 1% sampling
  rules:
    - path: '/health'
      sample_rate: 0.001 # 0.1% for health checks
    - path: '/api/critical/*'
      sample_rate: 0.1 # 10% for critical APIs
    - level: 'ERROR'
      sample_rate: 1.0 # 100% for errors
```

## Metrics Implementation

### Application Metrics

```javascript
// Metric definitions
const metrics = {
  // Counters
  requests: new Counter('http_requests_total', ['method', 'path', 'status']),
  errors: new Counter('errors_total', ['type', 'operation']),

  // Histograms
  duration: new Histogram('request_duration_ms', ['method', 'path']),
  dbDuration: new Histogram('db_query_duration_ms', ['operation', 'table']),

  // Gauges
  connections: new Gauge('active_connections', ['type']),
  queueSize: new Gauge('queue_size', ['queue_name'])
};
```

### Dashboard Queries

```sql
-- Error rate by endpoint
SELECT
  endpoint,
  sum(errors) / sum(requests) as error_rate
FROM metrics
WHERE time > now() - 1h
GROUP BY endpoint

-- P95 latency
SELECT
  endpoint,
  percentile(duration, 0.95) as p95_latency
FROM metrics
WHERE time > now() - 1h
GROUP BY endpoint
```

## Tracing Implementation

### Trace Context

```javascript
// Trace context propagation
class TraceContext {
  constructor(traceId, spanId, parentSpanId) {
    this.traceId = traceId || generateId();
    this.spanId = spanId || generateId();
    this.parentSpanId = parentSpanId;
  }

  createChild() {
    return new TraceContext(this.traceId, generateId(), this.spanId);
  }
}

// Usage
middleware.use((req, res, next) => {
  req.trace = new TraceContext(
    req.headers['x-trace-id'],
    req.headers['x-span-id'],
    req.headers['x-parent-span-id']
  );
  next();
});
```

## Performance Considerations

### Impact Analysis

| Instrumentation Type | CPU Impact | Memory Impact | I/O Impact     |
| -------------------- | ---------- | ------------- | -------------- |
| Structured Logging   | < 1%       | < 10MB        | Async buffered |
| Metrics Collection   | < 0.5%     | < 5MB         | Batched        |
| Distributed Tracing  | < 2%       | < 20MB        | Sampled        |

### Optimization Techniques

  1. Use async logging with buffers
  2. Implement sampling for high-frequency paths
  3. Batch metric submissions
  4. Use conditional compilation for debug logs
  5. Implement circuit breakers for logging systems

## Implementation Plan

### Phase 1: Week 1

- Implement critical error logging
- Add service boundary instrumentation
- Set up basic metrics

### Phase 2: Week 2

- Add performance tracking
- Implement distributed tracing
- Create initial dashboards

### Phase 3: Week 3

- Add business event tracking
- Implement sampling strategies
- Performance optimization

## Monitoring & Alerts

### Critical Alerts

```yaml
- name: high_error_rate
  condition: error_rate > 0.01
  severity: critical

- name: high_latency
  condition: p95_latency > 1000ms
  severity: warning

- name: service_down
  condition: health_check_failures > 3
  severity: critical
```

## Validation Checklist

- [ ] No sensitive data in logs
- [ ] Trace IDs properly propagated
- [ ] Sampling rates appropriate
- [ ] Performance impact acceptable
- [ ] Dashboards created
- [ ] Alerts configured
- [ ] Documentation updated

## Completion Criteria
- [ ] Critical paths identified
- [ ] Instrumentation points mapped
- [ ] Logging strategy defined
- [ ] Metrics designed
- [ ] Tracing plan created
- [ ] Performance impact assessed
- [ ] Implementation plan created
- [ ] Monitoring strategy defined