Monitoring

Comprehensive monitoring setup for MegaVault, covering health checks, metrics collection, logging, and alerting to ensure system reliability and performance.

Monitoring Overview

Effective monitoring is crucial for maintaining MegaVault's reliability, performance, and user experience in production environments.

System Health

Infrastructure monitoring

  • ✅ Server resources
  • ✅ Application uptime
  • ✅ Database performance
  • ✅ Storage systems

Application Metrics

Business monitoring

  • ✅ User activity
  • ✅ File operations
  • ✅ API performance
  • ✅ Error rates

Alerting

Proactive notifications

  • ✅ Real-time alerts
  • ✅ Escalation policies
  • ✅ Multiple channels
  • ✅ On-call management

Monitoring Stack

MegaVault supports multiple monitoring solutions including Prometheus, Grafana, Sentry, and cloud-native monitoring services.
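
For error tracking specifically, a minimal Sentry setup is often enough to start with. The snippet below is a sketch, assuming the @sentry/nextjs SDK and a SENTRY_DSN environment variable (neither is defined elsewhere in this guide):

Sentry Setup (Example)
// sentry.server.config.ts
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.SENTRY_DSN,        // leave unset to disable reporting locally
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1               // sample 10% of transactions for performance data
});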

Health Checks

Implement comprehensive health checks to monitor system components.

Application Health Endpoint

Health Check API
// /api/health/route.ts
export async function GET() {
  // Each entry is an app-specific async probe that should reject on failure
  // (a sample probe is sketched after this block).
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkStorage(),
    checkRedis(),
    checkExternalServices()
  ]);

  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    checks: {
      database: checks[0].status === 'fulfilled' ? 'healthy' : 'unhealthy',
      storage: checks[1].status === 'fulfilled' ? 'healthy' : 'unhealthy',
      redis: checks[2].status === 'fulfilled' ? 'healthy' : 'unhealthy',
      external: checks[3].status === 'fulfilled' ? 'healthy' : 'unhealthy'
    },
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    version: process.env.APP_VERSION
  };

  const isHealthy = Object.values(health.checks).every(status => status === 'healthy');
  health.status = isHealthy ? 'healthy' : 'unhealthy';

  return Response.json(health, {
    status: isHealthy ? 200 : 503
  });
}
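
The probe helpers referenced above (checkDatabase, checkStorage, checkRedis, checkExternalServices) are application-specific and not shown here. A minimal sketch of one such probe, assuming ioredis is used for the Redis connection:

Sample Probe (Example)
// health-checks.ts
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Rejects (marking the check unhealthy) if Redis does not answer PING.
export async function checkRedis(): Promise<void> {
  const pong = await redis.ping();
  if (pong !== 'PONG') {
    throw new Error(`Unexpected PING reply: ${pong}`);
  }
}

The other probes follow the same pattern: perform a cheap operation against the dependency and throw on failure so Promise.allSettled reports the check as rejected.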

External Monitoring

Health Check Script
#!/bin/bash
# external-health-check.sh

ENDPOINT="https://your-domain.com/api/health"
SLACK_WEBHOOK="https://hooks.slack.com/your-webhook"

# Perform health check
if curl -f -s "$ENDPOINT" > /dev/null; then
    echo "✓ Health check passed"
    exit 0
else
    echo "✗ Health check failed"
    
    # Send alert to Slack
    curl -X POST "$SLACK_WEBHOOK" \
        -H 'Content-type: application/json' \
        --data '{"text":"🚨 MegaVault health check failed"}'
    
    exit 1
fi

Docker Health Checks

Dockerfile Health Check
# Add to Dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:3000/api/health || exit 1

Metrics Collection

Collect and analyze key performance metrics for system optimization.

Application Metrics

Metrics Collection
// metrics.ts
import { Counter, Histogram, Gauge } from 'prom-client';

export const metrics = {
  httpRequests: new Counter({
    name: 'http_requests_total',
    help: 'Total HTTP requests',
    labelNames: ['method', 'route', 'status']
  }),
  
  requestDuration: new Histogram({
    name: 'http_request_duration_seconds',
    help: 'HTTP request duration',
    labelNames: ['method', 'route'],
    buckets: [0.1, 0.5, 1, 2, 5]
  }),
  
  activeUsers: new Gauge({
    name: 'active_users_total',
    help: 'Number of active users'
  }),
  
  fileUploads: new Counter({
    name: 'file_uploads_total',
    help: 'Total file uploads',
    labelNames: ['status', 'file_type']
  }),
  
  storageUsed: new Gauge({
    name: 'storage_used_bytes',
    help: 'Storage space used in bytes'
  })
};
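
Prometheus needs an HTTP endpoint to scrape these metrics from. A minimal sketch of such an endpoint, assuming a Next.js App Router route at /api/metrics (the path is an illustrative choice) and the default prom-client registry:

Metrics Endpoint (Example)
// /api/metrics/route.ts
import { register } from 'prom-client';

export async function GET() {
  // Serialize everything registered on the default prom-client registry.
  const body = await register.metrics();
  return new Response(body, {
    headers: { 'Content-Type': register.contentType }
  });
}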

System Metrics

  • CPU Usage: Track CPU utilization and load averages
  • Memory Usage: Monitor RAM usage and garbage collection
  • Disk I/O: Track disk read/write operations and space usage
  • Network: Monitor network throughput and connection counts
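
Process-level CPU, memory, event-loop, and GC metrics can be collected automatically by prom-client; host-level disk and network figures usually come from an agent such as node_exporter instead. A minimal sketch (the prefix is an illustrative choice):

Default Metrics (Example)
// default-metrics.ts
import { collectDefaultMetrics } from 'prom-client';

// Registers process CPU, heap/RSS memory, event-loop lag, and GC duration
// metrics on the default registry.
collectDefaultMetrics({ prefix: 'megavault_' });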

Redis Metrics

Redis Metrics Collection
# Redis metrics to monitor
redis-cli info stats | grep -E "(instantaneous_ops_per_sec|used_memory|connected_clients|total_commands_processed)"

# Example output:
# instantaneous_ops_per_sec:125
# used_memory:2048576
# connected_clients:10
# total_commands_processed:1000000
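
In production the standard Prometheus redis_exporter is the usual way to collect these; for lighter setups, a small poller can surface a few values as gauges. A sketch, assuming ioredis (metric and function names are illustrative):

Redis Gauge Poller (Example)
// redis-metrics.ts
import Redis from 'ioredis';
import { Gauge } from 'prom-client';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

const opsPerSec = new Gauge({ name: 'redis_instantaneous_ops_per_sec', help: 'Redis operations per second' });
const connectedClients = new Gauge({ name: 'redis_connected_clients', help: 'Connected Redis clients' });

export function startRedisMetricsPolling(intervalMs = 15000) {
  setInterval(async () => {
    const info = await redis.info();  // raw INFO output as one string
    const read = (key: string) => Number(info.match(new RegExp(`${key}:(\\d+)`))?.[1] ?? 0);
    opsPerSec.set(read('instantaneous_ops_per_sec'));
    connectedClients.set(read('connected_clients'));
  }, intervalMs);
}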

Logging Strategy

Implement structured logging for effective debugging and monitoring.

Logging Configuration

Logger Setup
// logger.ts
import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'megavault',
    version: process.env.APP_VERSION
  },
  transports: [
    new winston.transports.File({ 
      filename: 'logs/error.log', 
      level: 'error' 
    }),
    new winston.transports.File({ 
      filename: 'logs/combined.log' 
    })
  ]
});

if (process.env.NODE_ENV !== 'production') {
  logger.add(new winston.transports.Console({
    format: winston.format.simple()
  }));
}

export default logger;

Log Levels and Categories

  • Error: Application errors, exceptions, failures
  • Warn: Unusual conditions, deprecated features
  • Info: General application flow, major events
  • Debug: Detailed debugging information
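
A few example calls against the logger defined above show how structured metadata travels with each level (the event and field names here are purely illustrative):

Structured Logging (Example)
// Anywhere in the application
import logger from './logger';

logger.info('file.upload.completed', { userId: 'u_123', fileId: 'f_456', sizeBytes: 1048576 });
logger.warn('storage.quota.near_limit', { userId: 'u_123', usedPct: 92 });
logger.error('file.upload.failed', { userId: 'u_123', err: new Error('upstream timeout') });
logger.debug('redis.command', { command: 'HGETALL', durationMs: 3 });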

Log Aggregation

Log Shipping with Filebeat
# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/megavault/*.log
  fields:
    service: megavault
    environment: production

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "megavault-%{+yyyy.MM.dd}"

logging.level: info

Alerting Setup

Configure alerts for critical issues and performance degradation.

Alert Rules

Prometheus Alert Rules
groups:
- name: megavault-alerts
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      
  - alert: HighResponseTime
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High response time detected"
      
  - alert: RedisDown
    expr: up{job="redis"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis is down"

Notification Channels

  • Slack: Team notifications and collaboration
  • Email: Critical alerts and summaries
  • PagerDuty: On-call escalation management
  • SMS: Emergency notifications

Alert Manager Configuration

AlertManager Config
global:
  slack_api_url: 'https://hooks.slack.com/your-webhook'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
- name: 'web.hook'
  slack_configs:
  - channel: '#alerts'
    title: 'MegaVault Alert'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Monitoring Dashboards

Create comprehensive dashboards for system visibility.

Key Dashboard Panels

  • System Overview: CPU, memory, disk, network
  • Application Metrics: Response times, error rates, throughput
  • User Activity: Active users, file uploads, API usage
  • Storage Metrics: Storage usage, file counts, transfer rates
  • Database Performance: Redis metrics, connection pools

Grafana Dashboard JSON

Sample Dashboard Panel
{
  "dashboard": {
    "title": "MegaVault Monitoring",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "Requests/sec"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "singlestat",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~"5.."}[5m])",
            "legendFormat": "Error Rate"
          }
        ]
      }
    ]
  }
}

Custom Metrics Dashboard

  • File Operations: Upload/download rates and success rates
  • User Engagement: Daily/monthly active users
  • Storage Growth: Storage usage trends and forecasting
  • Performance: API response times and database query performance
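
Panels like these rely on the custom counters and gauges defined earlier being updated from the relevant code paths. A sketch of how an upload handler might wire them up (handleUpload and its argument shape are illustrative):

Recording File Operation Metrics (Example)
// upload-handler.ts
import { metrics } from './metrics';

export async function handleUpload(file: { type: string; size: number }) {
  try {
    // ... perform the actual upload (application-specific, not shown) ...
    metrics.fileUploads.inc({ status: 'success', file_type: file.type });
    metrics.storageUsed.inc(file.size); // or set() the absolute value from a periodic storage scan
  } catch (err) {
    metrics.fileUploads.inc({ status: 'error', file_type: file.type });
    throw err;
  }
}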

Troubleshooting

Common monitoring issues and debugging strategies.

Common Issues

Missing Metrics

Issue: Metrics not appearing in dashboards

Solutions:

  • Check metrics endpoint accessibility
  • Verify Prometheus scrape configuration
  • Validate metric names and labels
  • Check network connectivity

Alert Fatigue

Issue: Too many false positive alerts

Solutions:

  • Adjust alert thresholds
  • Implement alert grouping
  • Add meaningful context
  • Review alert rules regularly

Debugging Commands

Monitoring Debug Commands
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Query specific metric
curl "http://localhost:9090/api/v1/query?query=up"

# Check AlertManager status
curl http://localhost:9093/api/v2/status

# Test health endpoint
curl -v https://your-domain.com/api/health

# Check log files
tail -f /var/log/megavault/error.log

Production Monitoring

In production, ensure monitoring systems are highly available and independent of the application they monitor to avoid single points of failure.