Monitoring
Comprehensive monitoring setup for MegaVault, including health checks, metrics collection, logging, and alerting, to ensure system reliability and performance.
Monitoring Overview
Effective monitoring is crucial for maintaining MegaVault's reliability, performance, and user experience in production environments.
System Health
Infrastructure monitoring
- ✅ Server resources
- ✅ Application uptime
- ✅ Database performance
- ✅ Storage systems
Application Metrics
Business monitoring
- ✅ User activity
- ✅ File operations
- ✅ API performance
- ✅ Error rates
Alerting
Proactive notifications
- ✅ Real-time alerts
- ✅ Escalation policies
- ✅ Multiple channels
- ✅ On-call management
Monitoring Stack
Health Checks
Implement comprehensive health checks to monitor system components.
Application Health Endpoint
// /api/health/route.ts
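// The individual check helpers (checkDatabase, checkStorage, checkRedis,
// checkExternalServices) are referenced below but not shown here.
// A minimal sketch of one of them, assuming an ioredis client exported from
// a hypothetical '@/lib/redis' module:
import { redis } from '@/lib/redis';

async function checkRedis() {
  // PING is a cheap liveness probe; ioredis rejects if Redis is unreachable.
  const pong = await redis.ping();
  if (pong !== 'PONG') throw new Error('Redis ping failed');
}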
export async function GET() {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkStorage(),
    checkRedis(),
    checkExternalServices()
  ]);

  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    checks: {
      database: checks[0].status === 'fulfilled' ? 'healthy' : 'unhealthy',
      storage: checks[1].status === 'fulfilled' ? 'healthy' : 'unhealthy',
      redis: checks[2].status === 'fulfilled' ? 'healthy' : 'unhealthy',
      external: checks[3].status === 'fulfilled' ? 'healthy' : 'unhealthy'
    },
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    version: process.env.APP_VERSION
  };

  const isHealthy = Object.values(health.checks).every(status => status === 'healthy');
  health.status = isHealthy ? 'healthy' : 'unhealthy';

  return Response.json(health, {
    status: isHealthy ? 200 : 503
  });
}

External Monitoring
#!/bin/bash
# external-health-check.sh
ENDPOINT="https://your-domain.com/api/health"
SLACK_WEBHOOK="https://hooks.slack.com/your-webhook"
# Perform health check
if curl -f -s "$ENDPOINT" > /dev/null; then
  echo "✓ Health check passed"
  exit 0
else
  echo "✗ Health check failed"
  # Send alert to Slack
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    --data '{"text":"🚨 MegaVault health check failed"}'
  exit 1
fi

Docker Health Checks
# Add to Dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:3000/api/health || exit 1

Metrics Collection
Collect and analyze key performance metrics for system optimization.
Application Metrics
// metrics.ts
import { Counter, Histogram, Gauge } from 'prom-client';
export const metrics = {
  httpRequests: new Counter({
    name: 'http_requests_total',
    help: 'Total HTTP requests',
    labelNames: ['method', 'route', 'status']
  }),
  requestDuration: new Histogram({
    name: 'http_request_duration_seconds',
    help: 'HTTP request duration',
    labelNames: ['method', 'route'],
    buckets: [0.1, 0.5, 1, 2, 5]
  }),
  activeUsers: new Gauge({
    name: 'active_users_total',
    help: 'Number of active users'
  }),
  fileUploads: new Counter({
    name: 'file_uploads_total',
    help: 'Total file uploads',
    labelNames: ['status', 'file_type']
  }),
  storageUsed: new Gauge({
    name: 'storage_used_bytes',
    help: 'Storage space used in bytes'
  })
};

System Metrics
- CPU Usage: Track CPU utilization and load averages
- Memory Usage: Monitor RAM usage and garbage collection
- Disk I/O: Track disk read/write operations and space usage
- Network: Monitor network throughput and connection counts
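Process-level metrics such as CPU, memory, and garbage-collection statistics can be collected automatically by prom-client and exposed for Prometheus to scrape. A minimal sketch, using prom-client's default registry and a hypothetical /api/metrics route:

// /api/metrics/route.ts (hypothetical path)
import { collectDefaultMetrics, register } from 'prom-client';

// Registers default Node.js process metrics (CPU, memory, GC, event loop lag)
// on the global registry; call this once at startup.
collectDefaultMetrics();

export async function GET() {
  // Serialize all registered metrics in the Prometheus text exposition format.
  const body = await register.metrics();
  return new Response(body, {
    headers: { 'Content-Type': register.contentType }
  });
}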
Redis Metrics
# Redis metrics to monitor
redis-cli info stats | grep -E "(instantaneous_ops_per_sec|used_memory|connected_clients|total_commands_processed)"
# Example output:
# instantaneous_ops_per_sec:125
# used_memory:2048576
# connected_clients:10
# total_commands_processed:1000000

Logging Strategy
Implement structured logging for effective debugging and monitoring.
Logging Configuration
// logger.ts
import winston from 'winston';
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'megavault',
    version: process.env.APP_VERSION
  },
  transports: [
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/combined.log'
    })
  ]
});

if (process.env.NODE_ENV !== 'production') {
  logger.add(new winston.transports.Console({
    format: winston.format.simple()
  }));
}

export default logger;

Log Levels and Categories
- Error: Application errors, exceptions, failures
- Warn: Unusual conditions, deprecated features
- Info: General application flow, major events
- Debug: Detailed debugging information
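A brief usage sketch with the logger defined above; the metadata field names and values are illustrative:

import logger from './logger';

// Structured metadata is emitted as JSON fields, which makes the logs easy to
// filter and aggregate downstream (for example in Elasticsearch).
logger.info('file uploaded', { userId: 'user-123', fileId: 'abc', sizeBytes: 1048576 });
logger.warn('storage quota above 90%', { userId: 'user-123' });

try {
  throw new Error('upload failed');
} catch (err) {
  // format.errors({ stack: true }) in logger.ts captures the stack trace when
  // an Error object is logged directly.
  logger.error(err as Error);
}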
Log Aggregation
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/megavault/*.log
    fields:
      service: megavault
      environment: production

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "megavault-%{+yyyy.MM.dd}"

logging.level: info

Alerting Setup
Configure alerts for critical issues and performance degradation.
Alert Rules
groups:
  - name: megavault-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"

      - alert: RedisDown
        expr: up{job="redis"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is down"

Notification Channels
- Slack: Team notifications and collaboration
- Email: Critical alerts and summaries
- PagerDuty: On-call escalation management
- SMS: Emergency notifications
Alert Manager Configuration
global:
  slack_api_url: 'https://hooks.slack.com/your-webhook'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    slack_configs:
      - channel: '#alerts'
        title: 'MegaVault Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Monitoring Dashboards
Create comprehensive dashboards for system visibility.
Key Dashboard Panels
- System Overview: CPU, memory, disk, network
- Application Metrics: Response times, error rates, throughput
- User Activity: Active users, file uploads, API usage
- Storage Metrics: Storage usage, file counts, transfer rates
- Database Performance: Redis metrics, connection pools
Grafana Dashboard JSON
{
  "dashboard": {
    "title": "MegaVault Monitoring",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "Requests/sec"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "singlestat",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "Error Rate"
          }
        ]
      }
    ]
  }
}

Custom Metrics Dashboard
- File Operations: Upload/download rates and success rates
- User Engagement: Daily/monthly active users
- Storage Growth: Storage usage trends and forecasting
- Performance: API response times and database query performance
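These panels are only as good as the instrumentation feeding them. A minimal sketch of how an upload handler might record the metrics defined earlier in metrics.ts; the handler name and file shape are illustrative:

import { metrics } from './metrics';

// Illustrative upload handler instrumentation: record request counts, duration,
// outcome, and storage growth so the dashboard panels above have data to plot.
export async function handleUpload(file: { type: string; size: number }) {
  const endTimer = metrics.requestDuration.startTimer({ method: 'POST', route: '/api/upload' });
  try {
    // ... perform the actual upload (assumed to exist elsewhere) ...
    metrics.httpRequests.inc({ method: 'POST', route: '/api/upload', status: '200' });
    metrics.fileUploads.inc({ status: 'success', file_type: file.type });
    metrics.storageUsed.inc(file.size);
  } catch (err) {
    metrics.httpRequests.inc({ method: 'POST', route: '/api/upload', status: '500' });
    metrics.fileUploads.inc({ status: 'error', file_type: file.type });
    throw err;
  } finally {
    endTimer(); // observes elapsed seconds into the histogram
  }
}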
Troubleshooting
Common monitoring issues and debugging strategies.
Common Issues
Missing Metrics
Issue: Metrics not appearing in dashboards
Solutions:
- Check metrics endpoint accessibility
- Verify Prometheus scrape configuration
- Validate metric names and labels
- Check network connectivity
Alert Fatigue
Issue: Too many false positive alerts
Solutions:
- Adjust alert thresholds
- Implement alert grouping
- Add meaningful context
- Review alert rules regularly
Debugging Commands
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Query specific metric
curl "http://localhost:9090/api/v1/query?query=up"
# Check AlertManager status
curl http://localhost:9093/api/v1/status
# Test health endpoint
curl -v https://your-domain.com/api/health
# Check log files
tail -f /var/log/megavault/error.log