# Observability Stack

## Overview
The application includes a complete observability stack with Prometheus for metrics collection, Grafana for visualization, and OpenTelemetry (OTEL) for distributed tracing. This enables monitoring, alerting, and troubleshooting of your applications.
## What is Observability?
Observability answers three core questions:
- Metrics: "What is happening?" (CPU, memory, request counts)
- Logs: "Why is it happening?" (Error messages, application events)
- Traces: "How did we get here?" (Request flow across services)
## Stack Components

### Prometheus
Prometheus is a time-series database that scrapes metrics from applications at regular intervals.
Features:
- Pull-based metrics collection
- Time-series data storage
- PromQL query language
- Alerting rules
- 15-day default retention
### Grafana
Grafana provides beautiful dashboards and alerts on top of Prometheus data.
Features:
- Interactive dashboards
- Multi-source support (Prometheus, Loki, etc.)
- Alert notifications (email, Slack, PagerDuty)
- User management and roles
- Dashboard versioning
### OpenTelemetry (OTEL)
OpenTelemetry collects traces, metrics, and logs from your application with minimal code changes.
Benefits:
- Vendor-neutral instrumentation standard
- Distributed tracing across services
- Automatic context propagation
- Works with any backend (Jaeger, Tempo, etc.)
## Starting the Observability Stack

### Start Everything

```bash
# Start main stack + observability
make up-lgtm

# or directly with Docker Compose
docker-compose -f docker-compose.yaml \
  -f docker-compose.observability.yaml up -d
```
### Access Points

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000/grafana (username: admin, password: admin)
- OTEL Collector: localhost:4317 (gRPC)
## Configuration Files

### infra/observability/prometheus.yml

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:3001']
    metrics_path: '/metrics'

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
```
### infra/observability/otelcol-config.yaml

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: '0.0.0.0:8888'
  jaeger:
    endpoint: localhost:14250

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
## Instrumenting Your Application

### NestJS API Instrumentation

Install the OTEL dependencies:

```bash
pnpm add --save-exact \
  @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-prometheus \
  @opentelemetry/exporter-jaeger \
  @opentelemetry/sdk-metrics \
  @opentelemetry/sdk-trace-node
```
Configure in apps/api/src/main.ts:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()],
  traceExporter: new JaegerExporter({
    endpoint: 'http://localhost:14268/api/traces',
  }),
  metricReader: new PrometheusExporter({
    port: 8888,
  }),
});

// Start the SDK before the NestJS app bootstraps so auto-instrumentation
// can patch modules as they are loaded.
sdk.start();

// Graceful shutdown: flush pending telemetry before exiting
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => {
    process.exit(0);
  });
});
```
### Custom Metrics

Define custom metrics in your services:

```typescript
import { Controller, Get } from '@nestjs/common';
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('qms-api');

// Counter: monotonically increasing total
const requestCounter = meter.createCounter('http_requests_total', {
  description: 'Total HTTP requests',
});

// Histogram: distribution of request durations
const requestDuration = meter.createHistogram('http_request_duration_seconds', {
  description: 'HTTP request duration in seconds',
});

@Controller('employees')
export class EmployeeController {
  @Get()
  getEmployees() {
    const start = performance.now();
    const employees = []; // ... fetch employees ...
    const duration = (performance.now() - start) / 1000;
    requestDuration.record(duration);
    requestCounter.add(1);
    return employees;
  }
}
```
### Custom Spans (Tracing)

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('qms-api');

export class UserService {
  async createUser(data: CreateUserDto) {
    const span = tracer.startSpan('user.create');
    try {
      span.setAttributes({
        'user.email': data.email,
        'user.department': data.department,
      });
      const user = await this.prisma.user.create({ data });
      span.setStatus({ code: SpanStatusCode.OK });
      return user;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  }
}
```
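The try/catch/finally pattern above repeats for every traced operation, so it is common to factor it into a small wrapper. This is a hypothetical helper, not part of the OTEL API; `SpanLike` models only the span calls used in the snippet above:

```typescript
// Minimal stand-in for the OTEL Span interface (only the calls we use)
interface SpanLike {
  recordException(e: unknown): void;
  setStatus(s: { code: number }): void;
  end(): void;
}

// SpanStatusCode.OK === 1, SpanStatusCode.ERROR === 2 in @opentelemetry/api
const OK = 1;
const ERROR = 2;

// Run async work inside a span, recording errors and always ending it.
async function withSpan<T>(span: SpanLike, fn: () => Promise<T>): Promise<T> {
  try {
    const result = await fn();
    span.setStatus({ code: OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: ERROR });
    throw error;
  } finally {
    span.end();
  }
}
```

With a helper like this, `createUser` shrinks to one `withSpan(tracer.startSpan('user.create'), ...)` call per operation.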
## Prometheus Queries

### Common PromQL Queries

```promql
# Request rate (requests per second)
rate(http_requests_total[5m])

# Request duration, 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# Average API response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Memory usage
process_resident_memory_bytes

# Database connection pool utilization
db_connection_pool_utilization

# CPU usage
rate(process_cpu_seconds_total[5m]) * 100
```
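`histogram_quantile` does not read exact latencies — it interpolates within the cumulative buckets the histogram exports. A simplified sketch of that arithmetic (ignoring the `+Inf` bucket and the `rate()` step):

```typescript
// Cumulative bucket: `count` observations were <= `le` seconds
type Bucket = { le: number; count: number };

// Simplified version of Prometheus' histogram_quantile: find the
// bucket containing the target rank, then interpolate linearly.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      const width = b.le - prevLe;
      const inBucket = b.count - prevCount;
      return prevLe + width * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}

// 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

This is why bucket boundaries matter: the p95 here is an estimate between 0.5s and 1s, and coarser buckets give coarser estimates.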
## Grafana Dashboards

### Creating Dashboards

1. Access Grafana: http://localhost:3000/grafana
2. Log in: admin / admin
3. Create → Dashboard
4. Add Panel
5. Select Prometheus as the data source
6. Write a PromQL query
### Example Dashboard JSON

```json
{
  "dashboard": {
    "title": "QMS API Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(http_requests_total[5m])" }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }
        ],
        "type": "stat"
      }
    ]
  }
}
```
## Alerting

### Define Alerts in Prometheus

Edit prometheus.yml:

```yaml
rule_files:
  - /etc/prometheus/alert.rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

Create alert.rules.yml:

```yaml
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: 'High error rate detected'
          description: 'Error rate is {{ $value | humanizePercentage }}'

      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m
        annotations:
          summary: 'API is down'
```
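The `for:` clause keeps an alert in a pending state until its expression has been true continuously for the hold duration, which filters out short spikes. The state machine is roughly (a sketch, not Prometheus source):

```typescript
type AlertPhase = 'inactive' | 'pending' | 'firing';

// trueSince: timestamp (seconds) when the alert expression first
// became true, or null if it is currently false.
function alertPhase(trueSince: number | null, now: number, holdSeconds: number): AlertPhase {
  if (trueSince === null) return 'inactive';
  return now - trueSince >= holdSeconds ? 'firing' : 'pending';
}

// HighErrorRate has `for: 5m` (300 s):
console.log(alertPhase(null, 1000, 300)); // "inactive" (expression false)
console.log(alertPhase(900, 1000, 300));  // "pending"  (true for only 100 s)
console.log(alertPhase(600, 1000, 300));  // "firing"   (true for 400 s)
```

Any dip back below the threshold resets the timer, so a flapping error rate never reaches `firing`.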
### Grafana Alert Notifications

1. Configuration → Notification channels
2. Add a channel (Slack, Email, PagerDuty, etc.)
3. Use it in alert rules
## Performance Monitoring

### Key Metrics to Monitor
| Metric | Normal Range | Alert Threshold |
|---|---|---|
| Request Latency (p95) | < 200ms | > 500ms |
| Error Rate | < 0.1% | > 1% |
| CPU Usage | < 50% | > 80% |
| Memory Usage | < 60% | > 85% |
| Disk Usage | < 70% | > 90% |
| Database Connections | < 80% | > 95% |
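The table's two columns translate naturally into a three-way severity: values inside the normal range are healthy, values past the alert threshold page someone, and the band in between is a warning. A hypothetical helper making that explicit:

```typescript
type Severity = 'ok' | 'warn' | 'alert';

// Classify a measured value against a normal ceiling and an alert floor
// (both from the table above; units must match the metric's).
function classify(value: number, normalMax: number, alertMin: number): Severity {
  if (value > alertMin) return 'alert';
  if (value > normalMax) return 'warn';
  return 'ok';
}

// p95 request latency in ms: normal < 200, alert > 500
console.log(classify(150, 200, 500)); // "ok"
console.log(classify(350, 200, 500)); // "warn"
console.log(classify(650, 200, 500)); // "alert"
```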
### Example: Track API Performance
Dashboard panels:
1. Request Volume (5m rate)
2. Response Time (p50, p95, p99)
3. Error Rate
4. Database Query Duration
5. Memory Usage
## Troubleshooting

### Prometheus Scrape Issues

```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Verify the metrics endpoint
curl http://localhost:3001/metrics
```
### No Data in Grafana

Check the data source connection: Grafana → Configuration → Data Sources → Prometheus → Test.

Verify that metrics are being collected:

```bash
docker-compose logs prometheus | grep "scrape"
```
### High Memory Usage

```bash
# Check Prometheus storage
du -sh /prometheus
```

Retention is not set in prometheus.yml; reduce it with a server flag on the Prometheus command line (e.g., in the Docker Compose service command):

```bash
--storage.tsdb.retention.time=7d   # default is 15d
```
### Missing Traces

```bash
# Check OTEL collector logs
docker-compose logs otel-collector

# Verify OTEL endpoint reachability
telnet localhost 4317

# Check application logs for OTEL errors
docker-compose logs api | grep -i otel
```
## Best Practices

### 1. Use Meaningful Metric Names

```typescript
// GOOD: clear and specific, with the unit in the name
const orderProcessingDuration = meter.createHistogram('order_processing_duration_seconds');

// AVOID: too generic
const duration = meter.createHistogram('duration');
```
### 2. Add Context with Attributes

```typescript
span.setAttributes({
  'user.id': userId,
  'order.id': orderId,
  'operation.type': 'checkout',
});
```
### 3. Monitor Business Metrics

```typescript
// Application-level metrics
const ordersCreated = meter.createCounter('orders_created_total');
const checkoutValue = meter.createHistogram('checkout_value');

ordersCreated.add(1);
checkoutValue.record(orderTotal);
```
### 4. Set Appropriate Alert Thresholds
- Monitor trends over time before setting alerts
- Avoid alert fatigue with realistic thresholds
- Include runbooks in alert descriptions
### 5. Keep Retention Manageable

Balance data retention against storage capacity. Retention is configured via a Prometheus server flag, not in prometheus.yml:

```bash
--storage.tsdb.retention.time=15d   # adjust based on storage capacity
```
## Data Export

### Export Metrics

```bash
# Query and export as JSON
curl 'http://localhost:9090/api/v1/query?query=up' > metrics.json

# Backup Prometheus data
docker cp container_id:/prometheus ./prometheus-backup
```
### Export Grafana Dashboards

Export a dashboard as JSON via Dashboard Menu → Share → Export JSON, then keep it under version control:

```bash
git add dashboards/*.json
git commit -m "Update Grafana dashboards"
```
## Advanced Topics
- Distributed Tracing with Jaeger
- Alert Management with AlertManager
- OpenTelemetry Best Practices
- Custom Exporters
## Quick Reference

```bash
# Start observability stack
make up-lgtm

# Restart services
docker-compose restart prometheus grafana

# View logs
docker-compose logs -f otel-collector
```

- View Prometheus targets: http://localhost:9090/targets
- Query metrics: http://localhost:9090/graph
- Access Grafana: http://localhost:3000/grafana