Observability Stack

Overview​

The application includes a complete observability stack with Prometheus for metrics collection, Grafana for visualization, and OpenTelemetry (OTEL) for distributed tracing. This enables monitoring, alerting, and troubleshooting of your applications.

What is Observability?​

Observability answers three core questions:

  1. Metrics: "What is happening?" (CPU, memory, request counts)
  2. Logs: "Why is it happening?" (Error messages, application events)
  3. Traces: "How did we get here?" (Request flow across services)

Stack Components​

Prometheus​

Prometheus is a time-series database that scrapes metrics from applications at regular intervals.

Features:

  • Pull-based metrics collection
  • Time-series data storage
  • PromQL query language
  • Alerting rules
  • 15-day default retention

Grafana​

Grafana provides dashboards and alerting on top of Prometheus data.

Features:

  • Interactive dashboards
  • Multi-source support (Prometheus, Loki, etc.)
  • Alert notifications (email, Slack, PagerDuty)
  • User management and roles
  • Dashboard versioning

OpenTelemetry (OTEL)​

OpenTelemetry collects traces, metrics, and logs from your application with minimal code changes.

Benefits:

  • Vendor-neutral instrumentation standard
  • Distributed tracing across services
  • Automatic context propagation
  • Works with any backend (Jaeger, Tempo, etc.)
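Automatic context propagation works by passing a W3C `traceparent` HTTP header between services. A minimal sketch of parsing that header (the helper name is ours, not part of any OTEL package):

```typescript
// Hypothetical helper: parse a W3C `traceparent` header, the mechanism OTEL
// uses to propagate trace context between services.
// Layout: version "00" - 32-hex trace id - 16-hex span id - 2-hex flags.
interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = header.match(/^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/);
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  // An all-zero trace or span id is invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

const ctx = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
```

In practice the auto-instrumentations inject and extract this header for you; the sketch only shows what travels on the wire.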

Starting the Observability Stack​

Start Everything​

# Start main stack + observability
make up-lgtm

# or directly with Docker Compose
docker-compose -f docker-compose.yaml \
  -f docker-compose.observability.yaml up -d

Access Points​

Prometheus: http://localhost:9090
Grafana: http://localhost:3000/grafana (username: admin, password: admin)
OTEL Collector: localhost:4317 (gRPC)

Configuration Files​

infra/observability/prometheus.yml​

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:3001']
    metrics_path: '/metrics'

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
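The `/metrics` endpoint that Prometheus scrapes must serve the text exposition format. A minimal sketch of rendering one counter in that format (the helper and sample values are illustrative; real apps get this from a client library or the OTEL Prometheus exporter):

```typescript
// Render one counter in the Prometheus text exposition format served by
// the scraped /metrics endpoint: # HELP / # TYPE lines, then samples.
function renderCounter(
  name: string,
  help: string,
  samples: { labels: Record<string, string>; value: number }[],
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelStr = Object.entries(labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(',');
    lines.push(labelStr ? `${name}{${labelStr}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const body = renderCounter('http_requests_total', 'Total HTTP requests', [
  { labels: { method: 'GET', status: '200' }, value: 42 },
]);
```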

infra/observability/otelcol-config.yaml​

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: '0.0.0.0:8888'
  jaeger:
    endpoint: localhost:14250

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
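To smoke-test the collector's HTTP receiver, you can POST an OTLP/JSON trace to `http://localhost:4318/v1/traces`. A sketch of a minimal payload body (field names follow the OTLP JSON encoding; the service name and span ids are made-up test values):

```typescript
// Minimal OTLP/JSON trace payload for POST to
// http://localhost:4318/v1/traces with Content-Type: application/json.
const nowNs = BigInt(Date.now()) * 1_000_000n;

const payload = {
  resourceSpans: [
    {
      resource: {
        attributes: [
          { key: 'service.name', value: { stringValue: 'smoke-test' } },
        ],
      },
      scopeSpans: [
        {
          scope: { name: 'manual-test' },
          spans: [
            {
              traceId: '5b8efff798038103d269b633813fc60c', // 32 hex chars
              spanId: 'eee19b7ec3c1b174', // 16 hex chars
              name: 'ping',
              kind: 1, // SPAN_KIND_INTERNAL
              startTimeUnixNano: nowNs.toString(),
              endTimeUnixNano: (nowNs + 1_000_000n).toString(),
            },
          ],
        },
      ],
    },
  ],
};
```

If the collector accepts the POST and the trace appears in the backend, the 4318 receiver and the traces pipeline are wired correctly.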

Instrumenting Your Application​

NestJS API Instrumentation​

Install OTEL dependencies:

pnpm add --save-exact \
  @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-prometheus \
  @opentelemetry/exporter-jaeger \
  @opentelemetry/sdk-metrics \
  @opentelemetry/sdk-trace-node

Configure in apps/api/src/main.ts:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()],
  traceExporter: new JaegerExporter({
    endpoint: 'http://localhost:14268/api/traces',
  }),
  // PrometheusExporter is itself a MetricReader; it serves /metrics on :8888
  metricReader: new PrometheusExporter({
    port: 8888,
  }),
});

sdk.start();

// Flush and stop the SDK on shutdown so in-flight spans are exported
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => {
    process.exit(0);
  });
});

Custom Metrics​

Define custom metrics in your services:

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('qms-api');

// Counter: monotonically increasing total
const requestCounter = meter.createCounter('http_requests_total', {
  description: 'Total HTTP requests',
});

// Histogram: distribution of request durations
const requestDuration = meter.createHistogram('http_request_duration_seconds', {
  description: 'HTTP request duration in seconds',
});

// Use in a controller (the instruments above are module-level, not class members)
export class EmployeeController {
  @Get()
  getEmployees() {
    const start = performance.now();

    // ... get employees ...

    const duration = (performance.now() - start) / 1000;
    requestDuration.record(duration);
    requestCounter.add(1);

    return employees;
  }
}

Custom Spans (Tracing)​

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('qms-api');

export class UserService {
  async createUser(data: CreateUserDto) {
    const span = tracer.startSpan('user.create');

    try {
      span.setAttributes({
        'user.email': data.email,
        'user.department': data.department,
      });

      const user = await this.prisma.user.create({ data });
      span.setStatus({ code: SpanStatusCode.OK });
      return user;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  }
}

Prometheus Queries​

Common PromQL Queries​

# Request rate (requests per second)
rate(http_requests_total[5m])

# Request duration 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# API endpoint response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Memory usage
process_resident_memory_bytes

# Database connection pool utilization
db_connection_pool_utilization

# CPU usage
rate(process_cpu_seconds_total[5m]) * 100
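`rate()` estimates the per-second increase of a counter over the window, compensating for counter resets (process restarts). A dependency-free sketch of that calculation:

```typescript
// Approximate PromQL rate(): per-second increase of a counter over a window,
// treating any drop in value as a counter reset back to ~0.
interface Sample {
  t: number; // unix seconds
  v: number; // counter value
}

function counterRate(samples: Sample[]): number {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // After a reset the counter restarts near zero, so the new value
    // itself approximates the increase since the previous sample.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const windowSeconds = samples[samples.length - 1].t - samples[0].t;
  return increase / windowSeconds;
}

// 300 s window, counter climbs from 100 to 130 → 0.1 req/s.
const r = counterRate([
  { t: 0, v: 100 },
  { t: 150, v: 115 },
  { t: 300, v: 130 },
]);
```

This is why raw counters like `http_requests_total` are almost always wrapped in `rate()` before graphing or alerting.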

Grafana Dashboards​

Creating Dashboards​

  1. Access Grafana: http://localhost:3000/grafana
  2. Login: admin / admin
  3. Create β†’ Dashboard
  4. Add Panel
  5. Select Prometheus as data source
  6. Write PromQL query

Example Dashboard JSON​

{
  "dashboard": {
    "title": "QMS API Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(http_requests_total[5m])" }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }
        ],
        "type": "stat"
      }
    ]
  }
}

Alerting​

Define Alerts in Prometheus​

Edit prometheus.yml:

rule_files:
  - /etc/prometheus/alert.rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Create alert.rules.yml:

groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: 'High error rate detected'
          description: 'Error rate is {{ $value | humanizePercentage }}'

      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m
        annotations:
          summary: 'API is down'
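The `for:` clause means the expression must stay true for the whole duration before the alert fires; until then it is merely pending. A sketch of that pending-to-firing state machine:

```typescript
// Sketch of Prometheus' `for:` semantics: an alert goes pending when its
// expression first becomes true, and only fires once the expression has
// stayed true for the configured duration.
type AlertState = 'inactive' | 'pending' | 'firing';

class ForClause {
  private since: number | null = null;
  constructor(private forSeconds: number) {}

  evaluate(exprTrue: boolean, now: number): AlertState {
    if (!exprTrue) {
      this.since = null; // condition cleared: alert resolves
      return 'inactive';
    }
    if (this.since === null) this.since = now;
    return now - this.since >= this.forSeconds ? 'firing' : 'pending';
  }
}

const alert = new ForClause(300); // for: 5m
const s1 = alert.evaluate(true, 0); // pending
const s2 = alert.evaluate(true, 120); // still pending
const s3 = alert.evaluate(true, 300); // firing
const s4 = alert.evaluate(false, 360); // resolved
```

This is why transient error spikes shorter than the `for:` window never page anyone.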

Grafana Alert Notifications​

  1. Configuration β†’ Notification channels
  2. Add channel (Slack, Email, PagerDuty, etc.)
  3. Use in alert rules

Performance Monitoring​

Key Metrics to Monitor​

| Metric                | Normal Range | Alert Threshold |
| --------------------- | ------------ | --------------- |
| Request Latency (p95) | < 200ms      | > 500ms         |
| Error Rate            | < 0.1%       | > 1%            |
| CPU Usage             | < 50%        | > 80%           |
| Memory Usage          | < 60%        | > 85%           |
| Disk Usage            | < 70%        | > 90%           |
| Database Connections  | < 80%        | > 95%           |
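The thresholds above can be encoded once and checked uniformly in tooling or tests; a sketch (the numbers mirror the table, the helper name and metric keys are illustrative):

```typescript
// Evaluate a reading against the alert thresholds from the table above.
interface Threshold {
  alertAbove: number; // values above this should trigger an alert
}

const thresholds: Record<string, Threshold> = {
  latency_p95_ms: { alertAbove: 500 },
  error_rate_pct: { alertAbove: 1 },
  cpu_pct: { alertAbove: 80 },
  memory_pct: { alertAbove: 85 },
  disk_pct: { alertAbove: 90 },
  db_connections_pct: { alertAbove: 95 },
};

function shouldAlert(metric: string, value: number): boolean {
  const t = thresholds[metric];
  if (!t) throw new Error(`unknown metric: ${metric}`);
  return value > t.alertAbove;
}

const latencyAlert = shouldAlert('latency_p95_ms', 650); // above 500ms
const cpuOk = shouldAlert('cpu_pct', 45); // well inside normal range
```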

Example: Track API Performance​

Dashboard panels:

1. Request Volume (5m rate)
2. Response Time (p50, p95, p99)
3. Error Rate
4. Database Query Duration
5. Memory Usage

Troubleshooting​

Prometheus Scrape Issues​

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Verify metrics endpoint
curl http://localhost:3001/metrics

No Data in Grafana​

# Check data source connection
Grafana β†’ Configuration β†’ Data Sources β†’ Prometheus β†’ Test

# Verify metrics are being collected
docker-compose logs prometheus | grep "scrape"

High Memory Usage​

# Check Prometheus storage
du -sh /prometheus

# Reduce retention — note this is a Prometheus server flag, not a
# prometheus.yml setting; set it in the service's command in docker-compose
--storage.tsdb.retention.time=7d   # default is 15d

Missing Traces​

# Check OTEL collector configuration
docker-compose logs otel-collector

# Verify OTEL endpoint reachability
telnet localhost 4317

# Check application logs for OTEL errors
docker-compose logs api | grep -i otel

Best Practices​

1. Use Meaningful Metric Names​

// βœ… GOOD: Clear, specific
const orderProcessingDuration = meter.createHistogram('order_processing_duration_seconds');

// ❌ AVOID: Too generic
const duration = meter.createHistogram('duration');

2. Add Context with Attributes​

span.setAttributes({
'user.id': userId,
'order.id': orderId,
'operation.type': 'checkout',
});

3. Monitor Business Metrics​

// Application-level metrics
const ordersCreated = meter.createCounter('orders_created_total');
const checkoutValue = meter.createHistogram('checkout_value');

ordersCreated.add(1);
checkoutValue.record(orderTotal);

4. Set Appropriate Alert Thresholds​

  • Monitor trends over time before setting alerts
  • Avoid alert fatigue with realistic thresholds
  • Include runbooks in alert descriptions

5. Keep Retention Manageable​

# Balance data retention vs. storage via the Prometheus server flag
--storage.tsdb.retention.time=15d  # Adjust based on storage capacity

Data Export​

Export Metrics​

# Query and export as JSON
curl 'http://localhost:9090/api/v1/query?query=up' > metrics.json

# Backup Prometheus data
docker cp container_id:/prometheus ./prometheus-backup

Grafana Dashboards​

# Export dashboard as JSON
Grafana Dashboard Menu β†’ Share β†’ Export JSON

# Version control dashboards
git add dashboards/*.json && git commit -m "Update Grafana dashboards"

Quick Reference​

# Start observability stack
make up-lgtm

# View Prometheus targets
http://localhost:9090/targets

# Query metrics
http://localhost:9090/graph

# Access Grafana
http://localhost:3000/grafana

# Restart services
docker-compose restart prometheus grafana

# View logs
docker-compose logs -f otel-collector