# Observability Stack

## Overview
The application includes a complete observability stack with Prometheus for metrics collection, Grafana for visualization, and OpenTelemetry (OTEL) for distributed tracing. This enables monitoring, alerting, and troubleshooting of your applications.
## What is Observability?
Observability answers three core questions:
- Metrics: "What is happening?" (CPU, memory, request counts)
- Logs: "Why is it happening?" (Error messages, application events)
- Traces: "How did we get here?" (Request flow across services)
## Stack Components

### Prometheus
Prometheus is a time-series database that scrapes metrics from applications at regular intervals.
Features:
- Pull-based metrics collection
- Time-series data storage
- PromQL query language
- Alerting rules
- 15-day default retention
### Grafana
Grafana provides beautiful dashboards and alerts on top of Prometheus data.
Features:
- Interactive dashboards
- Multi-source support (Prometheus, Loki, etc.)
- Alert notifications (email, Slack, PagerDuty)
- User management and roles
- Dashboard versioning
### OpenTelemetry (OTEL)
OpenTelemetry collects traces, metrics, and logs from your application with minimal code changes.
Benefits:
- Vendor-neutral instrumentation standard
- Distributed tracing across services
- Automatic context propagation
- Works with any backend (Jaeger, Tempo, etc.)
## Starting the Observability Stack

### Start Everything

```bash
# Start main stack + observability
make up-lgtm

# or directly with Docker Compose
docker-compose -f docker-compose.yaml \
  -f docker-compose.observability.yaml up -d
```
### Access Points

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000/grafana (username: admin, password: admin)
- OTEL Collector: localhost:4317 (gRPC)
## Configuration Files

### infra/observability/prometheus.yml

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:3001']
    metrics_path: '/metrics'

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
```
### infra/observability/otelcol-config.yaml

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: '0.0.0.0:8888'
  jaeger:
    endpoint: localhost:14250

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
## Instrumenting Your Application

### NestJS API Instrumentation

Install the OTEL dependencies:

```bash
pnpm add --save-exact \
  @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-prometheus \
  @opentelemetry/exporter-jaeger \
  @opentelemetry/sdk-metrics \
  @opentelemetry/sdk-trace-node
```
Configure in apps/api/src/main.ts:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()],
  traceExporter: new JaegerExporter({
    endpoint: 'http://localhost:14268/api/traces',
  }),
  metricReader: new PrometheusExporter({
    port: 8888,
  }),
});

// Start the SDK before the NestJS app bootstraps so auto-instrumentation
// can patch modules as they are loaded.
sdk.start();

// Graceful shutdown: flush pending telemetry before exiting
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => {
    process.exit(0);
  });
});
```
### Custom Metrics

Define custom metrics in your services:

```typescript
import { Controller, Get } from '@nestjs/common';
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('qms-api');

// Counter: monotonically increasing total
const requestCounter = meter.createCounter('http_requests_total', {
  description: 'Total HTTP requests',
});

// Histogram: distribution of request durations
const requestDuration = meter.createHistogram('http_request_duration_seconds', {
  description: 'HTTP request duration in seconds',
});

@Controller('employees')
export class EmployeeController {
  @Get()
  getEmployees() {
    const start = performance.now();
    const employees = []; // ... fetch employees ...
    const duration = (performance.now() - start) / 1000;
    requestDuration.record(duration);
    requestCounter.add(1);
    return employees;
  }
}
```
### Custom Spans (Tracing)

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('qms-api');

export class UserService {
  async createUser(data: CreateUserDto) {
    const span = tracer.startSpan('user.create');
    try {
      span.setAttributes({
        'user.email': data.email,
        'user.department': data.department,
      });
      const user = await this.prisma.user.create({ data });
      span.setStatus({ code: SpanStatusCode.OK });
      return user;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  }
}
```
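The try/catch/finally pattern above repeats for every traced operation, so it is common to factor it into a small wrapper. This is a hypothetical helper, not part of the OTEL API; `SpanLike` models only the span calls used in the snippet above:

```typescript
// Minimal stand-in for the OTEL Span interface (only the calls we use)
interface SpanLike {
  recordException(e: unknown): void;
  setStatus(s: { code: number }): void;
  end(): void;
}

// SpanStatusCode.OK === 1, SpanStatusCode.ERROR === 2 in @opentelemetry/api
const OK = 1;
const ERROR = 2;

// Run async work inside a span, recording errors and always ending it.
async function withSpan<T>(span: SpanLike, fn: () => Promise<T>): Promise<T> {
  try {
    const result = await fn();
    span.setStatus({ code: OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: ERROR });
    throw error;
  } finally {
    span.end();
  }
}
```

With a helper like this, `createUser` shrinks to one `withSpan(tracer.startSpan('user.create'), ...)` call per operation.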
## Prometheus Queries

### Common PromQL Queries

```promql
# Request rate (requests per second)
rate(http_requests_total[5m])

# Request duration, 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# Average API response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

# Memory usage
process_resident_memory_bytes

# Database connection pool utilization
db_connection_pool_utilization

# CPU usage
rate(process_cpu_seconds_total[5m]) * 100
```
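`histogram_quantile` does not read exact latencies — it interpolates within the cumulative buckets the histogram exports. A simplified sketch of that arithmetic (ignoring the `+Inf` bucket and the `rate()` step):

```typescript
// Cumulative bucket: `count` observations were <= `le` seconds
type Bucket = { le: number; count: number };

// Simplified version of Prometheus' histogram_quantile: find the
// bucket containing the target rank, then interpolate linearly.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      const width = b.le - prevLe;
      const inBucket = b.count - prevCount;
      return prevLe + width * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}

// 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

This is why bucket boundaries matter: the p95 here is an estimate between 0.5s and 1s, and coarser buckets give coarser estimates.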
## Grafana Dashboards

### Creating Dashboards

1. Access Grafana: http://localhost:3000/grafana
2. Log in: admin / admin
3. Create → Dashboard
4. Add Panel
5. Select Prometheus as the data source
6. Write a PromQL query
### Example Dashboard JSON

```json
{
  "dashboard": {
    "title": "QMS API Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(http_requests_total[5m])" }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }
        ],
        "type": "stat"
      }
    ]
  }
}
```
## Alerting

### Define Alerts in Prometheus

Edit prometheus.yml:

```yaml
rule_files:
  - /etc/prometheus/alert.rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

Create alert.rules.yml:

```yaml
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: 'High error rate detected'
          description: 'Error rate is {{ $value | humanizePercentage }}'

      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m
        annotations:
          summary: 'API is down'
```
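The `for:` clause keeps an alert in a pending state until its expression has been true continuously for the hold duration, which filters out short spikes. The state machine is roughly (a sketch, not Prometheus source):

```typescript
type AlertPhase = 'inactive' | 'pending' | 'firing';

// trueSince: timestamp (seconds) when the alert expression first
// became true, or null if it is currently false.
function alertPhase(trueSince: number | null, now: number, holdSeconds: number): AlertPhase {
  if (trueSince === null) return 'inactive';
  return now - trueSince >= holdSeconds ? 'firing' : 'pending';
}

// HighErrorRate has `for: 5m` (300 s):
console.log(alertPhase(null, 1000, 300)); // "inactive" (expression false)
console.log(alertPhase(900, 1000, 300));  // "pending"  (true for only 100 s)
console.log(alertPhase(600, 1000, 300));  // "firing"   (true for 400 s)
```

Any dip back below the threshold resets the timer, so a flapping error rate never reaches `firing`.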
### Grafana Alert Notifications

1. Configuration → Notification channels
2. Add a channel (Slack, Email, PagerDuty, etc.)
3. Use it in alert rules
## Performance Monitoring

### Key Metrics to Monitor
| Metric | Normal Range | Alert Threshold |
|---|---|---|
| Request Latency (p95) | < 200ms | > 500ms |
| Error Rate | < 0.1% | > 1% |
| CPU Usage | < 50% | > 80% |
| Memory Usage | < 60% | > 85% |
| Disk Usage | < 70% | > 90% |
| Database Connections | < 80% | > 95% |
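The table's two columns translate naturally into a three-way severity: values inside the normal range are healthy, values past the alert threshold page someone, and the band in between is a warning. A hypothetical helper making that explicit:

```typescript
type Severity = 'ok' | 'warn' | 'alert';

// Classify a measured value against a normal ceiling and an alert floor
// (both from the table above; units must match the metric's).
function classify(value: number, normalMax: number, alertMin: number): Severity {
  if (value > alertMin) return 'alert';
  if (value > normalMax) return 'warn';
  return 'ok';
}

// p95 request latency in ms: normal < 200, alert > 500
console.log(classify(150, 200, 500)); // "ok"
console.log(classify(350, 200, 500)); // "warn"
console.log(classify(650, 200, 500)); // "alert"
```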
### Example: Track API Performance
Dashboard panels:
1. Request Volume (5m rate)
2. Response Time (p50, p95, p99)
3. Error Rate
4. Database Query Duration
5. Memory Usage
## Troubleshooting

### Prometheus Scrape Issues

```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Verify the metrics endpoint
curl http://localhost:3001/metrics
```
### No Data in Grafana

Check the data source connection: Grafana → Configuration → Data Sources → Prometheus → Test.

Verify that metrics are being collected:

```bash
docker-compose logs prometheus | grep "scrape"
```
### High Memory Usage

```bash
# Check Prometheus storage
du -sh /prometheus
```

Retention is not set in prometheus.yml; reduce it with a server flag on the Prometheus command line (e.g., in the Docker Compose service command):

```bash
--storage.tsdb.retention.time=7d   # default is 15d
```
### Missing Traces

```bash
# Check OTEL collector logs
docker-compose logs otel-collector

# Verify OTEL endpoint reachability
telnet localhost 4317

# Check application logs for OTEL errors
docker-compose logs api | grep -i otel
```
## Best Practices

### 1. Use Meaningful Metric Names

```typescript
// GOOD: clear and specific, with the unit in the name
const orderProcessingDuration = meter.createHistogram('order_processing_duration_seconds');

// AVOID: too generic
const duration = meter.createHistogram('duration');
```
### 2. Add Context with Attributes

```typescript
span.setAttributes({
  'user.id': userId,
  'order.id': orderId,
  'operation.type': 'checkout',
});
```
### 3. Monitor Business Metrics

```typescript
// Application-level metrics
const ordersCreated = meter.createCounter('orders_created_total');
const checkoutValue = meter.createHistogram('checkout_value');

ordersCreated.add(1);
checkoutValue.record(orderTotal);
```
### 4. Set Appropriate Alert Thresholds
- Monitor trends over time before setting alerts
- Avoid alert fatigue with realistic thresholds
- Include runbooks in alert descriptions
### 5. Keep Retention Manageable

Balance data retention against storage capacity. Retention is configured via a Prometheus server flag, not in prometheus.yml:

```bash
--storage.tsdb.retention.time=15d   # adjust based on storage capacity
```
## Data Export

### Export Metrics

```bash
# Query and export as JSON
curl 'http://localhost:9090/api/v1/query?query=up' > metrics.json

# Backup Prometheus data
docker cp container_id:/prometheus ./prometheus-backup
```
### Export Grafana Dashboards

Export a dashboard as JSON via Dashboard Menu → Share → Export JSON, then keep it under version control:

```bash
git add dashboards/*.json
git commit -m "Update Grafana dashboards"
```
## Advanced Topics
- Distributed Tracing with Jaeger
- Alert Management with AlertManager
- OpenTelemetry Best Practices
- Custom Exporters
## Quick Reference

```bash
# Start observability stack
make up-lgtm

# Restart services
docker-compose restart prometheus grafana

# View logs
docker-compose logs -f otel-collector
```

- View Prometheus targets: http://localhost:9090/targets
- Query metrics: http://localhost:9090/graph
- Access Grafana: http://localhost:3000/grafana