MONITOR-001: Monitoring & Observability Runbook

Status: Draft — Pending Observability Stack Implementation
Owner: Development + CTO
Review Cycle: Monthly after go-live
Last Updated: 2026-07-03


Overview

This runbook defines the monitoring and alerting setup for Cartly. It covers infrastructure monitoring, application performance monitoring (APM), error tracking, and business metrics.

Note: Sprint 1 defines the observability stack. Concrete alert thresholds will be filled in after implementation.


1. Monitoring Stack

1.1 Selected Tools

| Tool | Purpose | Tier |
| --- | --- | --- |
| Sentry | Error Tracking | Critical |
| Highlight.io | Session Replay / Frontend APM | High |
| healthchecks.io | Cron Job & Backup Monitoring | High |
| Railway Metrics | Infrastructure (CPU, Memory, Network) | High |
| PostgreSQL Analytics | Database Performance | Medium |
| Upstash Redis | Cache Hit Rates, Memory | Medium |
| OpenRouter | AI API Usage & Costs | Medium |

1.2 Architecture

┌────────────────────────────────────────────────────────────┐
│                      Monitoring Stack                      │
│                                                            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │    Sentry    │    │ Highlight.io │    │ healthchecks │  │
│  │  Error/Perf  │    │Session Replay│    │ Cron/Backup  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   Railway    │    │  PostgreSQL  │    │   Upstash    │  │
│  │ Infra Metrics│    │  pg_stat_*   │    │ Redis Monitor│  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                            │
│  ┌──────────────┐                                          │
│  │  OpenRouter  │                                          │
│  │ AI Usage/Cost│                                          │
│  └──────────────┘                                          │
└────────────────────────────────────────────────────────────┘

2. Sentry Configuration

2.1 Projects

| Project | Scope | DSN |
| --- | --- | --- |
| cartly-api | Backend (Fastify + Prisma) | TBD |
| cartly-web | Frontend (React PWA) | TBD |

2.2 Alert Rules

| Alert | Condition | Severity | Notify |
| --- | --- | --- | --- |
| New Issue | Any new error fingerprint | SEV-2 | Slack #cartly-alerts |
| P1 Error Spike | >10 errors in 5min | SEV-1 | Slack #cartly-alerts + PagerDuty |
| Performance Degradation | p95 latency >2s | SEV-2 | Slack #cartly-alerts |
| Error Rate | >1% of requests | SEV-2 | Slack #cartly-alerts |
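
The "P1 Error Spike" condition (more than 10 errors in 5 minutes) is a sliding-window count. The sketch below illustrates that logic only; `createSpikeDetector` and its parameters are hypothetical names, not part of the Sentry SDK, and in practice the rule would be configured in Sentry itself.

```typescript
// Sliding-window spike detection, mirroring the ">10 errors in 5min" rule.
// createSpikeDetector is an illustrative helper, not a Sentry API.
function createSpikeDetector(maxErrors = 10, windowMs = 5 * 60 * 1000) {
  let timestamps: number[] = [];

  return {
    // Record one error at `nowMs`; returns true when the window exceeds the limit.
    record(nowMs: number): boolean {
      timestamps.push(nowMs);
      // Keep only events that are still inside the window.
      const cutoff = nowMs - windowMs;
      timestamps = timestamps.filter((t) => t > cutoff);
      return timestamps.length > maxErrors;
    },
  };
}
```

Ten errors inside the window stay below the limit; the eleventh trips it, which matches the strict ">10" wording of the rule.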

2.3 Release Tracking

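# GitHub Action: create the Sentry release, upload sourcemaps, record the deploy
# Assumptions: build output lives in ./dist; the auth token is stored as the SENTRY_AUTH_TOKEN secret
- name: Upload sourcemaps and record deploy
  env:
    SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
    SENTRY_ORG: cartly
    SENTRY_PROJECT: cartly-api
  run: |
    npx sentry-cli releases new "$GITHUB_SHA"
    npx sentry-cli sourcemaps upload --release "$GITHUB_SHA" ./dist
    npx sentry-cli releases finalize "$GITHUB_SHA"
    npx sentry-cli releases deploys "$GITHUB_SHA" new --env production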

3. Healthchecks.io Configuration

3.1 Monitored Services

| Check | Schedule (cron) | Timeout | Expected |
| --- | --- | --- | --- |
| daily-backup | 0 3 * * * | 1h | 200 |
| weekly-full-backup | 0 1 * * 0 | 2h | 200 |
| backup-verify | 0 4 1 * * | 4h | 200 |
| db-pitr-wal | */15 * * * * | 15min | 200 |

3.2 Alerting

# healthchecks.io email integration (illustrative notation; recipients are set on the project's Integrations page)
failure_emails:
  - dev@cartly.io
  - cto@cartly.io
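
For the checks above to fire, each job has to ping healthchecks.io itself. The following is a minimal sketch of such a wrapper, assuming the standard `hc-ping.com` ping URLs with the `/start` and `/fail` suffixes. `withHealthcheck`, the check UUID, and the injected `ping` function are illustrative names; in Node 18+ the ping could simply be `(url) => fetch(url)`.

```typescript
// Wrap a job so it reports start, success, and failure to healthchecks.io.
// The check UUID and the job passed in are placeholders.
type PingKind = "start" | "success" | "fail";
type Ping = (url: string) => Promise<unknown>;

const HC_BASE = "https://hc-ping.com";

// A bare UUID URL signals success; /start and /fail signal the other states.
function pingUrl(checkUuid: string, kind: PingKind): string {
  const suffix = kind === "success" ? "" : `/${kind}`;
  return `${HC_BASE}/${checkUuid}${suffix}`;
}

async function withHealthcheck<T>(
  checkUuid: string,
  job: () => Promise<T>,
  ping: Ping,
): Promise<T> {
  // A failed ping must never fail the job itself, so ping errors are swallowed.
  await ping(pingUrl(checkUuid, "start")).catch(() => undefined);
  try {
    const result = await job();
    await ping(pingUrl(checkUuid, "success")).catch(() => undefined);
    return result;
  } catch (err) {
    await ping(pingUrl(checkUuid, "fail")).catch(() => undefined);
    throw err;
  }
}
```

Injecting `ping` keeps the wrapper testable without network access. If the job never sends its success ping, healthchecks.io raises the alert once the check's timeout has passed.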

4. Railway Metrics Alerts

4.1 Thresholds

| Metric | Warning | Critical | Action |
| --- | --- | --- | --- |
| CPU Usage | >70% | >90% | Scale up or investigate |
| Memory Usage | >75% | >90% | Scale up or investigate |
| Disk Usage | >80% | >95% | Cleanup or extend volume |
| Request Rate | >1000 RPM | >5000 RPM | Rate limit or scale |
| Response Time p95 | >500ms | >2000ms | Profile / optimize |
| Error Rate | >0.5% | >2% | Investigate errors |
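
The thresholds above can be kept as data and evaluated in one place. This is a rough sketch under the assumption that metrics arrive as plain numbers in the units of the table (%, RPM, ms). The metric keys and the `classify` function are hypothetical names, not a Railway API.

```typescript
type Level = "ok" | "warning" | "critical";

type MetricName =
  | "cpuPercent"
  | "memoryPercent"
  | "diskPercent"
  | "requestsPerMinute"
  | "responseTimeP95Ms"
  | "errorRatePercent";

interface Threshold {
  warning: number;
  critical: number;
}

// Values copied from the thresholds table; key names are illustrative.
const THRESHOLDS: Record<MetricName, Threshold> = {
  cpuPercent: { warning: 70, critical: 90 },
  memoryPercent: { warning: 75, critical: 90 },
  diskPercent: { warning: 80, critical: 95 },
  requestsPerMinute: { warning: 1000, critical: 5000 },
  responseTimeP95Ms: { warning: 500, critical: 2000 },
  errorRatePercent: { warning: 0.5, critical: 2 },
};

// Strictly-greater comparison, matching the ">" used in the table.
function classify(metric: MetricName, value: number): Level {
  const t = THRESHOLDS[metric];
  if (value > t.critical) return "critical";
  if (value > t.warning) return "warning";
  return "ok";
}
```

For example, `classify("memoryPercent", 80)` yields `"warning"`, while a value exactly at a threshold still counts as the lower level.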

4.2 Railway Alerts Configuration

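# Via Railway CLI / Dashboard
# Placeholder pseudocode: the commands below are not verified Railway CLI syntax.
# Confirm which alerting mechanism Railway actually offers before relying on them.
railway alerts add --metric cpu --threshold 90 --severity critical
railway alerts add --metric memory --threshold 90 --severity critical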

5. PostgreSQL Monitoring

5.1 Key Metrics

| Metric | Query | Healthy |
| --- | --- | --- |
| Connection Count | SELECT count(*) FROM pg_stat_activity; | <80% of max_connections |
| Cache Hit Ratio | SELECT sum(blks_hit)*100/sum(blks_hit+blks_read) FROM pg_stat_database; | >95% |
| Long Running Queries | SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '5 minutes'; | 0 |
| Replication Lag | SELECT now() - pg_last_xact_replay_timestamp() AS lag; | <30s |
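
Once the query results are collected, the "Healthy" column can be checked programmatically. The sketch below assumes the values have already been fetched (for example via Prisma raw queries) into a plain object. `PgSnapshot` and `pgProblems` are hypothetical names. Replication lag is nullable because `pg_last_xact_replay_timestamp()` returns NULL on a primary.

```typescript
interface PgSnapshot {
  connections: number; // count(*) from pg_stat_activity
  maxConnections: number; // value of max_connections
  blksHit: number; // sum(blks_hit) from pg_stat_database
  blksRead: number; // sum(blks_read) from pg_stat_database
  longRunningQueries: number; // rows returned by the 5-minute query
  replicationLagSeconds: number | null; // null when not running on a replica
}

// Returns null when nothing has been read yet, avoiding a division by zero.
function cacheHitRatio(blksHit: number, blksRead: number): number | null {
  const total = blksHit + blksRead;
  return total === 0 ? null : (blksHit * 100) / total;
}

// Lists every metric that violates the "Healthy" column of the table.
function pgProblems(s: PgSnapshot): string[] {
  const problems: string[] = [];
  if (s.connections >= 0.8 * s.maxConnections) problems.push("connection_count");
  const ratio = cacheHitRatio(s.blksHit, s.blksRead);
  if (ratio !== null && ratio <= 95) problems.push("cache_hit_ratio");
  if (s.longRunningQueries > 0) problems.push("long_running_queries");
  if (s.replicationLagSeconds !== null && s.replicationLagSeconds >= 30) {
    problems.push("replication_lag");
  }
  return problems;
}
```

An empty array means all four metrics are within their healthy ranges.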

5.2 Slow Query Log

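# postgresql.conf
log_min_duration_statement = 1000  # log statements that run for 1s or longer (value in ms)
log_lock_waits = on                # log lock waits longer than deadlock_timeout
log_temp_files = 0                 # log every temporary file, regardless of size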

6. Redis (Upstash) Monitoring

6.1 Key Metrics

| Metric | Healthy | Alert |
| --- | --- | --- |
| Memory Usage | <70% of quota | >85% |
| Cache Hit Rate | >90% | <80% |
| Connected Clients | <80% of limit | >90% |
| Evictions | 0 | >0 |

7. OpenRouter AI Monitoring

7.1 Metrics to Track

| Metric | Purpose | Alert |
| --- | --- | --- |
| Requests per Day | Usage/Cost | >10k/day |
| Cost per Day | Budget | >$50/day |
| Error Rate | Provider Issues | >1% |
| Latency p95 | Performance | >10s |
| Token Usage | Cost Breakdown | Near monthly limit |

7.2 Budget Alerts

# OpenRouter Dashboard or API
# Set daily/monthly budget alerts at 80% threshold
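
If budget alerts end up being computed in-house rather than in the OpenRouter dashboard, the 80% rule reduces to a small calculation. The sketch below assumes that spend figures are already available from usage data. `budgetStatus` and `projectMonthlySpend` are hypothetical helpers, and the budget amounts are supplied by the caller rather than taken from this runbook.

```typescript
interface BudgetStatus {
  fractionUsed: number; // 0.8 means 80% of the budget is spent
  alert: boolean; // true once the alert threshold is reached
}

// Compare spend against a budget; alertAt defaults to the 80% threshold.
function budgetStatus(spentUsd: number, budgetUsd: number, alertAt = 0.8): BudgetStatus {
  if (budgetUsd <= 0) throw new RangeError("budgetUsd must be positive");
  const fractionUsed = spentUsd / budgetUsd;
  return { fractionUsed, alert: fractionUsed >= alertAt };
}

// Linear projection of month-end spend from month-to-date spend.
function projectMonthlySpend(spentToDateUsd: number, dayOfMonth: number, daysInMonth: number): number {
  if (dayOfMonth < 1 || dayOfMonth > daysInMonth) throw new RangeError("dayOfMonth out of range");
  return (spentToDateUsd / dayOfMonth) * daysInMonth;
}
```

Feeding the projection into `budgetStatus` gives an early warning for the "Near monthly limit" case before the limit is actually reached.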

8. Dashboard Overview

8.1 Recommended Dashboards

  1. Sentry Overview Dashboard — Error rates, trends, top issues
  2. Railway Production Dashboard — Real-time CPU, memory, requests
  3. PostgreSQL pgAdmin Dashboard — Query performance, connections
  4. Custom Grafana Dashboard — Cross-cutting view (optional)

8.2 Daily Health Check

# Automated via healthchecks.io or manual
□ Sentry: No new SEV-1 issues
□ Railway: All services green
□ Database: No long-running queries
□ Redis: Hit rate >90%
□ Backups: All checks passed
□ OpenRouter: Error rate <1%

9. Alert Response Procedures

SEV-1 (Critical)

  1. Acknowledge — Immediately acknowledge in PagerDuty/Slack
  2. Assess — Is the production system affected?
  3. Communicate — Post in #cartly-alerts: "SEV-1: [Brief description] — investigating"
  4. Escalate — Call CTO if not resolved within 15min
  5. Resolve — Deploy fix or rollback
  6. Post-mortem — Required within 48h

SEV-2 (Warning)

  1. Acknowledge — Within 30min during business hours
  2. Investigate — Identify root cause
  3. Communicate — Post status in #cartly-alerts
  4. Resolve — Within 4h or escalate to SEV-1

SEV-3 (Info)

  1. Track — Log in issue tracker
  2. Schedule — Fix in next sprint
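
The time limits in the procedures above can be turned into concrete deadlines per alert. This is a sketch only: it assumes every deadline is measured from the moment the alert is raised (the runbook does not say whether the 48h post-mortem window starts at detection or at resolution) and it ignores the "business hours" qualifier for SEV-2. `responseDeadlines` is a hypothetical helper.

```typescript
type Severity = "SEV-1" | "SEV-2" | "SEV-3";

interface Deadlines {
  acknowledgeBy: Date | null;
  escalateBy: Date | null; // SEV-1: call CTO; SEV-2: escalate to SEV-1
  postMortemBy: Date | null;
}

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

// All deadlines are offsets from raisedAt (an assumption, see lead-in).
function responseDeadlines(severity: Severity, raisedAt: Date): Deadlines {
  const after = (ms: number) => new Date(raisedAt.getTime() + ms);
  switch (severity) {
    case "SEV-1":
      // Acknowledge immediately, escalate after 15min, post-mortem within 48h.
      return { acknowledgeBy: raisedAt, escalateBy: after(15 * MINUTE_MS), postMortemBy: after(48 * HOUR_MS) };
    case "SEV-2":
      // Acknowledge within 30min, resolve within 4h or escalate.
      return { acknowledgeBy: after(30 * MINUTE_MS), escalateBy: after(4 * HOUR_MS), postMortemBy: null };
    case "SEV-3":
      // Tracked in the issue tracker; no timed response.
      return { acknowledgeBy: null, escalateBy: null, postMortemBy: null };
  }
}
```

A bot posting to #cartly-alerts could attach these timestamps to the initial alert message so the on-call engineer sees the escalation time up front.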

10. Open Items


11. Emergency Contacts

| Situation | Contact | Response Time |
| --- | --- | --- |
| Production Down | DEV (9f66dba7) + CTO (b999c0b2) | Immediately |
| Monitoring System Unavailable | DEV (9f66dba7) | Within 1h |
| Backup Alert | DEV (9f66dba7) | Within 1h |

Created: 2026-07-03 by Documentation Agent (a66674bf)
Review: After completion of Sprint 1, by CTO + DEV