MONITOR-001: Monitoring & Observability Runbook

Status: Draft — Pending Observability Stack Implementation
Owner: Development + CTO
Review Cycle: Monthly after go-live
Last Updated: 2026-07-03

Overview

This runbook defines the monitoring and alerting setup for Cartly. It covers infrastructure monitoring, application performance monitoring (APM), error tracking, and business metrics.

Note: Sprint 1 defines the observability stack. Concrete alert thresholds will be filled in after implementation.

1. Monitoring Stack

1.1 Selected Tools

Tool	Purpose	Priority
Sentry	Error Tracking	Critical
Highlight.io	Session Replay / Frontend APM	High
healthchecks.io	Cron Job & Backup Monitoring	High
Railway Metrics	Infrastructure (CPU, Memory, Network)	High
PostgreSQL Analytics	Database Performance	Medium
Upstash Redis	Cache Hit Rates, Memory	Medium
OpenRouter	AI API Usage & Costs	Medium

1.2 Architecture

┌────────────────────────────────────────────────────────────┐
│                      Monitoring Stack                      │
│                                                            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │    Sentry    │    │ Highlight.io │    │ healthchecks │  │
│  │  Error/Perf  │    │Session Replay│    │ Cron/Backup  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   Railway    │    │  PostgreSQL  │    │   Upstash    │  │
│  │ Infra Metrics│    │  pg_stat_*   │    │ Redis Monitor│  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                            │
│  ┌──────────────┐                                          │
│  │  OpenRouter  │                                          │
│  │ AI Usage/Cost│                                          │
│  └──────────────┘                                          │
└────────────────────────────────────────────────────────────┘

2. Sentry Configuration

2.1 Projects

Project	Scope	DSN
`cartly-api`	Backend (Fastify + Prisma)	TBD
`cartly-web`	Frontend (React PWA)	TBD

2.2 Alert Rules

Alert	Condition	Severity	Notify
New Issue	Any new error fingerprint	SEV-2	Slack #cartly-alerts
P1 Error Spike	>10 errors in 5min	SEV-1	Slack #cartly-alerts + PagerDuty
Performance Degradation	p95 latency >2s	SEV-2	Slack #cartly-alerts
Error Rate	>1% of requests	SEV-2	Slack #cartly-alerts
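
The alert rules above can be read as simple predicates over a metrics window. The following is a minimal sketch of that logic, not Sentry's API: the real rules would be configured in Sentry's alert settings, and `WindowStats`, `FiredAlert`, and `evaluateAlerts` are hypothetical names. The 5-minute window for the error rate is also an assumption, since the table does not state one.

```typescript
// Hypothetical sketch: the real rules live in Sentry's alert settings.
type Severity = "SEV-1" | "SEV-2";

interface WindowStats {
  errorsLast5Min: number;   // error events in the last 5 minutes
  requestsLast5Min: number; // total requests in the same window (assumed window)
  p95LatencyMs: number;     // p95 latency in milliseconds
  newFingerprints: number;  // previously unseen error fingerprints
}

interface FiredAlert {
  name: string;
  severity: Severity;
  notify: string[];
}

const SLACK = "Slack #cartly-alerts";

function evaluateAlerts(s: WindowStats): FiredAlert[] {
  const fired: FiredAlert[] = [];
  if (s.newFingerprints > 0) {
    fired.push({ name: "New Issue", severity: "SEV-2", notify: [SLACK] });
  }
  if (s.errorsLast5Min > 10) {
    fired.push({ name: "P1 Error Spike", severity: "SEV-1", notify: [SLACK, "PagerDuty"] });
  }
  if (s.p95LatencyMs > 2000) {
    fired.push({ name: "Performance Degradation", severity: "SEV-2", notify: [SLACK] });
  }
  // Guard against division by zero when there was no traffic in the window.
  if (s.requestsLast5Min > 0 && s.errorsLast5Min / s.requestsLast5Min > 0.01) {
    fired.push({ name: "Error Rate", severity: "SEV-2", notify: [SLACK] });
  }
  return fired;
}
```

Note that a spike of 12 errors among 10,000 requests fires only the SEV-1 spike rule, because the error rate (0.12%) stays below the 1% threshold.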

2.3 Release Tracking

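# GitHub Action: create the Sentry release, upload sourcemaps, record the deploy
# Assumptions: build output in ./dist, auth token stored as secret SENTRY_AUTH_TOKEN
- name: Upload sourcemaps and record deploy
  env:
    SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
    SENTRY_ORG: cartly
    SENTRY_PROJECT: cartly-api
  run: |
    npx sentry-cli releases new "$GITHUB_SHA"
    npx sentry-cli sourcemaps upload --release "$GITHUB_SHA" ./dist
    npx sentry-cli releases finalize "$GITHUB_SHA"
    npx sentry-cli releases deploys "$GITHUB_SHA" new -e production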

3. Healthchecks.io Configuration

3.1 Monitored Services

Check	Schedule	Timeout	Expected
`daily-backup`	`0 3 * * *`	1h	200
`weekly-full-backup`	`0 1 * * 0`	2h	200
`backup-verify`	`0 4 1 * *`	4h	200
`db-pitr-wal`	`*/15 * * * *`	15min	200
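
Each check in the table only stays green if the job actually pings healthchecks.io. The sketch below shows one way a backup job could report start, success, and failure, using the service's ping endpoints (the check URL for success, plus `/start` and `/fail` suffixes). `runWithHeartbeat`, `pingUrl`, and the injected `send` function are hypothetical names, and the check URL in the usage note is a placeholder.

```typescript
// Sketch only: the check URL and the job are placeholders, not Cartly's real values.
type Signal = "start" | "success" | "fail";

// healthchecks.io ping endpoints: <base> for success, <base>/start, <base>/fail.
function pingUrl(checkUrl: string, signal: Signal): string {
  const base = checkUrl.replace(/\/+$/, ""); // drop trailing slashes
  return signal === "success" ? base : `${base}/${signal}`;
}

// `send` is injected so the wrapper can be exercised without network access.
async function runWithHeartbeat(
  checkUrl: string,
  job: () => Promise<void>,
  send: (url: string) => Promise<unknown>,
): Promise<void> {
  await send(pingUrl(checkUrl, "start"));
  try {
    await job();
  } catch (err) {
    await send(pingUrl(checkUrl, "fail"));
    throw err; // keep the job's own failure visible to the caller
  }
  await send(pingUrl(checkUrl, "success"));
}
```

In a real job, `send` could be something like `(url) => fetch(url)`, called as `runWithHeartbeat("https://hc-ping.com/<check-uuid>", runDailyBackup, send)` where `runDailyBackup` stands in for the actual backup routine. A production version would probably also catch errors from `send` itself, so that a monitoring outage never blocks a backup.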

3.2 Alerting

# healthchecks.io integration
failure_emails:
  - dev@cartly.io
  - cto@cartly.io

4. Railway Metrics Alerts

4.1 Thresholds

Metric	Warning	Critical	Action
CPU Usage	>70%	>90%	Scale up or investigate
Memory Usage	>75%	>90%	Scale up or investigate
Disk Usage	>80%	>95%	Cleanup or extend volume
Request Rate	>1000 RPM	>5000 RPM	Rate limit or scale
Response Time p95	>500ms	>2000ms	Profile / optimize
Error Rate	>0.5%	>2%	Investigate errors
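
To make the threshold table unambiguous, the sketch below encodes it as data and classifies a reading as ok, warning, or critical. The values are copied from the table; the key names (for example `cpuPercent`) and the `classify` function are hypothetical, and a value exactly at a threshold is treated as not exceeding it, matching the `>` in the table.

```typescript
type Level = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
  action: string;
}

// Values copied from the threshold table; units are part of each key name.
const THRESHOLDS = {
  cpuPercent: { warning: 70, critical: 90, action: "Scale up or investigate" },
  memoryPercent: { warning: 75, critical: 90, action: "Scale up or investigate" },
  diskPercent: { warning: 80, critical: 95, action: "Cleanup or extend volume" },
  requestsPerMinute: { warning: 1000, critical: 5000, action: "Rate limit or scale" },
  p95ResponseMs: { warning: 500, critical: 2000, action: "Profile / optimize" },
  errorRatePercent: { warning: 0.5, critical: 2, action: "Investigate errors" },
};

type Metric = keyof typeof THRESHOLDS;

function classify(metric: Metric, value: number): Level {
  const t: Threshold = THRESHOLDS[metric];
  if (value > t.critical) return "critical";
  if (value > t.warning) return "warning";
  return "ok";
}

// Returns the recommended action from the table, or null when the metric is healthy.
function recommendedAction(metric: Metric, value: number): string | null {
  return classify(metric, value) === "ok" ? null : THRESHOLDS[metric].action;
}
```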

4.2 Railway Alerts Configuration

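# Via Railway CLI / Dashboard
# Illustrative syntax only: confirm the actual command or dashboard setting in the Railway documentation before use
railway alerts add --metric cpu --threshold 90 --severity critical
railway alerts add --metric memory --threshold 90 --severity critical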

5. PostgreSQL Monitoring

5.1 Key Metrics

Metric	Query	Healthy
Connection Count	`SELECT count(*) FROM pg_stat_activity;`	<80% of max_connections
Cache Hit Ratio	`SELECT sum(blks_hit)*100/sum(blks_hit+blks_read) FROM pg_stat_database;`	>95%
Long Running Queries	`SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '5 minutes';`	0
Replication Lag	`SELECT now() - pg_last_xact_replay_timestamp() AS lag;`	<30s
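
The first two health criteria can be checked in application code once the query results are available. This is a minimal sketch under the assumption that the counters have already been fetched (for example through Prisma raw queries); `DbStats`, `cacheHitRatio`, and `isDbHealthy` are hypothetical names.

```typescript
interface DbStats {
  blksHit: number;        // sum(blks_hit) from pg_stat_database
  blksRead: number;       // sum(blks_read) from pg_stat_database
  connections: number;    // count(*) from pg_stat_activity
  maxConnections: number; // the server's max_connections setting
}

// Cache hit ratio in percent; null when no blocks have been requested yet,
// which avoids the division by zero the raw SQL would hit on an idle database.
function cacheHitRatio(s: DbStats): number | null {
  const total = s.blksHit + s.blksRead;
  return total === 0 ? null : (s.blksHit * 100) / total;
}

function isDbHealthy(s: DbStats): boolean {
  const ratio = cacheHitRatio(s);
  const cacheOk = ratio === null || ratio > 95;
  // Integer comparison for "connections < 80% of max_connections".
  const connectionsOk = s.connections * 100 < 80 * s.maxConnections;
  return cacheOk && connectionsOk;
}
```

Note that the replication lag query only returns a meaningful value when run on a standby, so it is left out of this sketch.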

5.2 Slow Query Log

# postgresql.conf
log_min_duration_statement = 1000  # log statements running longer than 1s (value in ms)
log_lock_waits = on
log_temp_files = 0

6. Redis (Upstash) Monitoring

6.1 Key Metrics

Metric	Healthy	Alert
Memory Usage	<70% of quota	>85%
Cache Hit Rate	>90%	<80%
Connected Clients	<80% of limit	>90%
Evictions	0	>0

7. OpenRouter AI Monitoring

7.1 Metrics to Track

Metric	Purpose	Alert
Requests per Day	Usage/Cost	>10k/day
Cost per Day	Budget	>$50/day
Error Rate	Provider Issues	>1%
Latency p95	Performance	>10s
Token Usage	Cost Breakdown	Near monthly limit

7.2 Budget Alerts

# OpenRouter Dashboard or API
# Set daily/monthly budget alerts at 80% threshold
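
The 80% rule can be sketched as a small check over current spend. The daily budget of $50 comes from the metrics table; the monthly budget is not stated in this runbook, so the value in the usage note is a placeholder. `Budget`, `Spend`, and `budgetAlerts` are hypothetical names, and fetching the spend figures from OpenRouter is left out.

```typescript
interface Budget {
  dailyUsd: number;
  monthlyUsd: number;
}

interface Spend {
  todayUsd: number;
  monthToDateUsd: number;
}

interface BudgetAlert {
  scope: "daily" | "monthly";
  usedPercent: number; // rounded share of the budget already spent
}

const ALERT_FRACTION = 0.8; // alert at 80% of the budget

function budgetAlerts(spend: Spend, budget: Budget): BudgetAlert[] {
  const alerts: BudgetAlert[] = [];
  if (spend.todayUsd >= ALERT_FRACTION * budget.dailyUsd) {
    alerts.push({ scope: "daily", usedPercent: Math.round((spend.todayUsd * 100) / budget.dailyUsd) });
  }
  if (spend.monthToDateUsd >= ALERT_FRACTION * budget.monthlyUsd) {
    alerts.push({ scope: "monthly", usedPercent: Math.round((spend.monthToDateUsd * 100) / budget.monthlyUsd) });
  }
  return alerts;
}
```

For example, `budgetAlerts({ todayUsd: 45, monthToDateUsd: 300 }, { dailyUsd: 50, monthlyUsd: 1000 })` reports only the daily budget, since $45 is 90% of $50 while $300 is 30% of the placeholder monthly budget.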

8. Dashboard Overview

8.1 Recommended Dashboards

Sentry Overview Dashboard — Error rates, trends, top issues
Railway Production Dashboard — Real-time CPU, memory, requests
PostgreSQL pgAdmin Dashboard — Query performance, connections
Custom Grafana Dashboard — Cross-cutting view (optional)

8.2 Daily Health Check

# Automated via healthchecks.io or manual
□ Sentry: No new SEV-1 issues
□ Railway: All services green
□ Database: No long-running queries
□ Redis: Hit rate >90%
□ Backups: All checks passed
□ OpenRouter: Error rate <1%

9. Alert Response Procedures

SEV-1 (Critical)

Acknowledge — Immediately acknowledge in PagerDuty/Slack
Assess — Is the production system affected?
Communicate — Post in #cartly-alerts: "SEV-1: [Brief description] — investigating"
Escalate — Call CTO if not resolved within 15min
Resolve — Deploy fix or rollback
Post-mortem — Required within 48h

SEV-2 (Warning)

Acknowledge — Within 30min during business hours
Investigate — Identify root cause
Communicate — Post status in #cartly-alerts
Resolve — Within 4h or escalate to SEV-1

SEV-3 (Info)

Track — Log in issue tracker
Schedule — Fix in next sprint
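
The time limits in these procedures can be captured as a small policy table, which is useful if escalation reminders are ever automated. This is a sketch under that assumption; `POLICY` and `escalationDue` are hypothetical names, and business-hours handling for SEV-2 is deliberately left out.

```typescript
type Severity = "SEV-1" | "SEV-2" | "SEV-3";

interface EscalationPolicy {
  escalateAfterMin: number | null; // null: no time-based escalation
  escalation: string | null;       // what to do once the limit has passed
}

// Limits taken from the procedures above: 15min for SEV-1, 4h for SEV-2.
const POLICY: Record<Severity, EscalationPolicy> = {
  "SEV-1": { escalateAfterMin: 15, escalation: "Call CTO" },
  "SEV-2": { escalateAfterMin: 240, escalation: "Escalate to SEV-1" },
  "SEV-3": { escalateAfterMin: null, escalation: null },
};

// Returns the escalation step that is due for an unresolved alert, or null.
// Times are epoch milliseconds, e.g. from Date.now().
function escalationDue(severity: Severity, openedAtMs: number, nowMs: number): string | null {
  const policy = POLICY[severity];
  if (policy.escalateAfterMin === null) return null;
  const elapsedMin = (nowMs - openedAtMs) / (60 * 1000);
  return elapsedMin >= policy.escalateAfterMin ? policy.escalation : null;
}
```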

10. Open Items

Set up Sentry projects for cartly-api and cartly-web (DEV)
Configure healthchecks.io monitors for backups (DEV)
Configure Railway alerts (DEV)
Enable the PostgreSQL slow query log (DEV)
Set up OpenRouter budget alerts (DEV/PM)
Create an incident response dashboard in Grafana (DEV)

11. Emergency Contacts

Situation	Contact	Response Time
Production Down	DEV (9f66dba7) + CTO (b999c0b2)	Immediately
Monitoring System Unavailable	DEV (9f66dba7)	Within 1h
Backup Alert	DEV (9f66dba7)	Within 1h

Created: 2026-07-03 by Documentation Agent (a66674bf)
Review: After Sprint 1 completion, by CTO + DEV