ADR-005: Monitoring & Observability Stack

Status: Accepted
Date: 2026-07-03
Deciders: CTO (b999c0b2), DEV (9f66dba7)

Context

Cartly must be monitored to a production-ready standard. During incidents, the team must be able to find the root cause quickly. For a SaaS product with retail customers, uptime and data correctness are critical.

Requirements:

Error Tracking: Unhandled exceptions, API errors
Performance Monitoring: Latency, throughput, DB queries
Uptime Monitoring: External checks, alerting on outage
Logging: Centralized, searchable, subject to retention requirements (GDPR)
Alerting: Escalation for SEV-1/2 incidents, no alert fatigue

Decision

Stack:

Purpose            Tool                         Tier        Reason
Error Tracking     Sentry                       Free → Pay  Best JS/Node integration, source maps, GDPR-compliant
Uptime Monitoring  Better Uptime                Free        60 checks/min, status page included
Log Aggregation    Self-hosted (Grafana Loki)   Free        GDPR-compliant (no US cloud), scalable
Metrics            Grafana + Prometheus         Free        Standard, good Railway/Neon integration
Status Page        Better Uptime Public Status  Free        Customer-facing, automatic updates
Alerts             PagerDuty / Slack            Pay         CI integration, escalation chains

Why Sentry, Better Uptime, and the Grafana stack:

Sentry: No self-hosting required, inexpensive for an MVP, EU data residency option for GDPR
Better Uptime: Decoupled from cloud providers, precise checks (HTTP/TCP/SSH), status page is free
Grafana Stack: Self-hosted = GDPR-compliant, no vendor lock-in for logs

Why NOT Datadog/New Relic:

Too expensive for a startup (Datadog ~$2k/month at production scale)
Too many features = alert fatigue
Overkill for an MVP

Consequences

Positive:

Sentry captures frontend and backend errors in one place
Self-hosted Grafana Loki = GDPR-compliant log storage
Better Uptime status page builds customer trust

Negative:

Self-hosted Grafana = maintenance effort (DEV must apply updates)
Better Uptime's free tier covers only 60 checks/min; with multiple services an upgrade may be needed
Alert fatigue is possible if alerts are not configured correctly
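
One common way to limit alert fatigue is to suppress repeats of the same alert within a cooldown period. The following is a minimal sketch of that idea, not part of the decided stack: the class name `AlertThrottle`, the alert keys, and the 15-minute default cooldown are all assumptions chosen for illustration.

```python
class AlertThrottle:
    """Suppresses repeats of the same alert key within a cooldown period.

    Hypothetical helper; the 900 s default is an assumed value, not a Cartly setting.
    """

    def __init__(self, cooldown_s: float = 900.0) -> None:
        self.cooldown_s = cooldown_s
        self._last_sent = {}  # alert key -> timestamp of the last notification sent

    def should_send(self, key: str, now: float) -> bool:
        """Return True if an alert for `key` should be sent at time `now` (seconds)."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # still inside the cooldown, drop the repeat
        self._last_sent[key] = now
        return True
```

Passing `now` explicitly keeps the sketch easy to test; a real caller would likely pass `time.monotonic()`. Cooldowns are tracked per key, so one noisy alert does not silence an unrelated one.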

Monitoring Requirements (Cartly-specific)

API:
  - Error Rate: > 1% → SEV-2 Alert
  - P99 Latency: > 2s → SEV-3 Alert
  - Health Check failures: > 3 → SEV-2 Alert
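
The API thresholds above could be evaluated roughly as follows. This is a sketch under stated assumptions: `ApiWindow` and `evaluate_api` are hypothetical names, the evaluation window length is left open, P99 uses the nearest-rank method, and health check failures are assumed to be counted consecutively (the ADR does not say which).

```python
from dataclasses import dataclass


@dataclass
class ApiWindow:
    """Hypothetical summary of one evaluation window of API traffic."""
    total_requests: int
    failed_requests: int
    latencies_s: list[float]          # per-request latency in seconds
    consecutive_health_failures: int  # assumption: failures counted consecutively


def p99(latencies_s: list[float]) -> float:
    """Nearest-rank 99th percentile; 0.0 for an empty window."""
    if not latencies_s:
        return 0.0
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def evaluate_api(window: ApiWindow) -> list[tuple[str, str]]:
    """Return (severity, reason) pairs for every API threshold that is exceeded."""
    alerts: list[tuple[str, str]] = []
    if window.total_requests > 0:
        error_rate = window.failed_requests / window.total_requests
        if error_rate > 0.01:  # Error Rate: > 1% -> SEV-2
            alerts.append(("SEV-2", f"error rate {error_rate:.1%}"))
    latency = p99(window.latencies_s)
    if latency > 2.0:  # P99 Latency: > 2s -> SEV-3
        alerts.append(("SEV-3", f"P99 latency {latency:.2f}s"))
    if window.consecutive_health_failures > 3:  # Health Check failures: > 3 -> SEV-2
        alerts.append(("SEV-2", "health check failing"))
    return alerts
```

All comparisons are strict, matching the `>` in the thresholds: a window at exactly 1% errors or exactly 3 failed health checks raises nothing.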

Database (Neon):
  - Connection Pool: > 80% utilized → SEV-3
  - Slow Queries: > 1s → SEV-3
  - Replication Lag: > 5s → SEV-2

Business:
  - Failed Payments: > 0 → SEV-1
  - Auth Failures (brute force): > 10/min → SEV-2
  - Data Export Requests: GDPR compliance tracking
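
The "> 10/min" auth failure rule is a rate over a trailing window, which a sliding-window counter can approximate. The sketch below is illustrative only: `SlidingWindowCounter` and `auth_failure_severity` are hypothetical names, and whether failures are counted globally, per IP, or per account is not specified in this ADR (one counter per key would be one option).

```python
from collections import deque

AUTH_FAILURES_PER_MIN_LIMIT = 10  # from the rule: > 10/min -> SEV-2


class SlidingWindowCounter:
    """Counts events that occurred within the trailing `window_s` seconds."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self._events = deque()  # event timestamps, oldest first

    def record(self, now: float) -> int:
        """Record one event at time `now` and return the count inside the window."""
        self._events.append(now)
        cutoff = now - self.window_s
        # Drop timestamps that have fallen out of the trailing window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)


def auth_failure_severity(counter: SlidingWindowCounter, now: float):
    """Record one auth failure; return "SEV-2" once the limit is exceeded, else None."""
    count = counter.record(now)
    return "SEV-2" if count > AUTH_FAILURES_PER_MIN_LIMIT else None
```

The failed payments rule (> 0 → SEV-1) needs no window at all, since any single occurrence triggers it.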

Accepted: 2026-07-03