MONITOR-001: Monitoring & Observability Runbook
Status: Draft — Pending Observability Stack Implementation
Owner: Development + CTO
Review Cycle: Monthly after go-live
Last Updated: 2026-07-03
Overview
This runbook defines the monitoring and alerting setup for Cartly. It covers infrastructure monitoring, application performance monitoring (APM), error tracking, and business metrics.
Note: Sprint 1 defines the observability stack. Concrete alert thresholds will be entered after implementation.
1. Monitoring Stack
1.1 Selected Tools
| Tool | Purpose | Tier |
|---|---|---|
| Sentry | Error Tracking | Critical |
| Highlight.io | Session Replay / Frontend APM | High |
| healthchecks.io | Cron Job & Backup Monitoring | High |
| Railway Metrics | Infrastructure (CPU, Memory, Network) | High |
| PostgreSQL Analytics | Database Performance | Medium |
| Upstash Redis | Cache Hit Rates, Memory | Medium |
| OpenRouter | AI API Usage & Costs | Medium |
1.2 Architecture
┌─────────────────────────────────────────────────────────────┐
│                      Monitoring Stack                       │
│                                                             │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐ │
│  │ Sentry         │  │ Highlight.io   │  │ healthchecks   │ │
│  │ Error/Perf     │  │ Session Replay │  │ Cron/Backup    │ │
│  └────────────────┘  └────────────────┘  └────────────────┘ │
│                                                             │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐ │
│  │ Railway        │  │ PostgreSQL     │  │ Upstash        │ │
│  │ Infra Metrics  │  │ pg_stat_*      │  │ Redis Monitor  │ │
│  └────────────────┘  └────────────────┘  └────────────────┘ │
│                                                             │
│  ┌────────────────┐                                         │
│  │ OpenRouter     │                                         │
│  │ AI Usage/Cost  │                                         │
│  └────────────────┘                                         │
└─────────────────────────────────────────────────────────────┘
2. Sentry Configuration
2.1 Projects
| Project | Scope | DSN |
|---|---|---|
| cartly-api | Backend (Fastify + Prisma) | TBD |
| cartly-web | Frontend (React PWA) | TBD |
2.2 Alert Rules
| Alert | Condition | Severity | Notify |
|---|---|---|---|
| New Issue | Any new error fingerprint | SEV-2 | Slack #cartly-alerts |
| P1 Error Spike | >10 errors in 5min | SEV-1 | Slack #cartly-alerts + PagerDuty |
| Performance Degradation | p95 latency >2s | SEV-2 | Slack #cartly-alerts |
| Error Rate | >1% of requests | SEV-2 | Slack #cartly-alerts |
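The "P1 Error Spike" condition (>10 errors in 5min) is a sliding-window count. Sentry evaluates such rules itself; the sketch below only illustrates the semantics, for example when replaying historical error timestamps to sanity-check a threshold before it is entered. `ErrorSpikeDetector` and its parameters are hypothetical names and not part of any Sentry API.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorSpikeDetector:
    """Sliding-window counter mirroring the '>10 errors in 5min' rule."""

    def __init__(self, threshold: int = 10,
                 window: timedelta = timedelta(minutes=5)):
        self.threshold = threshold  # rule fires when the count EXCEEDS this
        self.window = window
        self._events = deque()      # error timestamps, oldest first

    def record(self, ts: datetime) -> bool:
        """Record one error at `ts`; return True if the rule would fire."""
        self._events.append(ts)
        # Drop events that fell out of the window ending at `ts`.
        while self._events and ts - self._events[0] > self.window:
            self._events.popleft()
        return len(self._events) > self.threshold
```

Timestamps are assumed to arrive in chronological order; out-of-order events would need sorting first.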
2.3 Release Tracking
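# GitHub Action: create the Sentry release, upload sourcemaps, record the deploy
- name: Upload sourcemaps and record deploy
  env:
    SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}  # assumed secret name
    SENTRY_ORG: cartly
    SENTRY_PROJECT: cartly-api
  run: |
    npx sentry-cli releases new "$GITHUB_SHA"
    # ./dist is an assumed build output directory
    npx sentry-cli sourcemaps upload --release "$GITHUB_SHA" ./dist
    npx sentry-cli releases finalize "$GITHUB_SHA"
    npx sentry-cli releases deploys "$GITHUB_SHA" new -e production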
3. Healthchecks.io Configuration
3.1 Monitored Services
| Check | Schedule | Timeout | Expected |
|---|---|---|---|
| daily-backup | 0 3 * * * | 1h | 200 |
| weekly-full-backup | 0 1 * * 0 | 2h | 200 |
| backup-verify | 0 4 1 * * | 4h | 200 |
| db-pitr-wal | */15 * * * * | 15min | 200 |
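Each check in the table only turns green if the job itself pings healthchecks.io. A minimal sketch of such a wrapper follows, assuming the standard `hc-ping.com` ping endpoint with its `/start` and `/fail` suffixes. `run_monitored`, `send_ping`, and the check UUID passed in are hypothetical names for illustration; the real UUIDs come from the healthchecks.io project once the checks are created.

```python
import urllib.request
from typing import Callable

PING_BASE = "https://hc-ping.com"  # healthchecks.io ping endpoint


def ping_url(check_uuid: str, signal: str = "") -> str:
    """Build the ping URL; `signal` is '' (success), 'start' or 'fail'."""
    if signal not in ("", "start", "fail"):
        raise ValueError(f"unsupported signal: {signal!r}")
    return f"{PING_BASE}/{check_uuid}" + (f"/{signal}" if signal else "")


def send_ping(url: str) -> None:
    """Best-effort GET; a monitoring outage must not break the job itself."""
    try:
        urllib.request.urlopen(url, timeout=10).close()
    except OSError:
        pass  # urllib.error.URLError and timeouts are OSError subclasses


def run_monitored(check_uuid: str, job: Callable[[], None],
                  send: Callable[[str], None] = send_ping) -> None:
    """Signal start, run the job, then signal success or failure."""
    send(ping_url(check_uuid, "start"))
    try:
        job()
    except Exception:
        send(ping_url(check_uuid, "fail"))
        raise
    send(ping_url(check_uuid))
```

Sending the start signal lets healthchecks.io measure run time, which is what makes the per-check timeout column meaningful.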
3.2 Alerting
# healthchecks.io integration
failure_emails:
- dev@cartly.io
- cto@cartly.io
4. Railway Metrics Alerts
4.1 Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| CPU Usage | >70% | >90% | Scale up or investigate |
| Memory Usage | >75% | >90% | Scale up or investigate |
| Disk Usage | >80% | >95% | Cleanup or extend volume |
| Request Rate | >1000 RPM | >5000 RPM | Rate limit or scale |
| Response Time p95 | >500ms | >2000ms | Profile / optimize |
| Error Rate | >0.5% | >2% | Investigate errors |
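The two-level thresholds above can be kept as data and evaluated by one small function, which makes them easy to review and change after go-live. This is a sketch only: the metric keys are hypothetical names, and how the values are fetched from Railway is left open.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    warning: float
    critical: float


# Values copied from the thresholds table; the keys are hypothetical names.
THRESHOLDS = {
    "cpu_pct": Threshold(70, 90),
    "memory_pct": Threshold(75, 90),
    "disk_pct": Threshold(80, 95),
    "requests_per_min": Threshold(1000, 5000),
    "p95_latency_ms": Threshold(500, 2000),
    "error_rate_pct": Threshold(0.5, 2),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning' or 'critical' (strictly greater-than, as in the table)."""
    t = THRESHOLDS[metric]
    if value > t.critical:
        return "critical"
    if value > t.warning:
        return "warning"
    return "ok"
```

For example, `classify("cpu_pct", 85)` yields `"warning"`, while a value of exactly 70 still counts as `"ok"` because the table uses strict inequalities.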
4.2 Railway Alerts Configuration
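# Via Railway CLI / Dashboard
# Placeholder syntax: verify with `railway --help` that an `alerts` subcommand is available;
# otherwise set these thresholds in the Railway dashboard.
railway alerts add --metric cpu --threshold 90 --severity critical
railway alerts add --metric memory --threshold 90 --severity critical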
5. PostgreSQL Monitoring
5.1 Key Metrics
| Metric | Query | Healthy |
|---|---|---|
| Connection Count | SELECT count(*) FROM pg_stat_activity; | <80% of max_connections |
| Cache Hit Ratio | SELECT sum(blks_hit)*100/sum(blks_hit+blks_read) FROM pg_stat_database; | >95% |
| Long Running Queries | SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '5 minutes'; | 0 |
| Replication Lag | SELECT now() - pg_last_xact_replay_timestamp() AS lag; | <30s |
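Once the query results are collected, the "Healthy" column can be checked in one place. The sketch below assumes the raw numbers have already been read from PostgreSQL by some other means; `postgres_problems` and its parameter names are hypothetical. It also guards against the case where `blks_hit + blks_read` is zero, which the cache hit ratio SQL would reject with a division-by-zero error.

```python
from typing import List, Optional


def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Percentage of block requests served from the buffer cache, or None if no data."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no block activity yet, ratio is undefined
    return 100.0 * blks_hit / total


def postgres_problems(blks_hit: int, blks_read: int, connections: int,
                      max_connections: int, long_running: int,
                      replication_lag_s: Optional[float] = None) -> List[str]:
    """Return one message per violated 'Healthy' condition from the table."""
    problems = []
    ratio = cache_hit_ratio(blks_hit, blks_read)
    if ratio is not None and ratio <= 95:
        problems.append(f"cache hit ratio {ratio:.1f}% (want >95%)")
    usage = 100.0 * connections / max_connections
    if usage >= 80:
        problems.append(f"connections at {usage:.0f}% of max_connections (want <80%)")
    if long_running > 0:
        problems.append(f"{long_running} queries running longer than 5 minutes")
    # Lag is only defined on a replica; pass None when checking the primary.
    if replication_lag_s is not None and replication_lag_s >= 30:
        problems.append(f"replication lag {replication_lag_s:.0f}s (want <30s)")
    return problems
```

An empty list means all four checks pass.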
5.2 Slow Query Log
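# postgresql.conf
log_min_duration_statement = 1000   # log statements that run longer than 1 s (value in ms)
log_lock_waits = on                 # log lock waits longer than deadlock_timeout
log_temp_files = 0                  # log every temporary file, regardless of size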
6. Redis (Upstash) Monitoring
6.1 Key Metrics
| Metric | Healthy | Alert |
|---|---|---|
| Memory Usage | <70% of quota | >85% |
| Cache Hit Rate | >90% | <80% |
| Connected Clients | <80% of limit | >90% |
| Evictions | 0 | >0 |
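Redis hit and eviction counters are cumulative, so a meaningful hit rate compares two snapshots instead of reading the lifetime totals. The sketch below assumes the snapshots are dictionaries with the `keyspace_hits`, `keyspace_misses`, and `evicted_keys` fields known from Redis `INFO stats`; whether Upstash exposes them under the same names is an assumption to confirm. The function names are hypothetical.

```python
from typing import Dict, List, Optional


def interval_hit_rate(prev: Dict[str, int], curr: Dict[str, int]) -> Optional[float]:
    """Hit rate in percent between two counter snapshots, or None if undefined."""
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    if hits < 0 or misses < 0 or hits + misses == 0:
        return None  # counters were reset, or there was no traffic
    return 100.0 * hits / (hits + misses)


def redis_alerts(prev: Dict[str, int], curr: Dict[str, int]) -> List[str]:
    """Apply the 'Alert' column for hit rate (<80%) and evictions (>0)."""
    alerts = []
    rate = interval_hit_rate(prev, curr)
    if rate is not None and rate < 80:
        alerts.append(f"cache hit rate {rate:.1f}% (alert below 80%)")
    evicted = curr["evicted_keys"] - prev["evicted_keys"]
    if evicted > 0:
        alerts.append(f"{evicted} keys evicted since last snapshot")
    return alerts
```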
7. OpenRouter AI Monitoring
7.1 Metrics to Track
| Metric | Purpose | Alert |
|---|---|---|
| Requests per Day | Usage/Cost | >10k/day |
| Cost per Day | Budget | >$50/day |
| Error Rate | Provider Issues | >1% |
| Latency p95 | Performance | >10s |
| Token Usage | Cost Breakdown | Near monthly limit |
7.2 Budget Alerts
# OpenRouter Dashboard or API
# Set daily/monthly budget alerts at 80% threshold
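If the budget alert has to be computed on our side, the 80% rule reduces to a single comparison. A minimal sketch follows; `budget_status` is a hypothetical helper, and how the spend figure is obtained from OpenRouter is deliberately left out.

```python
def budget_status(spent_usd: float, budget_usd: float,
                  alert_fraction: float = 0.8) -> str:
    """Return 'ok', 'alert' (at or above 80% of budget) or 'exceeded'."""
    if budget_usd <= 0:
        raise ValueError("budget must be positive")
    used = spent_usd / budget_usd
    if used >= 1.0:
        return "exceeded"
    if used >= alert_fraction:
        return "alert"
    return "ok"
```

With the $50/day figure from the metrics table, the alert would trigger from $40 of daily spend.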
8. Dashboard Overview
8.1 Recommended Dashboards
- Sentry Overview Dashboard — Error rates, trends, top issues
- Railway Production Dashboard — Real-time CPU, memory, requests
- PostgreSQL pgAdmin Dashboard — Query performance, connections
- Custom Grafana Dashboard — Cross-cutting view (optional)
8.2 Daily Health Check
# Automated via healthchecks.io or manual
□ Sentry: No new SEV-1 issues
□ Railway: All services green
□ Database: No long-running queries
□ Redis: Hit rate >90%
□ Backups: All checks passed
□ OpenRouter: Error rate <1%
9. Alert Response Procedures
SEV-1 (Critical)
- Acknowledge — Immediately acknowledge in PagerDuty/Slack
- Assess — Is the production system affected?
- Communicate — Post in #cartly-alerts: "SEV-1: [Brief description] — investigating"
- Escalate — Call CTO if not resolved within 15min
- Resolve — Deploy fix or rollback
- Post-mortem — Required within 48h
SEV-2 (Warning)
- Acknowledge — Within 30min during business hours
- Investigate — Identify root cause
- Communicate — Post status in #cartly-alerts
- Resolve — Within 4h or escalate to SEV-1
SEV-3 (Info)
- Track — Log in issue tracker
- Schedule — Fix in next sprint
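The time limits in these procedures can be turned into concrete deadlines when an alert is raised, for instance to post them alongside the Slack message. This is a sketch under stated assumptions: the step names are hypothetical, all limits are measured from the moment the alert fires, and the "during business hours" qualifier for SEV-2 is not modeled.

```python
from datetime import datetime, timedelta
from typing import Dict

# Time limits taken from the procedures above; step names are hypothetical.
RESPONSE_POLICY = {
    "SEV-1": {"escalate_to_cto": timedelta(minutes=15)},
    "SEV-2": {
        "acknowledge": timedelta(minutes=30),
        "resolve_or_escalate": timedelta(hours=4),
    },
    "SEV-3": {},  # tracked in the issue tracker, no clock
}


def deadlines(severity: str, raised_at: datetime) -> Dict[str, datetime]:
    """Map each timed step of a severity level to its absolute deadline."""
    return {step: raised_at + limit
            for step, limit in RESPONSE_POLICY[severity].items()}


def overdue(severity: str, raised_at: datetime, now: datetime) -> list:
    """Names of steps whose deadline has already passed at `now`."""
    return [step for step, due in deadlines(severity, raised_at).items()
            if now > due]
```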
10. Open Items
11. Emergency Contacts
| Situation | Contact | Response Time |
|---|---|---|
| Production Down | DEV (9f66dba7) + CTO (b999c0b2) | Immediately |
| Monitoring System Unavailable | DEV (9f66dba7) | Within 1h |
| Backup Alert | DEV (9f66dba7) | Within 1h |
Created: 2026-07-03 by Documentation Agent (a66674bf)
Review: After Sprint 1 completion by CTO + DEV