Observability & SLOs
Metrics, traces, logs, alerts, runbooks — what we watch and what wakes us up.
What we track
Every bot tick, exchange call, order, reconcile run and ML prediction emits structured metrics + traces + logs.
- Metrics: Cloudflare Analytics Engine, plus Prometheus exporter at /api/public/metrics, plus push to Grafana Cloud Mimir (cron, every minute). Dashboards available to admins.
- Traces: W3C trace IDs, ALS-based spans, dual-emit to AE + ClickHouse server_traces. Waterfall view available to admins.
- Logs: structured logger auto-attaches trace_id/span_id. Searchable by admins.
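
To make the trace ID plumbing concrete, here is a minimal sketch of parsing and emitting a W3C Trace Context `traceparent` header (version 00) and attaching the IDs to a log record. The function and type names (`parseTraceparent`, `formatTraceparent`, `withTrace`, `TraceContext`) are illustrative assumptions, not the names used in our codebase, and the real logger picks up the context from the active span rather than from an explicit argument.

```typescript
// Sketch only: names here are hypothetical, not taken from the production code.

interface TraceContext {
  traceId: string; // 32 lowercase hex chars, must not be all zeros
  spanId: string; // 16 lowercase hex chars, must not be all zeros
  sampled: boolean; // bit 0 of the trace-flags byte
}

// traceparent (version 00): version-traceid-parentid-flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const traceId = m[1] as string;
  const spanId = m[2] as string;
  const flags = m[3] as string;
  // The spec treats all-zero IDs as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Merge trace_id/span_id into a structured log record.
function withTrace(
  ctx: TraceContext,
  fields: Record<string, unknown>,
): Record<string, unknown> {
  return { ...fields, trace_id: ctx.traceId, span_id: ctx.spanId };
}
```

An invalid or missing header returns `null`, in which case a caller would typically start a fresh trace instead of propagating a broken one.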
SLOs and paging
We page on real customer-visible problems, not noise. Active SLOs:
- Bot tick error rate / latency
- Exchange call error rate
- Reconciler latency
- Server-side error rate
- Stuck-order spike (>5 in 10min)
- Ghost-position detection (qty drift >5%)
- Decision-to-order p99 latency (>500ms warn, >1s critical)
- Synthetic E2E probe failure
- OHLCV integrity (gaps, NaN, jumps >1%)
- Payment-orphan detected
- Speculative-execution hit rate (<40%/24h)
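
As an illustration of the OHLCV integrity item, the sketch below flags the three conditions named in the list: gaps, NaN values, and jumps above 1%. It assumes a "gap" means two consecutive candles are not exactly one interval apart and a "jump" means the close moved more than 1% from the previous close; the production check may define both differently. `Candle`, `Issue` and `checkOhlcv` are hypothetical names.

```typescript
// Sketch only: field names and the exact gap/jump definitions are assumptions.

interface Candle {
  ts: number; // candle open time, ms since epoch
  open: number;
  high: number;
  low: number;
  close: number;
  volume: number;
}

interface Issue {
  kind: "gap" | "nan" | "jump";
  index: number; // index of the offending candle
}

function checkOhlcv(candles: Candle[], intervalMs: number, maxJump = 0.01): Issue[] {
  const issues: Issue[] = [];
  let prev: Candle | undefined;
  candles.forEach((c, i) => {
    // NaN / Infinity in any numeric field.
    const values = [c.open, c.high, c.low, c.close, c.volume];
    if (values.some((v) => !Number.isFinite(v))) {
      issues.push({ kind: "nan", index: i });
    }
    if (prev !== undefined) {
      // Any spacing other than exactly one interval counts as a gap here.
      if (c.ts - prev.ts !== intervalMs) {
        issues.push({ kind: "gap", index: i });
      }
      // Relative close-to-close move above the threshold.
      const change = Math.abs(c.close - prev.close) / Math.abs(prev.close);
      if (change > maxJump) {
        issues.push({ kind: "jump", index: i });
      }
    }
    prev = c;
  });
  return issues;
}
```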
Alerts page via PagerDuty with on-call rotation. Each SLO has a linked runbook.
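
The numeric thresholds above can be read as simple predicates. The following sketch shows one way to evaluate three of them: decision-to-order p99 (warn above 500ms, critical above 1s), the stuck-order spike (more than 5 in 10 minutes) and ghost-position drift (more than 5%). It assumes a nearest-rank percentile and drift measured relative to the exchange-reported quantity; the real alert rules may be computed differently, and all names are hypothetical.

```typescript
// Sketch only: percentile method and drift baseline are assumptions.

type Severity = "ok" | "warn" | "critical";

// Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

function classifyDecisionToOrder(latenciesMs: number[]): Severity {
  const p99 = percentile(latenciesMs, 99);
  if (p99 > 1000) return "critical";
  if (p99 > 500) return "warn";
  return "ok";
}

// True when more than 5 orders became stuck within the trailing 10 minutes.
function stuckOrderSpike(stuckAtMs: number[], nowMs: number): boolean {
  const windowMs = 10 * 60 * 1000;
  const recent = stuckAtMs.filter((t) => t > nowMs - windowMs && t <= nowMs);
  return recent.length > 5;
}

// True when locally tracked quantity drifts more than 5% from the exchange's.
function ghostPosition(localQty: number, exchangeQty: number): boolean {
  if (exchangeQty === 0) return localQty !== 0;
  return Math.abs(localQty - exchangeQty) / Math.abs(exchangeQty) > 0.05;
}
```

Using strict "greater than" comparisons matches the way the thresholds are written in the list: exactly 5 stuck orders or exactly 500ms does not fire.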
Kill switch & disaster recovery
A two-person kill-switch is wired into the admin console. Activating halts all bot ticks instantly. Deactivating requires a second admin to approve — one person can't restart trading alone.
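
The two-person rule can be modelled as a small state machine. This is a minimal sketch, assuming a release is first requested by one admin and then approved by a different one; the class, method and state names are hypothetical, and the admin console's actual flow (persistence, audit logging, timeouts) is not shown.

```typescript
// Sketch only: an in-memory model of the approval rule, not the real implementation.

type KillSwitchState =
  | { status: "trading" }
  | { status: "halted"; haltedBy: string }
  | { status: "release_pending"; haltedBy: string; requestedBy: string };

class KillSwitch {
  private state: KillSwitchState = { status: "trading" };

  // Bot ticks may run only while trading is enabled.
  canTick(): boolean {
    return this.state.status === "trading";
  }

  // Any single admin can halt trading at any time.
  activate(adminId: string): void {
    this.state = { status: "halted", haltedBy: adminId };
  }

  // First step of deactivation: one admin asks to resume.
  requestRelease(adminId: string): void {
    const s = this.state;
    if (s.status !== "halted") throw new Error("kill switch is not halted");
    this.state = { status: "release_pending", haltedBy: s.haltedBy, requestedBy: adminId };
  }

  // Second step: a different admin must approve before ticks resume.
  approveRelease(adminId: string): void {
    const s = this.state;
    if (s.status !== "release_pending") throw new Error("no release pending");
    if (adminId === s.requestedBy) throw new Error("a second admin must approve");
    this.state = { status: "trading" };
  }
}
```

The asymmetry is deliberate in this model: halting is a one-step action, resuming is a two-step action.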
DR: Supabase ↔ Neon warm standby. DB health probe every minute. Auto-flip to replica after 3 consecutive failures (sticky: manual recovery only). Daily DR snapshot at 03:30 UTC. Weekly DR drill cron measures real RTO / RPO; current numbers: RTO 1s, RPO 58s (well under the 15min RTO / 5min RPO targets).
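
The "3 consecutive failures, sticky" behaviour can be sketched as a counter that resets on any healthy probe and a target that never flips back on its own. The class and method names are hypothetical, and the probe itself (the per-minute DB health check) is represented only by the boolean passed to `record`.

```typescript
// Sketch only: models the flip rule, not the actual probe or connection routing.

type Target = "primary" | "replica";

class FailoverController {
  private consecutiveFailures = 0;
  private target: Target = "primary";
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // Feed in one health-probe result; returns the target to use afterwards.
  record(healthy: boolean): Target {
    // Sticky: once on the replica, probe results no longer change the target.
    if (this.target === "replica") return this.target;
    this.consecutiveFailures = healthy ? 0 : this.consecutiveFailures + 1;
    if (this.consecutiveFailures >= this.threshold) this.target = "replica";
    return this.target;
  }

  // Only an explicit operator action returns traffic to the primary.
  manualRecover(): void {
    this.target = "primary";
    this.consecutiveFailures = 0;
  }

  current(): Target {
    return this.target;
  }
}
```

Requiring consecutive failures means a single flaky probe does not trigger a flip, and keeping the flip sticky avoids flapping between databases while the primary is unstable.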