Observability & SLOs

Metrics, traces, logs, alerts, and runbooks: what we watch and what wakes us up.

What we track

Every bot tick, exchange call, order, reconcile run, and ML prediction emits structured metrics, traces, and logs.

Metrics: Cloudflare Analytics Engine, a Prometheus exporter at /api/public/metrics, and a cron push to Grafana Cloud Mimir every minute. Dashboards are available to admins.
Traces: W3C trace IDs and ALS-based spans, dual-emitted to Analytics Engine and ClickHouse server_traces. A waterfall view is available to admins.
Logs: the structured logger automatically attaches trace_id / span_id to every entry. Logs are searchable by admins.
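
To illustrate how a logger can pick up trace_id / span_id without passing them through every call, here is a minimal sketch. It is not the production code: it uses Python's contextvars as a stand-in for ALS-style context propagation, and the names span, log_line, and parse_traceparent are hypothetical. The header layout (version-traceid-parentid-flags) follows the W3C Trace Context format.

```python
import contextlib
import contextvars
import json
import secrets

HEX = set("0123456789abcdef")

# Active (trace_id, span_id) for the current task or thread; None outside a span.
_current_span = contextvars.ContextVar("current_span", default=None)


def parse_traceparent(header):
    """Return (trace_id, span_id) from a W3C traceparent header, or None.

    Simplification: only version "00" is accepted.
    """
    parts = header.split("-")
    if len(parts) != 4:
        return None
    version, trace_id, span_id, flags = parts
    if version != "00" or len(trace_id) != 32 or len(span_id) != 16 or len(flags) != 2:
        return None
    if not all(set(p) <= HEX for p in (trace_id, span_id, flags)):
        return None
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid per the spec
    return trace_id, span_id


@contextlib.contextmanager
def span(traceparent=None):
    """Open a span: reuse the incoming or ambient trace ID, mint a new span ID."""
    parent = parse_traceparent(traceparent) if traceparent else _current_span.get()
    trace_id = parent[0] if parent else secrets.token_hex(16)
    token = _current_span.set((trace_id, secrets.token_hex(8)))
    try:
        yield _current_span.get()
    finally:
        _current_span.reset(token)


def log_line(level, message, **fields):
    """Serialize a structured log entry, attaching the ambient trace context."""
    current = _current_span.get()
    trace_id, span_id = current if current else (None, None)
    return json.dumps(
        {"level": level, "msg": message, "trace_id": trace_id, "span_id": span_id, **fields}
    )
```

Inside `with span(incoming_header):`, every log_line call carries the same trace_id, and nested spans keep that trace_id while getting a fresh span_id.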

SLOs and paging

We page on real customer-visible problems, not noise. Active SLOs:

Bot tick error rate / latency
Exchange call error rate
Reconciler latency
Server-side error rate
Stuck-order spike (>5 in 10 min)
Ghost-position detection (qty drift >5%)
Decision-to-order p99 latency (>500ms warn, >1s critical)
Synthetic E2E probe failure
OHLCV integrity (gaps, NaN, jumps >1%)
Payment-orphan detected
Speculative-execution hit rate (<40% over 24h)
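
Three of the thresholds above can be sketched as simple checks. This is an illustration only, not the alerting code: the function and class names are hypothetical, and the drift definition (absolute difference relative to the exchange-reported quantity) is an assumption, since the list only states the 5% limit.

```python
from collections import deque


def latency_severity(p99_ms):
    """Map decision-to-order p99 latency to a severity (>500ms warn, >1s critical)."""
    if p99_ms > 1000:
        return "critical"
    if p99_ms > 500:
        return "warn"
    return "ok"


def is_ghost_position(local_qty, exchange_qty, tolerance=0.05):
    """Flag qty drift above 5%. Assumed definition: |local - exchange| / |exchange|."""
    if exchange_qty == 0:
        return local_qty != 0  # any local quantity against a flat exchange position
    return abs(local_qty - exchange_qty) / abs(exchange_qty) > tolerance


class StuckOrderWindow:
    """Sliding window that breaches when more than `limit` events fall in `window_s`."""

    def __init__(self, limit=5, window_s=600):
        self.limit = limit
        self.window_s = window_s
        self.events = deque()

    def record(self, ts):
        self.events.append(ts)

    def breached(self, now):
        # Drop events that are window_s or more older than `now`.
        while self.events and self.events[0] <= now - self.window_s:
            self.events.popleft()
        return len(self.events) > self.limit
```

All three comparisons are strict, matching the ">" in the list: exactly 5 stuck orders in the window, or a p99 of exactly 500ms, does not fire.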

Alerts page the on-call rotation via PagerDuty. Each SLO has a linked runbook.
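
As an example of what sits behind one of these SLOs, the OHLCV integrity check (gaps, NaN, jumps >1%) could look roughly like the sketch below. The candle layout (a dict with ts, open, high, low, close, volume), the fixed interval, and measuring jumps close-to-close are all assumptions made for illustration. The real check may differ.

```python
import math

FIELDS = ("open", "high", "low", "close", "volume")


def ohlcv_issues(candles, interval_s, max_jump=0.01):
    """Scan time-ordered candles and return a list of (kind, ts) findings.

    kind is "nan" (any field is NaN), "gap" (more than interval_s since the
    previous candle), or "jump" (close moved more than max_jump vs. previous close).
    """
    issues = []
    prev = None
    for candle in candles:
        ts = candle["ts"]
        has_nan = any(math.isnan(candle[f]) for f in FIELDS)
        if has_nan:
            issues.append(("nan", ts))
        if prev is not None:
            if ts - prev["ts"] > interval_s:
                issues.append(("gap", ts))
            prev_close = prev["close"]
            # Skip the jump check when either close is unusable.
            if not has_nan and not math.isnan(prev_close) and prev_close != 0:
                if abs(candle["close"] - prev_close) / abs(prev_close) > max_jump:
                    issues.append(("jump", ts))
        prev = candle
    return issues
```

A non-empty result would be what raises the alert. Which findings page and which only warn is a policy choice this sketch does not model.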

Kill switch & disaster recovery

A two-person kill switch is wired into the admin console. Activating it halts all bot ticks instantly. Deactivating it requires a second admin to approve, so one person can't restart trading alone.
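
The two-person rule can be sketched as a small state machine. This is an illustrative model, not the admin console's implementation: the KillSwitch class, its method names, and the request/approve split are hypothetical. The only behavior taken from the description above is that activation is immediate and deactivation needs a second, different admin.

```python
class KillSwitch:
    """Activation is immediate; release needs a request plus a different approver."""

    def __init__(self):
        self.active = False
        self._release_requested_by = None

    def activate(self, admin):
        # Halts trading at once and discards any pending release request.
        self.active = True
        self._release_requested_by = None

    def request_release(self, admin):
        if not self.active:
            raise RuntimeError("kill switch is not active")
        self._release_requested_by = admin

    def approve_release(self, admin):
        if self._release_requested_by is None:
            raise RuntimeError("no release request pending")
        if admin == self._release_requested_by:
            raise PermissionError("approver must differ from requester")
        self.active = False
        self._release_requested_by = None

    def ticks_allowed(self):
        return not self.active
```

In this model a bot tick would call ticks_allowed() before doing any work, and a re-activation while a release is pending cancels that request.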

DR: Supabase ↔ Neon warm standby. A DB health probe runs every minute, and after 3 consecutive failures the system auto-flips to the replica (sticky: recovery is manual only). A DR snapshot is taken daily at 03:30 UTC. A weekly DR drill cron measures real RTO / RPO; current numbers are RTO 1s and RPO 58s, well under the 15 min / 5 min targets.
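
The flip logic described above (3 consecutive failures, sticky, manual recovery) can be sketched as follows. The FailoverMonitor class and its method names are hypothetical and stand in for whatever the probe cron actually calls. Only the threshold and the sticky behavior come from the text.

```python
class FailoverMonitor:
    """Tracks probe results and flips to the replica after N consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.target = "primary"

    def record_probe(self, healthy):
        """Feed one probe result; return the database target to use."""
        if self.target == "replica":
            return self.target  # sticky: healthy probes do not flip back
        if healthy:
            self.consecutive_failures = 0  # a single success resets the streak
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.target = "replica"
        return self.target

    def manual_recover(self):
        """Operator action: return to the primary and clear the failure streak."""
        self.target = "primary"
        self.consecutive_failures = 0
```

Requiring consecutive failures keeps a single flaky probe from triggering a flip, and the sticky state avoids flapping between databases while the primary is still unstable.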