Meta-Learner (AI Orchestrator)

How the self-learning engine adapts strategy weights in real time.

What is the Meta-Learner?

The Meta-Learner is Smartbull's self-learning orchestrator. Instead of using fixed portfolio weights, it uses Thompson Sampling (a Bayesian multi-armed bandit algorithm) to continuously learn which strategy sleeves perform best in the current market regime.

Every tick, the Meta-Learner:

  • Observes each sleeve's recent PnL (win/loss days)
  • Updates a Beta distribution per sleeve (alpha = wins, beta = losses)
  • Samples from each distribution to generate "optimism scores"
  • Allocates capital proportionally to the sampled scores
  • Applies constraints: min 5% per active sleeve, max 35% single sleeve, hedge sleeves always get minimum allocation
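The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Smartbull's actual code: the function names, the uniform Beta(1, 1) prior, and the bisection used to enforce the 5% and 35% bounds are all hypothetical choices. The hedge-sleeve rule is not modeled separately; here every sleeve passed in simply receives at least the floor.

```python
import numpy as np

MIN_WEIGHT = 0.05  # floor per active sleeve (from the text above)
MAX_WEIGHT = 0.35  # cap on any single sleeve (from the text above)


def sample_scores(wins, losses, rng):
    """Draw one optimism score per sleeve from its Beta posterior."""
    # Assumption: +1 on each parameter (uniform prior), so a sleeve with
    # no history still has a valid Beta distribution.
    alpha = np.asarray(wins, dtype=float) + 1.0
    beta = np.asarray(losses, dtype=float) + 1.0
    return rng.beta(alpha, beta)


def allocate(scores, lo=MIN_WEIGHT, hi=MAX_WEIGHT):
    """Turn positive scores into weights in [lo, hi] that sum to 1.

    Finds a scale t such that clip(t * scores, lo, hi) sums to 1, so
    unconstrained sleeves stay proportional to their scores.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    if np.any(scores <= 0):
        raise ValueError("scores must be positive")
    if n * lo > 1.0 or n * hi < 1.0:
        raise ValueError("bounds are infeasible for this number of sleeves")

    # At t_hi every sleeve is at the cap, so the clipped sum is >= 1.
    t_lo, t_hi = 0.0, hi / scores.min()
    for _ in range(100):  # bisection on the monotone clipped sum
        t = 0.5 * (t_lo + t_hi)
        if np.clip(t * scores, lo, hi).sum() < 1.0:
            t_lo = t
        else:
            t_hi = t
    weights = np.clip(t_hi * scores, lo, hi)
    return weights / weights.sum()  # remove residual rounding error


rng = np.random.default_rng(seed=0)
scores = sample_scores(wins=[30, 10, 5, 20], losses=[10, 20, 5, 20], rng=rng)
weights = allocate(scores)
```

Because the scores are sampled rather than computed as posterior means, a sleeve with little history has a wide distribution and will occasionally draw a high score, which is how Thompson Sampling keeps exploring.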

This means strategies that are currently winning get more capital, and strategies that are underperforming get less — automatically, without any manual intervention.

Contextual Bandits (LinUCB) — the upgrade

Beyond pure Thompson Sampling, the system now supports LinUCB (Contextual Bandits) — a more sophisticated algorithm that models expected reward as a function of market context:

Context features (7 inputs, encoded as a numeric vector):

  • BTC regime state (one-hot: risk-on, chop, risk-off)
  • Realized volatility, 7d (normalized)
  • Realized volatility, 30d (normalized)
  • BTC perpetual funding rate (normalized)
  • LunarCrush Galaxy Score (normalized)
  • Average pairwise market correlation (normalized)
  • Hour of day (cyclical sin/cos encoding)
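As a rough illustration of how these features might be packed into a vector, the sketch below applies the one-hot and sin/cos encodings literally. The function name, argument order, and the assumption that the scalar inputs arrive already normalized are hypothetical. Spelled out this way, the 7 inputs become 10 numeric components (3 for the one-hot regime, 2 for the hour); the production encoding may pack them differently.

```python
import math

REGIMES = ("risk-on", "chop", "risk-off")


def build_context(regime, vol_7d, vol_30d, funding, galaxy, avg_corr, hour):
    """Encode market state as a flat list of floats.

    Scalar inputs are assumed to be normalized by the caller.
    """
    if regime not in REGIMES:
        raise ValueError(f"unknown regime: {regime!r}")
    one_hot = [1.0 if regime == r else 0.0 for r in REGIMES]

    # Cyclical encoding so that hour 23 and hour 0 end up close together.
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    hour_enc = [math.sin(angle), math.cos(angle)]

    return one_hot + [vol_7d, vol_30d, funding, galaxy, avg_corr] + hour_enc


context = build_context("chop", 0.4, 0.5, 0.1, 0.7, 0.3, hour=6)
```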

LinUCB maintains a matrix A and a vector b for each sleeve, both updated online after each reward observation. The UCB exploration bonus ensures the system keeps exploring undersampled sleeves while exploiting known winners.
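For reference, the textbook disjoint LinUCB update looks roughly like this. It is a minimal sketch, assuming one independent model per sleeve and an identity-initialized A; the class name and the exploration parameter `alpha` are illustrative and not taken from the Smartbull codebase.

```python
import numpy as np


class LinUCBArm:
    """Per-sleeve LinUCB state: A (d x d matrix) and b (d vector)."""

    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)      # ridge-regularized Gram matrix
        self.b = np.zeros(dim)    # reward-weighted sum of contexts
        self.alpha = alpha        # exploration strength (assumed value)

    def score(self, x):
        """Estimated reward plus the UCB exploration bonus for context x."""
        x = np.asarray(x, dtype=float)
        theta = np.linalg.solve(self.A, self.b)   # A^-1 b
        a_inv_x = np.linalg.solve(self.A, x)      # A^-1 x
        return float(theta @ x + self.alpha * np.sqrt(x @ a_inv_x))

    def update(self, x, reward):
        """Fold one (context, reward) observation into A and b."""
        x = np.asarray(x, dtype=float)
        self.A += np.outer(x, x)
        self.b += reward * x


arm = LinUCBArm(dim=3)
x = np.array([1.0, 0.0, 0.0])
before = arm.score(x)   # pure exploration bonus, no data yet
arm.update(x, reward=1.0)
after = arm.score(x)    # higher estimate, smaller bonus in this direction
```

The bonus term shrinks in directions of context space the sleeve has already seen, so a sleeve that has rarely been observed under the current conditions keeps a high score until it is tried.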

LinUCB is gated behind app_runtime_config.contextual_bandits_enabled. When it is disabled, the Meta-Learner falls back to Thompson Sampling.

Online Learning — real-time updates

Instead of waiting for daily batch updates, the Meta-Learner now performs per-fill incremental learning:

  • After every successful fill, the system checks whether the fill price was favorable relative to the mark price
  • Thompson Sampling: immediately increments alpha (win) or beta (loss)
  • LinUCB: updates the A matrix and b vector with current context and micro-reward
  • Updates are batched in memory and persisted every 10 updates or 60 seconds
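The Thompson Sampling branch of this flow might look like the sketch below. Several details are assumptions made for illustration: a buy counts as favorable when it fills below mark and a sell when it fills above, a fill exactly at mark counts as a loss, the counters start at a Beta(1, 1) prior, and `persist` is a hypothetical callback standing in for whatever storage the system uses.

```python
import time


class OnlineSleeveStats:
    """Per-fill Beta updates, persisted every N updates or T seconds."""

    def __init__(self, persist, max_pending=10, max_age_s=60.0,
                 clock=time.monotonic):
        self.alpha = 1.0            # wins (assumed uniform prior)
        self.beta = 1.0             # losses (assumed uniform prior)
        self._persist = persist     # hypothetical storage callback
        self._max_pending = max_pending
        self._max_age_s = max_age_s
        self._clock = clock
        self._pending = 0
        self._last_flush = clock()

    def on_fill(self, side, fill_price, mark_price):
        """Record one fill as a win or a loss, then flush if due."""
        if side == "buy":
            favorable = fill_price < mark_price
        elif side == "sell":
            favorable = fill_price > mark_price
        else:
            raise ValueError(f"unknown side: {side!r}")

        if favorable:
            self.alpha += 1.0
        else:
            self.beta += 1.0
        self._pending += 1
        self._maybe_flush()

    def _maybe_flush(self):
        age = self._clock() - self._last_flush
        if self._pending >= self._max_pending or age >= self._max_age_s:
            self._persist({"alpha": self.alpha, "beta": self.beta})
            self._pending = 0
            self._last_flush = self._clock()


saved = []
stats = OnlineSleeveStats(persist=saved.append)
stats.on_fill("buy", fill_price=99.0, mark_price=100.0)
```

In this sketch the 60-second condition is only evaluated when a fill arrives; a production system would more likely also flush from a background timer so that a quiet sleeve does not hold unpersisted updates indefinitely.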

This means the system adapts within minutes of market regime changes, not hours.

Retrieval-Augmented context (Pinecone)

Before making allocation decisions, the Meta-Learner queries a Pinecone vector database of past decisions and their outcomes. It finds the 5 most similar historical market conditions (by regime, volatility, correlation structure) and blends their allocation outcomes into the current decision.
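The retrieve-and-blend idea can be illustrated without a vector database. The sketch below uses an in-memory cosine-similarity search as a stand-in for the Pinecone query, and it assumes that each historical record stores the allocation that was used, that neighbors are weighted by similarity, and that history contributes a fixed 30% to the final weights. None of those specifics come from the source.

```python
import numpy as np


def top_k_similar(query, history_vectors, k=5):
    """Return indices and cosine similarities of the k nearest records."""
    q = np.asarray(query, dtype=float)
    h = np.asarray(history_vectors, dtype=float)
    norms = np.linalg.norm(h, axis=1) * np.linalg.norm(q) + 1e-12
    sims = (h @ q) / norms
    idx = np.argsort(-sims)[:k]  # most similar first
    return idx, sims[idx]


def blend_allocation(current, past_allocations, sims, blend=0.3):
    """Mix the current weights with a similarity-weighted historical mean."""
    current = np.asarray(current, dtype=float)
    past = np.asarray(past_allocations, dtype=float)
    w = np.clip(np.asarray(sims, dtype=float), 0.0, None)  # ignore negatives
    if w.sum() == 0.0:
        return current  # nothing similar enough to learn from
    historical = (w[:, None] * past).sum(axis=0) / w.sum()
    mixed = (1.0 - blend) * current + blend * historical
    return mixed / mixed.sum()


history_vectors = np.eye(6)                      # six toy market states
history_allocs = np.tile([1.0, 0.0], (6, 1))     # toy stored allocations
idx, sims = top_k_similar(history_vectors[2], history_vectors, k=5)
blended = blend_allocation([0.5, 0.5], history_allocs[idx], np.ones(5))
```

A real deployment would also want to weight neighbors by how well their allocations performed, not only by how similar the market state was.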

This combination means the system learns both from recent direct experience and from analogous historical situations — a form of meta-learning that adapts faster than pure online learning.