5. Multi-Armed Bandits with Probing#
Multi-Armed Bandits with Probing: Paper Explained by Dr. Fangli Ying#
UCBP and Its Theoretical Guarantees#
Authors: Eray Can Elumar, Cem Tekin, Osman Yağan
Institutions: Carnegie Mellon University & Bilkent University
Introduction to Multi-Armed Bandits (MAB)#
Classic MAB Problem: A decision-maker (agent) selects from \(K\) arms, each with an unknown reward distribution.
Goal: Maximize cumulative reward over \(T\) rounds by balancing exploration (learning arm rewards) and exploitation (choosing known high-reward arms).
Regret: Traditional regret = (Optimal arm’s total reward) - (Agent’s total reward).
Novel Extension: MAB with Probing
The agent can probe an arm (observe its realized reward) at cost \(c \geq 0\) before deciding whether to pull it or a backup arm.
Probing adds flexibility but enlarges the action space (now \(O(K^2)\) actions).
Key Applications#
Hyperparameter Optimization:
“Pull” = Run a model with a hyperparameter setting (no oversight).
“Probe” = Run with expert supervision (terminate poor runs early, paying \(c\) for expert time).
Online Learning with ML Advice:
Probing = Using ML predictions to estimate rewards (cost \(c\) = computation expense).
Wireless Communications:
Probing = Sending small packets to check channel quality before transmission.
Healthcare (ER Queuing):
Probing = Preliminary tests to assess patient urgency; “pull” = full treatment.
Problem Formulation#
Arms: \(K\) arms, each with reward distribution \(\Gamma_i\), mean \(\mu_i\).
Actions:
Pull: Directly pull arm \(i\), get reward \(r_i \sim \Gamma_i\). Denoted as \((i, \emptyset)\).
Probe: Choose probe arm \(i\) and backup arm \(j \neq i\):
Observe \(r_i \sim \Gamma_i\).
Pull \(i\) (net reward \(r_i - c\)) if \(r_i\) is high enough (i.e., exceeds a reference point); otherwise pull \(j\) (net reward \(r_j - c\)). Denoted as \((i, j)\).
Action Set: \(A = A_s \cup A_p\), where \(A_s\) (pulls) has size \(K\), \(A_p\) (probes) has size \(K(K-1)\). Total: \(|A| = K^2\).
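A quick sanity check of this accounting, as a minimal Python sketch (the arm count `K` is arbitrary):

```python
# Enumerate the action set A = A_s ∪ A_p and confirm |A| = K^2.
K = 4

pulls = [(i, None) for i in range(K)]                    # (i, ∅): pull arm i directly
probes = [(i, j) for i in range(K)
          for j in range(K) if j != i]                   # (i, j): probe i, backup j

assert len(pulls) + len(probes) == K ** 2                # K + K(K-1) = K^2
print(f"{len(pulls)} pulls + {len(probes)} probes = {K ** 2} actions")
```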
Optimal Action and Regret Definition#
Optimal Action: Maximizes expected reward \(\nu^*\), where:
For pull \((i, \emptyset)\): \(\nu_{(i, \emptyset)} = \mu_i\).
For probe \((i, j)\): \(\nu_{(i, j)} = \mathbb{E}[\max(r_i, \mu_j)] - c\).
Regret Definitions:
Empirical cumulative regret: \(\hat{R}_T = T\nu^* - \sum_{t=1}^T r(t)\).
Expected cumulative regret: \(R_T = \mathbb{E}[\hat{R}_T]\).
Key Innovation: Regret is relative to \(\nu^*\) (optimal action’s reward), not just the best arm’s mean \(\mu_1\).
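When rewards are discrete, these expected values can be computed exactly. Below is a minimal sketch; the support `D`, the per-arm probabilities, and the cost `c` are made-up illustrative values, not from the paper:

```python
import numpy as np

D = np.array([0.0, 0.5, 1.0])             # hypothetical common reward support
probs = np.array([[0.5, 0.3, 0.2],        # P(r_i = d) for arm 0
                  [0.2, 0.5, 0.3],        # arm 1
                  [0.1, 0.4, 0.5]])       # arm 2
c = 0.1                                   # hypothetical probing cost
mu = probs @ D                            # arm means mu_i

values = {}
for i in range(len(mu)):
    values[(i, None)] = mu[i]             # pull: nu_{(i,∅)} = mu_i
    for j in range(len(mu)):
        if j != i:                        # probe: nu_{(i,j)} = E[max(r_i, mu_j)] - c
            values[(i, j)] = probs[i] @ np.maximum(D, mu[j]) - c

best = max(values, key=values.get)
print(f"optimal action: {best}, nu* = {values[best]:.3f}")
```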
Baseline Algorithm: UCB-naive-probe#
Idea: Treats each probe action as a “super arm” with a fixed reference point (threshold for pulling the probe arm).
Action Space: Includes triples \((i, j, d_l)\), where \(d_l \in D\) (discrete reward support) is the reference point.
Steps:
1. At round \(t\), select the super arm with the highest UCB index.
2. If probing: observe \(r_i\); pull \(i\) if \(r_i \geq d_l\), otherwise pull backup \(j\) (the probing cost \(c\) is deducted either way).
Regret Bounds:
Gap-independent: \(O(K\sqrt{T \log T})\).
Gap-dependent: \(O(K^2 \log T)\).
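To see where the extra factor of \(K\) in these bounds comes from, here is a small sketch of the baseline's enlarged action space (the support `D` is a hypothetical example):

```python
# UCB-naive-probe super arms: (probe arm i, backup arm j, reference point d_l).
K = 3
D = [0.0, 0.5, 1.0]                       # hypothetical discrete reward support

super_arms = [(i, j, d) for i in range(K)
              for j in range(K) if j != i
              for d in D]

# K(K-1)|D| probe super arms on top of the K plain pulls.
print(len(super_arms))                    # 3 * 2 * 3 = 18
```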
Proposed Algorithm: UCBP (UCB with Probes)#
Goal: Reduce regret by leveraging the structure of probe actions and optimal reference points.
Core Insight: Use UCB indices to balance exploration/exploitation and dynamically set reference points.
UCB Indices Calculation:
Pull action: \(U_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{3 \log t}{N_i(t)}}\), where \(\hat{\mu}_i(t)\) = empirical mean of \(i\), \(N_i(t)\) = times \(i\) was observed.
Probe action: \(P_{(i,j)}(t) = \hat{\nu}_{(i,j)}(t) + \sqrt{\frac{3 \log t}{N_i(t)}} + \sqrt{\frac{3 \log t}{N_j(t)}}\), where \(\hat{\nu}_{(i,j)}(t)\) = empirical average of \(\max(r_i, \hat{\mu}_j(t)) - c\) over the observed samples of arm \(i\).
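A minimal sketch of both index computations, assuming the bookkeeping (empirical means `mu_hat`, observation counts `N`, and per-arm reward samples `samples`) is maintained elsewhere:

```python
import numpy as np

def pull_index(mu_hat, N, t, i):
    """U_i(t): empirical mean of arm i plus its confidence radius."""
    return mu_hat[i] + np.sqrt(3 * np.log(t) / N[i])

def probe_index(samples, mu_hat, N, t, i, j, c):
    """P_{(i,j)}(t): empirical estimate of E[max(r_i, mu_j)] - c,
    inflated by the confidence radii of both arms i and j."""
    nu_hat = np.mean(np.maximum(samples[i], mu_hat[j])) - c
    return nu_hat + np.sqrt(3 * np.log(t) / N[i]) + np.sqrt(3 * np.log(t) / N[j])
```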
UCBP Decision Process (Algorithm 2)#
Initialize: Sample each arm once to estimate initial means.
For each round \(t\):
a. Compute \(U_i(t)\) (pull indices) and \(P_{(i,j)}(t)\) (probe indices).
b. Choose the action with the highest index:
If pull: Pull arm \(i^* = \arg\max_i U_i(t)\), get \(r(t) = r_{i^*}\).
If probe: Probe arm \(j_t\), observe \(r_{j_t}\).
Pull \(j_t\) (reward \(r_{j_t} - c\)) if \(r_{j_t} > U_{k_t}(t)\) (backup arm’s UCB).
Else pull backup arm \(k_t\) (reward \(r_{k_t} - c\)).
Update \(\hat{\mu}_i\), \(N_i\), and indices.
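Putting the steps together, here is a self-contained simulation of one reading of the UCBP loop; the Bernoulli arms, their means, the cost, and the horizon are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([0.4, 0.6, 0.7])    # hypothetical Bernoulli arm means
K, T, c = len(mu_true), 10_000, 0.05

samples = [[] for _ in range(K)]       # observed rewards per arm

def draw(i):
    r = float(rng.random() < mu_true[i])
    samples[i].append(r)
    return r

for i in range(K):                     # initialization: sample each arm once
    draw(i)

total_reward = 0.0
for t in range(K + 1, T + 1):
    N = np.array([len(s) for s in samples])
    mu_hat = np.array([np.mean(s) for s in samples])
    rad = np.sqrt(3 * np.log(t) / N)   # per-arm confidence radii

    U = mu_hat + rad                   # pull indices U_i(t)
    best_pull = int(np.argmax(U))

    best_probe, best_P = None, -np.inf # probe indices P_{(i,j)}(t)
    for i in range(K):
        for j in range(K):
            if j == i:
                continue
            nu_hat = np.mean(np.maximum(samples[i], mu_hat[j])) - c
            P = nu_hat + rad[i] + rad[j]
            if P > best_P:
                best_probe, best_P = (i, j), P

    if U[best_pull] >= best_P:         # pull directly
        total_reward += draw(best_pull)
    else:                              # probe i; keep r_i iff it beats backup's UCB
        i, j = best_probe
        r_i = draw(i)
        total_reward += (r_i if r_i > U[j] else draw(j)) - c

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```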
Regret Decomposition#
Total regret \(R_T\) splits into two parts:
Action Selection Regret: Loss from choosing suboptimal actions (e.g., pulling a bad arm instead of probing).
For action \(a\), define the gap \(\Delta_a = \nu^* - \nu_a\); the total loss from suboptimal actions is \(\sum_{a \neq a^*} \Delta_a \cdot N_a(T)\), where \(N_a(T)\) = number of times \(a\) is chosen by round \(T\).
Reference Point Regret: Loss from using estimated reference points (instead of the true \(\mu_j\)) when deciding, after a probe, whether to keep \(r_i\) or fall back to \(j\).
Defined as \(R_{ref}(T) = \sum_{t=1}^T |\tilde{\mu}_j(t) - \mu_j| \cdot \mathbb{P}(r_i \in [\min(\mu_j, \tilde{\mu}_j(t)), \max(\mu_j, \tilde{\mu}_j(t))])\), where \(\tilde{\mu}_j(t)\) = estimated mean of \(j\).
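Schematically, combining the two parts (notation as above), the decomposition used by the analysis reads:
\[ R_T \le \sum_{a \neq a^*} \Delta_a \, \mathbb{E}[N_a(T)] + R_{ref}(T). \]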
Reference Point Regret Analysis#
Lemma IV.3:
a. For arbitrary bounded rewards: \(R_{ref}(T) = O(\sqrt{K T \log T})\).
b. For discrete rewards (support \(D\)): \(R_{ref}(T) = O(K \log T)\), with constants governed by \(\gamma_i = \min_l |d_l - \mu_i|\), the gap between \(\mu_i\) and the nearest support point (e.g., with \(D = \{0, 0.5, 1\}\) and \(\mu_i = 0.4\), \(\gamma_i = 0.1\)).
Rationale: Discrete rewards limit the range of estimation errors, making \(R_{ref}(T)\) smaller.
Gap-Independent Regret Bound for UCBP#
Theorem IV.1: Under Assumption 1 (\(\mathbb{P}(r_i \leq \mu_j) \geq \epsilon > 0\)), the action-selection regret of UCBP is \(O(\sqrt{K T \log T})\).
Result: Combined with \(R_{ref}(T) = O(\sqrt{K T \log T})\) from Lemma IV.3a, the total gap-independent regret is \(O(\sqrt{K T \log T})\).
Significance: Scales better than UCB-naive-probe (\(O(K\sqrt{T \log T})\)) for large \(K\).
Gap-Dependent Regret Bound for UCBP#
Theorem IV.2: For discrete rewards, the action-selection regret is logarithmic in \(T\), with constants expressed through gap parameters \(\delta_i\) associated with the suboptimal actions.
Result: With \(R_{ref}(T) = O(K \log T)\) from Lemma IV.3b, the total gap-dependent regret is \(O(K \log T)\).
Order Optimality of UCBP#
Lower Bound (Theorem IV.4): For any algorithm, regret is at least \(\Omega(K \log T)\).
Conclusion: UCBP’s gap-dependent regret \(O(K \log T)\) matches the lower bound, making it order-optimal.
Empirical Results (MovieLens Dataset)#
Setup: 18 arms (movie genres), rewards = user ratings (1-5). Compare UCBP vs. UCB-naive-probe over 500,000 rounds.
Key Finding: UCBP consistently outperforms UCB-naive-probe across probing costs \(c = 0, 0.3, 1\).
Both show logarithmic regret growth, but UCBP has smaller cumulative regret.
Contributions and Future Work#
Contributions:
First MAB model with costly probing for bounded rewards.
UCBP algorithm with gap-independent regret \(O(\sqrt{K T \log T})\) and order-optimal gap-dependent regret \(O(K \log T)\).
Novel regret definition accounting for optimal probe/pull actions.
Future Work:
Extend to noisy probes (instead of exact reward observations).
Handle correlated arm rewards.
For details, see the full paper: Multi-armed bandits with costly probes