4. Principles and Performance Comparison of ETC and UCB#
Table of Contents#
Core Concepts and Notations
ETC (Explore-Then-Commit) Algorithm
UCB (Upper Confidence Bound) Algorithm
Regret Analysis
Improved UCB: Asymptotically Optimal UCB
Key Conclusions
I. Core Concepts and Notations#
Key Symbols#
Suboptimality Gap (\(\Delta_i\)): \(\Delta_i = \mu^* - \mu_i\), where \(\mu^*\) is the optimal mean reward, and \(\mu_i\) is the mean reward of arm \(i\).
Horizon (\(n\)): Total number of rounds the algorithm runs.
Cumulative Regret (\(R_n\)): \( R_n = \mathbb{E}\left[\sum_{t=1}^{n} (\mu^* - \mu_{A_t})\right] \), the total expected loss from choosing suboptimal arms, relative to always playing the optimal arm.
Exploration vs. Exploitation: Fundamental trade-off between gathering information (exploration) and maximizing immediate reward (exploitation).
II. ETC (Explore-Then-Commit) Algorithm#
Two Phases#
Exploration Phase: Each of the \(k\) arms is sampled \(m\) times (the first \(km\) rounds).
Exploitation Phase: In remaining rounds, only the best-performing arm from exploration is chosen.
Core challenge: Determining the optimal exploration count \(m\).
ETC: Optimal Parameter \(m\) Selection#
Single Suboptimality Gap (\(\Delta\))#
When there’s only one suboptimal arm (gap = \(\Delta\)):
Optimal exploration count: \( m = \frac{4\log n}{\Delta^2} \)
Corresponding cumulative regret: \( R_n = \frac{4}{\Delta}\log n \)
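A minimal sketch of this procedure, assuming two Bernoulli arms and a known gap \(\Delta\) (both illustrative assumptions; the \(m\) formula above is plugged in directly):

```python
import math
import numpy as np

def etc_two_arms(means, n, delta, seed=0):
    """Explore-Then-Commit for two Bernoulli arms with known gap `delta` (a sketch)."""
    rng = np.random.default_rng(seed)
    m = max(1, math.ceil(4 * math.log(n) / delta ** 2))  # optimal exploration count
    counts = np.zeros(2)
    sums = np.zeros(2)
    total_reward = 0.0
    for t in range(n):
        if t < 2 * m:                        # exploration phase: alternate between arms
            arm = t % 2
        else:                                # exploitation phase: commit to empirical best
            arm = int(np.argmax(sums / counts))
        r = rng.binomial(1, means[arm])      # Bernoulli reward draw (assumed model)
        counts[arm] += 1
        sums[arm] += r
        total_reward += r
    return total_reward

# Hypothetical instance: mu* = 0.6, mu = 0.4 (Delta = 0.2), horizon n = 10_000
reward = etc_two_arms(means=[0.6, 0.4], n=10_000, delta=0.2)
```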
ETC: Deriving Optimal m and Regret (Single Gap)#
Scenario Setup#
2-armed bandit problem:
1 optimal arm with mean reward \(\mu^*\)
1 suboptimal arm with mean reward \(\mu\)
Suboptimality gap: \(\Delta = \mu^* - \mu\)
Total rounds (horizon): \(n\)
ETC Structure#
Exploration phase: \(2m\) rounds (each arm sampled \(m\) times)
Exploitation phase: Remaining \(n - 2m\) rounds (choose best arm from exploration)
Goal: Choose \(m\) to minimize cumulative regret \(R_n\)
Key Consideration#
Probability of misidentifying the optimal arm during exploration must be small.
Using concentration inequalities (e.g., Hoeffding’s):
Need enough samples (\(m\)) to distinguish \(\mu^*\) from \(\mu\)
Larger \(\Delta\): Easier to distinguish (smaller \(m\) needed)
Larger \(n\): Need more exploration to keep misidentification probability low
Deriving Optimal m#
To ensure misidentification probability is negligible (order \(1/n\)):
Mathematical derivation shows \(m\) must scale with:
\(\log n\) (to control probability over horizon)
\(1/\Delta^2\) (inverse square of gap, as larger gaps need fewer samples)
Result: Optimal exploration count \( m = \frac{4\log n}{\Delta^2} \)
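One way to see where this comes from (a sketch, assuming a Hoeffding-style tail bound on the exploration-phase sample means, consistent with the constants above): after \(m\) samples per arm, the misidentification probability satisfies

$$ \mathbb{P}(\hat{\mu} \geq \hat{\mu}^*) \leq \exp\left(-\frac{m\Delta^2}{4}\right) \leq \frac{1}{n} \quad\Longleftrightarrow\quad m \geq \frac{4\log n}{\Delta^2} $$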
Calculating Cumulative Regret#
With optimal \(m\), correct identification is highly likely:
Regret comes primarily from exploration phase
We sample the suboptimal arm \(m\) times
Each such sample incurs loss \(\Delta\)
Total regret: \(R_n = m \cdot \Delta\)
Substituting \(m\): \( R_n = \frac{4\log n}{\Delta^2} \cdot \Delta = \frac{4}{\Delta}\log n \)
Interpretation#
Regret grows logarithmically with horizon \(n\) (favorable scaling)
Regret decreases as gap \(\Delta\) increases (intuitive: easier to find optimal arm)
The factor of 4 comes from concentration inequality constants
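As a quick numeric sanity check (hypothetical values; natural logarithm assumed):

```python
import math

delta, n = 0.2, 10_000
m = 4 * math.log(n) / delta ** 2       # exploration pulls per arm
regret = (4 / delta) * math.log(n)     # cumulative regret bound

print(f"m ~ {m:.0f} pulls per arm, R_n ~ {regret:.0f}")
# prints: m ~ 921 pulls per arm, R_n ~ 184
```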
ETC: Optimal Parameter \(m\) Selection#
Multiple Arms (\(k\) arms)#
With suboptimality gaps \(\Delta_1, \Delta_2, ..., \Delta_k\):
Optimal exploration count is driven by the smallest gap \(\Delta_{\text{min}} = \min_{i \neq *} \Delta_i\): \( m = \left\lceil \frac{4\log n}{\Delta_{\text{min}}^2} \right\rceil \)
Cumulative regret as a sum of individual contributions: \( R_n = \sum_{\substack{i=1 \\ i \neq *}}^k \frac{4\log n}{\Delta_{\text{min}}^2} \cdot \Delta_i \)
ETC with Multiple Arms: Deriving Optimal m and Regret#
Scenario Setup#
\(k\) armed bandit problem
One optimal arm with mean reward \(\mu^*\)
\(k-1\) suboptimal arms with mean rewards \(\mu_1, \mu_2, ..., \mu_{k-1}\)
Suboptimality gaps: \(\Delta_i = \mu^* - \mu_i\) for each suboptimal arm \(i\)
Horizon (total rounds): \(n\)
Key Challenge with Multiple Arms#
Need to ensure we correctly identify the optimal arm during exploration.
For \(k\) arms, exploration phase requires \(k \cdot m\) rounds (each arm sampled \(m\) times)
Critical risk: Any suboptimal arm could be mistakenly identified as optimal if under-explored
Smallest gap \(\Delta_{\text{min}} = \min(\Delta_1, \Delta_2, ..., \Delta_{k-1})\) creates highest risk of misidentification
Why Minimum Squared Gap?#
The smallest gap \(\Delta_{\text{min}}\) determines required exploration:
Arms with smaller \(\Delta_i\) are harder to distinguish from the optimal arm
Mathematically: Probability of misidentification depends on \(m \cdot \Delta_i^2\)
To ensure all suboptimal arms are correctly identified, we must satisfy the most stringent condition (smallest \(\Delta_i\))
Thus: \(\Delta_{\text{min}}^2 = \min(\Delta_1^2, \Delta_2^2, ..., \Delta_{k-1}^2)\)
Deriving Optimal m#
Using concentration inequalities (e.g., Hoeffding’s) to bound misidentification probability:
For reliable identification of all suboptimal arms, \(m\) must satisfy: \( m \cdot \Delta_{\text{min}}^2 \geq 4\log n \)
Rearranging gives the minimum required \(m\): \( m = \left\lceil \frac{4\log n}{\Delta_{\text{min}}^2} \right\rceil \)
The ceiling function ensures \(m\) is an integer (cannot sample a fraction of rounds)
Calculating Cumulative Regret#
Regret comes from two sources:
Exploration phase: Each suboptimal arm is sampled \(m\) times, contributing \(m \cdot \Delta_i\) per arm
Exploitation phase: Negligible if optimal arm is correctly identified (highly likely with proper \(m\))
Total regret is the sum of exploration losses across the \(k-1\) suboptimal arms: \( R_n = \sum_{i=1}^{k-1} m \cdot \Delta_i \)
Substituting Optimal m#
Substitute \(m = \frac{4\log n}{\Delta_{\text{min}}^2}\) (the logarithmic factor controls the misidentification probability over horizon \(n\)) into the regret formula:
$$ R_n = \sum_{i=1}^{k-1} \frac{4\log n}{\Delta_{\text{min}}^2} \cdot \Delta_i $$
Interpretation#
The smallest gap \(\Delta_{\text{min}}\) dominates both \(m\) and total regret
Arms with larger \(\Delta_i\) are easier to rule out, yet each still contributes \(m \cdot \Delta_i\) to the regret, with \(m\) dictated by \(\Delta_{\text{min}}\)
Regret grows with \(k\) (more suboptimal arms) and \(\log n\) (horizon), but decreases with larger gaps
This formulation ensures all suboptimal arms are properly identified while minimizing total regret.
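A small helper applying these formulas to a hypothetical set of gaps (ETC assumes the gaps are known in advance):

```python
import math

def etc_multi_arm_bound(gaps, n):
    """Exploration count m and ETC regret bound, given the suboptimal arms' gaps."""
    d_min = min(gaps)
    m = math.ceil(4 * math.log(n) / d_min ** 2)    # set by the smallest gap
    regret_bound = sum(m * d for d in gaps)        # each suboptimal arm costs m * Delta_i
    return m, regret_bound

m, bound = etc_multi_arm_bound(gaps=[0.1, 0.3, 0.5], n=10_000)
# m = 3685, bound ~ 3316: the 0.1 gap forces heavy exploration of every arm
```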
ETC: Core Principle#
Performance heavily depends on \(m\):
Too small \(m\): Insufficient exploration (misidentifying the optimal arm)
Too large \(m\): Wasted rounds (increased regret)
\(m\) must be optimized based on \(\Delta\) and \(n\).
III. UCB (Upper Confidence Bound) Algorithm#
Decision Rule#
At round \(t\), select arm \(A_t\): \( A_t = \arg\max_i \left( \hat{\mu}_i(t-1) + \sqrt{\frac{4\log t}{T_i(t-1)}} \right) \)
Where:
\(\hat{\mu}_i(t-1)\): Sample mean reward of arm \(i\) up to round \(t-1\)
\(T_i(t-1)\): Number of times arm \(i\) has been selected up to round \(t-1\)
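A minimal implementation sketch of this decision rule, assuming Bernoulli arms with illustrative means:

```python
import math
import numpy as np

def ucb(means, n, seed=0):
    """UCB with the sqrt(4 log t / T_i) exploration bonus from the rule above (a sketch)."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    for t in range(1, n + 1):
        if t <= k:                                   # play each arm once to initialize
            arm = t - 1
        else:                                        # optimistic index: mean + bonus
            index = sums / counts + np.sqrt(4 * math.log(t) / counts)
            arm = int(np.argmax(index))
        r = rng.binomial(1, means[arm])              # Bernoulli reward draw (assumed model)
        counts[arm] += 1
        sums[arm] += r
    return counts                                    # how often each arm was pulled

pulls = ucb(means=[0.6, 0.5, 0.4], n=10_000)  # most pulls should go to arm 0
```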
UCB: Core Principle - Dynamic Confidence Intervals#
Each arm’s true mean \(\mu_i\) is enclosed in a confidence interval
Interval width shrinks as sample count \(T_i\) increases (more samples → more accurate estimates)
Optimal arm’s UCB converges to its true mean \(\mu^*\)
Suboptimal arms’ UCBs eventually fall below the optimal arm’s lower bound
Balances exploration (uncertain arms get higher weights) and exploitation (high mean arms preferred) dynamically.
UCB: Stopping Condition for Suboptimal Arms#
A suboptimal arm \(i\) stops being selected when \( \text{UCB}_i \leq \text{LCB}_* \), which is approximately equivalent to \( \text{UCB}_i - \text{LCB}_i \leq \Delta_i \) (derived below).
When the interval width of arm \(i\) is less than its suboptimality gap \(\Delta_i\), the algorithm is “confident” it’s suboptimal.
Corresponding selection count: \(T_i \approx 16\log n / \Delta_i^2\)
UCB: Stopping Condition for Suboptimal Arms#
Principle and Derivation#
Recall UCB Basics#
UCB (Upper Confidence Bound) algorithm selects arms based on: \( \text{UCB}_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{4 \log t}{T_i(t)}} \)
\(\hat{\mu}_i(t)\): Sample mean of arm \(i\) after \(t\) rounds
\(T_i(t)\): Number of times arm \(i\) has been selected by round \(t\)
\(\sqrt{\frac{4 \log t}{T_i(t)}}\): Exploration bonus (half the confidence interval width)
LCB (Lower Confidence Bound) for the optimal arm \(*\): \( \text{LCB}_*(t) = \hat{\mu}_*(t) - \sqrt{\frac{4 \log t}{T_*(t)}} \)
Stopping Condition Logic#
A suboptimal arm \(i\) stops being selected when: \( \text{UCB}_i \leq \text{LCB}_* \)
Intuition:
The upper bound of arm \(i\) (best-case scenario for \(i\)) is no better than the lower bound of the optimal arm \(*\) (worst-case scenario for \(*\)).
At this point, we are “confident” that arm \(i\) is suboptimal and will never select it again.
Equivalence to Interval Width Condition#
The stopping condition \(\text{UCB}_i \leq \text{LCB}_*\) can be rewritten as: \( \text{UCB}_i - \text{LCB}_i \leq \Delta_i \)
Derivation:
Start with \(\text{UCB}_i \leq \text{LCB}_*\)
Substitute definitions: \( \hat{\mu}_i + \sqrt{\frac{4 \log t}{T_i}} \leq \hat{\mu}_* - \sqrt{\frac{4 \log t}{T_*}} \)
Rearrange using \(\Delta_i = \mu_* - \mu_i\) (true gap) and approximate \(\hat{\mu}_* - \hat{\mu}_i \approx \Delta_i\) (for large \(T_i, T_*\)): \( \sqrt{\frac{4 \log t}{T_i}} + \sqrt{\frac{4 \log t}{T_*}} \leq \Delta_i \)
For simplicity, assume \(T_i \approx T_*\) (balanced sampling), so: \( 2\sqrt{\frac{4 \log t}{T_i}} \leq \Delta_i \implies \text{UCB}_i - \text{LCB}_i \leq \Delta_i \)
Why Interval Width < Gap?#
The interval width for arm \(i\) is \(\text{UCB}_i - \text{LCB}_i = 2\sqrt{\frac{4 \log t}{T_i}} = \sqrt{\frac{16 \log t}{T_i}}\)
When this width is smaller than \(\Delta_i\), arm \(i\)'s upper bound falls below the optimal arm's true mean: with high probability \(\mu_i \geq \text{LCB}_i\), so \( \text{UCB}_i = \text{LCB}_i + \text{width} \leq \mu_i + \text{width} < \mu_i + \Delta_i = \mu_* \), and since \(\mu_* \leq \text{UCB}_*\) (with high probability), arm \(i\) is no longer selected.
Selection Count \(T_i \approx 16\log n / \Delta_i^2\)#
To find when the interval width is less than \(\Delta_i\):
Set \(\sqrt{\frac{16 \log n}{T_i}} \leq \Delta_i\) (using horizon \(n\) as \(t \approx n\))
Square both sides: \(\frac{16 \log n}{T_i} \leq \Delta_i^2\)
Rearrange: \(T_i \geq \frac{16 \log n}{\Delta_i^2}\)
Meaning: Arm \(i\) needs to be sampled at least \(16\log n / \Delta_i^2\) times to ensure its interval width is smaller than \(\Delta_i\), after which it stops being selected.
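For a concrete feel, take hypothetical values \(\Delta_i = 0.5\) and \(n = 10^4\) (natural log):

$$ T_i \approx \frac{16 \log n}{\Delta_i^2} = \frac{16 \times \log(10^4)}{0.5^2} \approx \frac{16 \times 9.21}{0.25} \approx 590 \text{ pulls} $$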
Key Takeaways#
Stopping condition ensures suboptimal arms are permanently discarded once we’re confident they’re worse than the best arm.
The interval width shrinks as \(T_i\) increases (more samples → more precision).
The required number of samples \(T_i\) depends on \(\Delta_i\) (smaller gaps need more samples) and \(n\) (larger horizons need more samples to maintain confidence).
This balances exploration (sampling uncertain arms) and exploitation (focusing on proven good arms).
IV. Regret Analysis#
Cumulative Regret (\(R_n\)): Measures total loss compared to “always choosing the optimal arm”.
UCB Regret Bound#
With \(\delta = 1/n^2\) (confidence control): \( R_n \leq \sum_{\text{suboptimal } i} \frac{32\log n}{\Delta_i} \)
Smaller \(\Delta_i\) → larger regret (harder to identify suboptimality)
Recall UCB’s Core Components#
UCB selects arms using the index:
$$ \text{UCB}_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{4 \log t}{T_i(t)}} $$
where \(T_i(t)\) is the number of times arm \(i\) has been selected by round \(t\). A suboptimal arm \(i\) stops being selected when its interval width is smaller than its gap \(\Delta_i\):
$$ \text{UCB}_i - \text{LCB}_i \leq \Delta_i $$
This occurs when \(T_i \approx \frac{16 \log n}{\Delta_i^2}\) (from the earlier stopping condition).
Regret Contribution of a Suboptimal Arm#
For a suboptimal arm \(i\) with gap \(\Delta_i\), its total regret contribution comes from:
Number of times it is selected × Per-selection loss (\(\Delta_i\)).
Let \(T_i\) be the total selections of arm \(i\) by horizon \(n\).
Regret from arm \(i\): \(R_i = T_i \cdot \Delta_i\).
Bounding \(T_i\) for UCB#
To derive the regret bound, we first bound \(T_i\) (selections of suboptimal arm \(i\)):
From the stopping condition, arm \(i\) stops being selected when:
$$ \sqrt{\frac{16 \log n}{T_i}} \leq \Delta_i $$
Rearranging gives the maximum \(T_i\) needed to satisfy the condition:
$$ T_i \leq \frac{16 \log n}{\Delta_i^2} $$
This is the upper bound on how many times arm \(i\) is selected.
Calculating Regret for One Suboptimal Arm#
Substitute \(T_i \leq \frac{16 \log n}{\Delta_i^2}\) into the regret formula for arm \(i\):
$$ R_i = T_i \cdot \Delta_i \leq \frac{16 \log n}{\Delta_i^2} \cdot \Delta_i = \frac{16 \log n}{\Delta_i} $$
This is the regret contribution from a single suboptimal arm.
Why the Factor 32?#
The 32 comes from conservative confidence bounds and probability union bounds:
UCB uses a confidence parameter \(\delta = 1/n^2\) to control the probability of misestimation across all rounds and arms.
To ensure the bound holds with high probability (e.g., \(1 - 1/n\)), we multiply by a safety factor of 2.
This accounts for worst-case scenarios in concentration inequalities (e.g., Hoeffding’s inequality constants) and union bounds over multiple arms/rounds.
Result: \(16 \times 2 = 32\)
Total Cumulative Regret#
Summing over all suboptimal arms, the total regret bound becomes:
$$ R_n = \sum_{\text{suboptimal } i} R_i \leq \sum_{\text{suboptimal } i} \frac{32 \log n}{\Delta_i} $$
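A quick computation of this bound for a hypothetical instance, which makes the \(1/\Delta_i\) dependence visible:

```python
import math

def ucb_regret_bound(gaps, n):
    """Sum the per-arm bounds 32 log n / Delta_i over the suboptimal arms."""
    return sum(32 * math.log(n) / d for d in gaps)

bound = ucb_regret_bound(gaps=[0.1, 0.3, 0.5], n=10_000)
# per-arm terms ~ 2947, 982, 589: the smallest gap dominates the total
```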
Key Takeaways#
The 32 is a conservative constant derived from:
16 (from the interval width condition for stopping selections)
×2 (to account for confidence bounds and union probabilities with \(\delta = 1/n^2\)).
Smaller \(\Delta_i\) leads to larger regret because:
Arms with smaller gaps require more samples (\(T_i \propto 1/\Delta_i^2\)) to be confidently identified as suboptimal.
Each selection of such arms incurs a loss proportional to \(\Delta_i\), leading to \(R_i \propto 1/\Delta_i\).
Regret Comparison: ETC vs. UCB#
| Aspect | ETC | UCB |
|---|---|---|
| Advantages | Slightly lower regret for \(k=2\) with known \(\Delta\) | No prior knowledge of \(\Delta\) needed; better performance for \(k>2\) or small \(\Delta_i\) |
| Limitations | Relies on knowing \(\Delta\); poor adaptation to multiple small gaps | Slightly higher regret than optimally tuned ETC in simple cases (\(k=2\), known \(\Delta\)) |
V. Improved UCB: Asymptotically Optimal UCB#
Original UCB requires prior knowledge of \(n\). Asymptotically optimal UCB solves this, becoming an “anytime algorithm”:
Exploration bonus modified to: \( \text{UCB Index} = \hat{\mu}_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}} \quad \text{where} \quad f(t) = 1 + t\log^2(t) \)
Achieves a near-optimal regret bound: \( \limsup_{n \to \infty} \frac{R_n}{\sum_i \frac{2\log n}{\Delta_i}} \leq 1 \)
Original UCB vs. Asymptotically Optimal UCB#
Principle and Derivation Comparison#
1. Original UCB: Core Principles#
Decision Rule: Selects the arm with the highest upper confidence bound at each round: \( \text{UCB}_i(t) = \hat{\mu}_i(t-1) + \sqrt{\frac{4\log t}{T_i(t-1)}} \)
Key Limitation: Relies on knowing the total horizon \(n\) in advance (needed to set confidence bounds for regret guarantees).
Exploration Bonus: Scales with \(\log t\), where \(t\) is the current round, but tied to the fixed horizon \(n\) for regret analysis.
Regret Bound: \( R_n \leq \sum_{\text{suboptimal } i} \frac{32\log n}{\Delta_i} \) (constant factor 32, dependent on \(n\)).
2. Asymptotically Optimal UCB: Motivation#
Problem with Original UCB: Requires prior knowledge of \(n\) (total rounds), making it unsuitable for “anytime” scenarios (unknown horizon).
Goal: Design an algorithm that works without knowing \(n\) in advance and achieves near-optimal regret as \(n \to \infty\).
Intuition: Adjust the exploration bonus to grow appropriately with \(t\) (current round) instead of relying on \(n\), ensuring robustness to unknown horizons.
3. Asymptotically Optimal UCB: Modified Exploration Bonus#
New UCB Index: \( \text{UCB Index} = \hat{\mu}_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}} \), where \( f(t) = 1 + t\log^2(t) \) is a function growing with \(t\) that replaces the fixed horizon \(n\).
Why This Form?:
\(f(t)\) ensures the exploration bonus shrinks at the right rate: fast enough to avoid excessive exploration, slow enough to guarantee identification of optimal arms.
The coefficient 2 replaces 4 from original UCB, tightening the bound asymptotically.
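A sketch of the modified index; relative to the earlier UCB sketch only the bonus term changes (the helper name here is ours, not standard):

```python
import math
import numpy as np

def anytime_ucb_index(sums, counts, t):
    """Asymptotically optimal UCB index with f(t) = 1 + t * log(t)^2 replacing a fixed n."""
    f_t = 1 + t * math.log(t) ** 2
    return sums / counts + np.sqrt(2 * math.log(f_t) / counts)

# Drop-in replacement for the bonus in the earlier UCB sketch: after playing
# each arm once (so counts > 0), pick int(np.argmax(anytime_ucb_index(sums, counts, t))).
```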
4. Derivation of the Exploration Bonus#
Confidence Interval Adjustment: Uses concentration inequalities (e.g., Hoeffding) with a time-dependent confidence level. For unknown \(n\), the failure probability for each round \(t\) is bounded by \(1/t^2\) (instead of \(1/n^2\) in original UCB).
Summing Failures: By the union bound, the total failure probability over all rounds is bounded (since \(\sum 1/t^2\) converges).
Resulting Bonus: The term \(\log(f(t))\) emerges from balancing the need to control cumulative failure probability while avoiding over-exploration.
5. Near-Optimal Regret Bound#
Asymptotic Regret: \( \limsup_{n \to \infty} \frac{R_n}{\sum_i \frac{2\log n}{\Delta_i}} \leq 1 \)
Interpretation:
As \(n\) becomes very large, the regret of the improved UCB is at most the “ideal” regret (scaling with \(\sum \frac{2\log n}{\Delta_i}\)).
The constant factor approaches 1, making it asymptotically optimal (original UCB has a larger constant like 32).
6. Key Differences Summary#
| Aspect | Original UCB | Asymptotically Optimal UCB |
|---|---|---|
| Horizon Requirement | Needs prior knowledge of \(n\) | Works with unknown \(n\) (anytime) |
| Exploration Bonus | \(\sqrt{\frac{4\log t}{T_i}}\) | \(\sqrt{\frac{2\log f(t)}{T_i}}\) where \(f(t)=1+t\log^2(t)\) |
| Regret Constant | 32 (conservative) | Approaches 1 asymptotically |
| Use Case | Known horizon scenarios | Unknown horizon, large \(n\) scenarios |
7. Significance of Asymptotic Optimality#
Eliminates the need for tuning based on \(n\), making it more practical for real-world problems (e.g., online recommendation systems, A/B testing with variable duration).
Achieves the best possible regret scaling as \(n\) grows, matching the theoretical lower bound for multi-armed bandits (up to a constant factor).
VI. Key Conclusions#
ETC: Suitable for scenarios with known \(\Delta\) and few arms (\(k=2\)); Requires precise tuning of \(m\).
UCB: Better for unknown \(\Delta\), multiple arms (\(k>2\)), or small suboptimality gaps; More robust and widely used in practice.
Exploration-Exploitation Balance: UCB achieves dynamic balance via confidence intervals, while ETC uses fixed exploration—making UCB more adaptable to complex scenarios.