3. Explore-then-Commit (ETC) & Upper-Confidence-Bound (UCB)#
Review of Lectures#
Explore-then-Commit (ETC) & Upper-Confidence-Bound (UCB)
Bandit Algorithms for Sequential Decision Making
1. Explore–then–Commit (ETC) – High-Level Idea#
Phase 1 – Explore: pull every arm exactly \(m\) times.
Phase 2 – Commit: for all remaining rounds, pull the arm with the largest empirical mean from its \(m\) exploration pulls.
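A minimal simulation sketch of ETC, assuming Gaussian (hence 1-sub-Gaussian) rewards; the arm means, \(m\), and \(n\) below are illustrative values, not from the lecture:

```python
import numpy as np

def etc_regret(means, m, n, rng):
    """One run of Explore-then-Commit; returns the (pseudo-)regret."""
    means = np.asarray(means, dtype=float)
    k = len(means)
    gaps = means.max() - means                     # Delta_i = mu* - mu_i
    # Phase 1 - Explore: pull every arm exactly m times (unit-variance Gaussian rewards).
    samples = rng.normal(means, 1.0, size=(m, k))  # column i holds m pulls of arm i
    mu_hat = samples.mean(axis=0)                  # empirical means after exploration
    # Phase 2 - Commit: play the empirically best arm for the remaining n - m*k rounds.
    committed = int(np.argmax(mu_hat))
    return m * gaps.sum() + (n - m * k) * gaps[committed]

rng = np.random.default_rng(0)
runs = [etc_regret([0.8, 0.6], m=50, n=10_000, rng=rng) for _ in range(200)]
print(np.mean(runs))                               # average regret over 200 simulated runs
```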
2. Notation & Core Quantities#
| Symbol | Meaning |
|---|---|
| \(k\) | number of arms |
| \(n\) | horizon (total rounds) |
| \(m\) | exploration length per arm |
| \(\Delta_i = \mu^* - \mu_i\) | sub-optimality gap of arm \(i\) |
| \(\hat{\mu}_{i,m}\) | empirical mean of arm \(i\) after \(m\) pulls |
| \(\mu_i\) | true mean reward of arm \(i\) |
3. Regret Definition#
Total (expected) regret after \(n\) rounds:

\[
R_n \;=\; m\sum_{i=1}^{k}\Delta_i \;+\; (n-mk)\sum_{i=1}^{k}\Delta_i\,\mathbb{P}(\text{commit to arm } i).
\]

First term: regret accumulated during exploration.
Second term: regret from possibly committing to the wrong arm.
4. Bounding the Commitment Error#
When rewards are 1-sub-Gaussian, the probability of committing to a sub-optimal arm \(i\) satisfies

\[
\mathbb{P}\big(\hat{\mu}_{i,m} \ge \hat{\mu}_{1,m}\big) \;\le\; \exp\!\left(-\frac{m\Delta_i^2}{4}\right).
\]

Hence

\[
R_n \;\le\; m\sum_{i=1}^{k}\Delta_i \;+\; (n-mk)\sum_{i=1}^{k}\Delta_i \exp\!\left(-\frac{m\Delta_i^2}{4}\right).
\]
Understanding 1-sub-Gaussian Rewards#
Key Concepts & Probability Bounds#
Symbols & Setup#
\(\hat{\mu}_{i,m}\): Sample mean of \(m\) observations for option \(i\)
Option 1 is optimal: \(\mu_1\) (true mean) is largest
For suboptimal \(i\) (\(i \neq 1\)):
\(\Delta_i = \mu_1 - \mu_i > 0\) (true mean gap)
What is 1-sub-Gaussian?#
A random variable \(X\) with \(\mathbb{E}\!\left[e^{\lambda(X-\mathbb{E}X)}\right] \le e^{\lambda^2/2}\) for all \(\lambda\), i.e. strong concentration around its mean
Deviations from the mean have exponentially decaying tails: \(\mathbb{P}(X - \mathbb{E}X \ge \varepsilon) \le e^{-\varepsilon^2/2}\)
Consequently the sample mean \(\hat{\mu}_{i,m}\) stays close to the true mean \(\mu_i\): \(\mathbb{P}(\hat{\mu}_{i,m} - \mu_i \ge \varepsilon) \le e^{-m\varepsilon^2/2}\)
First Inequality Explained#
Event: Suboptimal \(i\)’s sample mean ≥ optimal 1’s sample mean
This implies:
\((\hat{\mu}_{i,m} - \mu_i) - (\hat{\mu}_{1,m} - \mu_1) \ge \mu_1 - \mu_i = \Delta_i\)
The left-hand side is zero-mean and sub-Gaussian with variance proxy \(2/m\), so the 1-sub-Gaussian property makes this event exponentially unlikely: its probability is at most \(\exp\!\left(-\frac{m\Delta_i^2}{4}\right)\)
Second Inequality Explained#
“Commit to sub-optimal \(i\)” = choosing \(i\) over 1
This happens only if \(\hat{\mu}_{i,m} \ge \hat{\mu}_{1,m}\)
Thus, its probability inherits the same bound: \(\mathbb{P}(\text{commit to } i) \le \exp\!\left(-\frac{m\Delta_i^2}{4}\right)\)
Example#
Optimal arm 1: \(\mu_1 = 10\) (1-sub-Gaussian, e.g., \(N(10,1)\))
Suboptimal arm 2: \(\mu_2 = 8\) (\(\Delta_2 = 2\))
With \(m=5\) samples:
\(\mathbb{P}(\text{choose arm 2}) \le \exp\!\left(-\frac{5 \times 2^2}{4}\right) = e^{-5} \approx 6.7 \times 10^{-3}\) (see the simulation sketch below)
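A quick Monte-Carlo sanity check of this bound, assuming Gaussian rewards as in the example above (the number of trials is an arbitrary, illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
m, mu1, mu2 = 5, 10.0, 8.0
trials = 100_000

# Draw m samples per arm in each trial and compare the sample means.
mean1 = rng.normal(mu1, 1.0, size=(trials, m)).mean(axis=1)
mean2 = rng.normal(mu2, 1.0, size=(trials, m)).mean(axis=1)
p_wrong = np.mean(mean2 >= mean1)          # empirical P(choose arm 2)

delta = mu1 - mu2
bound = np.exp(-m * delta**2 / 4)          # sub-Gaussian bound exp(-m*Delta^2/4)
print(f"empirical: {p_wrong:.5f}, bound: {bound:.5f}")
# The empirical frequency (around 8e-4) sits comfortably below the bound (around 6.7e-3).
```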
Takeaway#
Larger \(m\) (more samples) → smaller probability of committing to a suboptimal arm
Larger \(\Delta_i\) (bigger gap) → smaller error probability
Exponential decay ensures reliable selection with enough samples
5. Two-Arms Case – Choosing \(m\)#
For \(k=2\) and \(\Delta = \Delta_2\):

\[
R_n \;\le\; m\Delta + (n-2m)\,\Delta\exp\!\left(-\frac{m\Delta^2}{4}\right) \;\le\; m\Delta + n\Delta\exp\!\left(-\frac{m\Delta^2}{4}\right).
\]

Take the derivative w.r.t. \(m\) and set it to zero \(\Rightarrow\) the optimal choice

\[
m \;=\; \left\lceil \frac{4}{\Delta^2}\log\!\left(\frac{n\Delta^2}{4}\right) \right\rceil
\]

gives

\[
R_n \;\le\; \Delta + \frac{4}{\Delta}\left(1 + \log\!\left(\frac{n\Delta^2}{4}\right)\right).
\]
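Writing out the calculus step on the relaxed bound \(f(m) = m\Delta + n\Delta e^{-m\Delta^{2}/4}\) (treating \(m\) as continuous):

\[
f'(m) \;=\; \Delta - \frac{n\Delta^{3}}{4}\,e^{-m\Delta^{2}/4} \;=\; 0
\;\Longleftrightarrow\;
e^{m\Delta^{2}/4} \;=\; \frac{n\Delta^{2}}{4}
\;\Longleftrightarrow\;
m \;=\; \frac{4}{\Delta^{2}}\log\!\left(\frac{n\Delta^{2}}{4}\right).
\]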
6. General \(k\) Arms#
Optimal exploration length, for each sub-optimal arm \(i\):

\[
m_i \;\approx\; \frac{4\log n}{\Delta_i^2}.
\]

Resulting regret:

\[
R_n \;\lesssim\; \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{4\log n}{\Delta_i}\right) \;=\; O\!\left(\sum_{i:\Delta_i>0}\frac{\log n}{\Delta_i}\right).
\]
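A tiny numeric illustration of these formulas with made-up means and horizon (the values are assumptions for illustration only):

```python
import numpy as np

means = np.array([0.8, 0.7, 0.6])    # hypothetical arm means
n = 100_000                          # horizon
gaps = means.max() - means           # Delta_i

for i, d in enumerate(gaps):
    if d == 0:
        continue
    m_i = 4 * np.log(n) / d**2       # exploration pulls suggested for arm i
    r_i = d + 4 * np.log(n) / d      # arm i's contribution to the regret bound
    print(f"arm {i+1}: Delta={d:.1f}, m_i~{m_i:.0f}, regret term~{r_i:.1f}")
# Smaller gaps require far more exploration pulls and still contribute more regret.
```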
7. Practical Caveat#
Choosing \(m\) as above requires knowledge of the gaps \(\Delta_i\) and the horizon \(n\) – quantities that are often unknown in practice.
8. Data-Dependent Fix#
Instead of a fixed \(m\), keep exploring until one arm's empirical mean exceeds all the others' by more than the current confidence width – the data itself decides when to commit.
Achieves the same \(\sum_i\frac{4\log n}{\Delta_i}\) regret without prior knowledge of the \(\Delta_i\).
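A minimal two-arm sketch of such a data-dependent rule, assuming Gaussian rewards; the stopping threshold \(\sqrt{\tfrac{4}{t}\log\tfrac{n}{t}}\) used here is one standard choice and is an assumption, not necessarily the exact condition from the lecture:

```python
import numpy as np

def adaptive_etc_two_arms(means, n, rng):
    """Explore both arms alternately; commit once the empirical gap beats a confidence width."""
    gaps = max(means) - np.asarray(means, dtype=float)
    sums = np.zeros(2)
    t = 0                                          # pulls of each arm so far
    rounds = 0                                     # total rounds used
    regret = 0.0
    # Exploration with a data-dependent stopping time.
    while rounds + 2 <= n:
        for i in (0, 1):
            sums[i] += rng.normal(means[i], 1.0)
            regret += gaps[i]
        t += 1
        rounds += 2
        width = np.sqrt(4.0 / t * np.log(n / t))   # assumed confidence threshold
        if abs(sums[0] - sums[1]) / t > width:
            break
    # Commit to the empirically better arm for the remaining rounds.
    best = int(np.argmax(sums))
    regret += (n - rounds) * gaps[best]
    return regret

rng = np.random.default_rng(0)
print(np.mean([adaptive_etc_two_arms([0.8, 0.6], 10_000, rng) for _ in range(100)]))
```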
9. Scenario Comparison#
| Scenario | \(\Delta\) | Relative Regret |
|---|---|---|
| 1 | 0.2 | smaller |
| 2 | 0.1 | larger |
Larger \(\Delta\) \(\Rightarrow\) fewer mistakes, even though per-mistake regret is bigger.
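For instance, with an illustrative horizon of \(n = 10^{5}\) (so \(\log n \approx 11.5\)), the leading regret term \(\frac{4\log n}{\Delta}\) is

\[
\frac{4\log n}{0.2} \approx 230
\qquad\text{vs.}\qquad
\frac{4\log n}{0.1} \approx 460,
\]

so halving the gap roughly doubles the regret bound, even though each individual mistake costs only half as much.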
10. Motivation for UCB#
ETC weaknesses:
Needs \(\Delta_i\) to set \(m\).
Abrupt switch from explore to exploit.
All arms equally explored.
UCB addresses all three issues via optimism under uncertainty.
11. UCB – Main Idea#
In each round \(t\) compute an upper confidence bound (UCB) index for every arm:

\[
\mathrm{UCB}_i(t-1) \;=\; \hat{\mu}_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}},
\]

where \(\hat{\mu}_i(t-1)\) is arm \(i\)'s empirical mean and \(T_i(t-1)\) its number of pulls before round \(t\). Pull the arm

\[
A_t \;=\; \operatorname*{argmax}_i \mathrm{UCB}_i(t-1).
\]
12. Confidence Bound Lemma#
For 1-sub-Gaussian rewards, after \(T\) pulls of an arm,

\[
\mathbb{P}\!\left(\mu \;\ge\; \hat{\mu} + \sqrt{\frac{2\log(1/\delta)}{T}}\right) \;\le\; \delta .
\]

Hence choose \(\delta\) small (e.g. \(\delta = 1/n^2\)) so that, with high probability, every confidence bound holds simultaneously over the whole horizon.
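This is just the sub-Gaussian tail bound for the sample mean (applied to the downward deviation of \(\hat{\mu}\)), with the deviation level solved for \(\delta\):

\[
\mathbb{P}\big(\mu - \hat{\mu} \ge \varepsilon\big) \;\le\; \exp\!\left(-\frac{T\varepsilon^{2}}{2}\right),
\qquad
\varepsilon = \sqrt{\frac{2\log(1/\delta)}{T}}
\;\Longrightarrow\;
\exp\!\left(-\frac{T\varepsilon^{2}}{2}\right) = \exp\!\big(-\log(1/\delta)\big) = \delta .
\]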
13. UCB Algorithm (Pseudo-code)#
The UCB algorithm operates as follows:
Initialization: take the number of arms \(k\) and the confidence parameter \(\delta\) as input
For each round t from 1 to n:
For each arm i:
If the arm has never been pulled before, set its UCB value to +∞
Otherwise, calculate its UCB value as the empirical mean \(\hat{\mu}_i\) plus the exploration bonus \(\sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}}\), where \(T_i(t-1)\) is the number of times arm \(i\) has been pulled before round \(t\)
Select the arm with the highest UCB value: \(A_t = \operatorname*{argmax}_i \mathrm{UCB}_i(t-1)\)
Pull the selected arm and observe the reward \(X_t\)
Repeat this process until all n rounds are completed
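A runnable sketch of this pseudo-code in Python, assuming Gaussian rewards and the \(\delta = 1/n^{2}\) choice mentioned above (the arm means are illustrative):

```python
import numpy as np

def ucb(means, n, rng, delta=None):
    """UCB for 1-sub-Gaussian (here Gaussian, variance 1) reward distributions."""
    k = len(means)
    if delta is None:
        delta = 1.0 / n**2                   # common choice: delta = 1/n^2
    counts = np.zeros(k)                     # T_i(t-1): pulls of each arm so far
    sums = np.zeros(k)                       # running reward sums per arm
    pulls = np.zeros(n, dtype=int)
    bonus_const = 2 * np.log(1 / delta)
    for t in range(n):
        with np.errstate(divide="ignore", invalid="ignore"):
            index = np.where(
                counts == 0,
                np.inf,                      # unpulled arms: UCB = +infinity
                sums / counts + np.sqrt(bonus_const / counts),
            )
        a = int(np.argmax(index))            # A_t = argmax_i UCB_i(t-1)
        x = rng.normal(means[a], 1.0)        # pull arm A_t, observe reward X_t
        counts[a] += 1
        sums[a] += x
        pulls[t] = a
    return pulls

rng = np.random.default_rng(0)
pulls = ucb([0.8, 0.7, 0.6], n=10_000, rng=rng)
print(np.bincount(pulls, minlength=3))       # pull counts per arm over the horizon
```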
14. Intuition – Why UCB Works#
Confidence intervals shrink as \(T_i\) grows.
Best arm’s UCB \(\approx\) its true mean (many pulls).
A sub-optimal arm stops being pulled roughly once its confidence width falls below its gap:

\[
\sqrt{\frac{2\log(1/\delta)}{T_i}} \;<\; \Delta_i
\quad\Longleftrightarrow\quad
T_i \;>\; \frac{2\log(1/\delta)}{\Delta_i^2}.
\]

Hence the number of pulls of arm \(i\) is inversely proportional to \(\Delta_i^2\).
15. Visualization – 3 Arms#
Means: \(\mu_1=0.8,\ \mu_2=0.7,\ \mu_3=0.6\)
| Arm | \(\Delta_i\) | Typical pulls |
|---|---|---|
| 1 | 0 | \(\gg\log n\) |
| 2 | 0.1 | \(\propto\frac{\log n}{0.1^2}\) |
| 3 | 0.2 | \(\propto\frac{\log n}{0.2^2}\) |
Confidence intervals shrink until their width matches the respective \(\Delta_i\).
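Plugging in numbers for an illustrative horizon of \(n = 10^{4}\) (with the \(\delta = 1/n^{2}\) choice, so \(2\log(1/\delta) = 4\log n\)):

\[
T_2 \;\approx\; \frac{4\log n}{0.1^{2}} \approx 3700,
\qquad
T_3 \;\approx\; \frac{4\log n}{0.2^{2}} \approx 920,
\]

leaving roughly \(10^{4} - 3700 - 920 \approx 5400\) pulls for the optimal arm 1, consistent with the table above.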
16. Take-Away Summary#
| Algorithm | Needs \(\Delta_i\)? | Regret Bound | Abrupt Switch? |
|---|---|---|---|
| ETC | Yes | \(O\!\left(\frac{k\log n}{\Delta}\right)\) | Yes |
| UCB | No | \(O\!\left(\sum_i\frac{\log n}{\Delta_i}\right)\) | No (smooth) |
UCB achieves near-optimal regret while being fully adaptive.