3. Explore-then-Commit (ETC) & Upper-Confidence-Bound (UCB)*

Review of Lectures#

Explore-then-Commit (ETC) & Upper-Confidence-Bound (UCB)
Bandit Algorithms for Sequential Decision Making

1. Explore–then–Commit (ETC) – High-Level Idea#

Phase 1 – Explore: pull each of the \(k\) arms exactly \(m\) times (\(mk\) rounds in total).
Phase 2 – Commit: for the remaining \(n - mk\) rounds, pull the arm with the largest empirical mean from its \(m\) exploration pulls.
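
The two phases can be sketched in a few lines of Python. This is a minimal simulation rather than a reference implementation: the unit-variance Gaussian reward model, the function name `explore_then_commit`, and the round-robin exploration order are assumptions made for illustration.

```python
import random


def explore_then_commit(means, m, n, seed=0):
    """Simulate ETC on Gaussian arms with unit variance (assumed reward model)."""
    rng = random.Random(seed)
    k = len(means)
    if m < 1 or m * k > n:
        raise ValueError("need m >= 1 and m * k <= n")
    sums = [0.0] * k
    pulls = [0] * k
    # Phase 1: pull every arm exactly m times (round-robin order).
    for t in range(m * k):
        i = t % k
        sums[i] += rng.gauss(means[i], 1.0)
        pulls[i] += 1
    # Phase 2: commit to the arm with the largest empirical mean.
    committed = max(range(k), key=lambda i: sums[i] / m)
    pulls[committed] += n - m * k
    # Pseudo-regret: each pull of arm i costs its gap to the best mean.
    best = max(means)
    regret = sum((best - means[i]) * pulls[i] for i in range(k))
    return committed, pulls, regret


committed, pulls, regret = explore_then_commit([0.8, 0.7, 0.6], m=50, n=10_000)
print(committed, pulls, regret)
```

The returned regret is computed from the true means, so it is only available in simulation; a real deployment observes rewards, not gaps.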

2. Notation & Core Quantities#

Symbol	Meaning
\(k\)	number of arms
\(n\)	horizon (total rounds)
\(m\)	exploration length per arm
\(\mu_i\)	true mean reward of arm \(i\)
\(\mu^* = \max_i \mu_i\)	mean reward of the best arm
\(\Delta_i = \mu^* - \mu_i\)	sub-optimality gap of arm \(i\)
\(\hat{\mu}_{i,m}\)	empirical mean of arm \(i\) after \(m\) pulls

3. Regret Definition#

Expected regret after \(n\) rounds with exploration length \(m\) (requires \(mk \le n\)):

\[ R_n(m) = m \sum_{i=1}^{k}\Delta_i + (n-mk)\sum_{i=1}^{k}\Delta_i \mathbb{P}\left(\text{commit to arm }i\right) \]

First term: regret during exploration.
Second term: regret due to possibly wrong commitment.
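
As a quick numeric reading of the decomposition, the sketch below evaluates the two terms separately. The function name `etc_regret` and the commit probabilities passed in are illustrative assumptions; in practice those probabilities are unknown and have to be bounded.

```python
def etc_regret(gaps, m, n, commit_probs):
    """Return the exploration and commitment terms of R_n(m)."""
    k = len(gaps)
    if m * k > n:
        raise ValueError("need m * k <= n")
    # First term: every arm is pulled m times during exploration.
    exploration = m * sum(gaps)
    # Second term: remaining rounds, weighted by where ETC commits.
    commitment = (n - m * k) * sum(g * p for g, p in zip(gaps, commit_probs))
    return exploration, commitment


# Hypothetical two-arm instance: gap 0.5, wrong commitment 10% of the time.
exploration, commitment = etc_regret(
    gaps=[0.0, 0.5], m=10, n=1000, commit_probs=[0.9, 0.1]
)
print(exploration, commitment)  # roughly 5 and 49
```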

4. Bounding the Commitment Error#

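When rewards are 1-sub-Gaussian and arm 1 is optimal, the difference \(\hat{\mu}_{i,m} - \hat{\mu}_{1,m}\) has mean \(-\Delta_i\) and is \(\sqrt{2/m}\)-sub-Gaussian (two independent sample means, each with variance proxy \(1/m\)), so

\[ \mathbb{P}\!\left(\hat{\mu}_{i,m} \ge \hat{\mu}_{1,m}\right) \le \exp\!\left(-\frac{m\Delta_i^2}{4}\right) \]

Hence

\[ \mathbb{P}(\text{commit to sub-optimal }i) \le \exp\!\left(-\frac{m\Delta_i^2}{4}\right) \]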

Understanding 1-sub-Gaussian Rewards#

Key Concepts & Probability Bounds#

Symbols & Setup#

\(\hat{\mu}_{i,m}\): Sample mean of \(m\) observations for option \(i\)
Option 1 is optimal: \(\mu_1\) (true mean) is largest
For suboptimal \(i\) (\(i \neq 1\)):
\(\Delta_i = \mu_1 - \mu_i > 0\) (true mean gap)

What is 1-sub-Gaussian?#

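A random variable \(X\) with mean \(\mu\) is 1-sub-Gaussian if \(\mathbb{E}\!\left[e^{\lambda(X-\mu)}\right] \le e^{\lambda^2/2}\) for all \(\lambda \in \mathbb{R}\)
Deviations from the mean have Gaussian-like tails: \(\mathbb{P}(X - \mu \ge \varepsilon) \le e^{-\varepsilon^2/2}\)
The sample mean of \(m\) independent draws is \(1/\sqrt{m}\)-sub-Gaussian, so \(\hat{\mu}_{i,m}\) deviates from \(\mu_i\) by \(\varepsilon\) or more with probability at most \(e^{-m\varepsilon^2/2}\)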

First Inequality Explained#

\[\mathbb{P}\!\left(\hat{\mu}_{i,m} \ge \hat{\mu}_{1,m}\right) \le \exp\!\left(-\frac{m\Delta_i^2}{4}\right)\]

Event: Suboptimal \(i\)’s sample mean ≥ optimal 1’s sample mean
Since \(\mu_1 - \mu_i = \Delta_i\), this event is the same as:
\((\hat{\mu}_{i,m} - \mu_i) - (\hat{\mu}_{1,m} - \mu_1) \ge \Delta_i\)
The left-hand side is a difference of two independent \(1/\sqrt{m}\)-sub-Gaussian terms, hence \(\sqrt{2/m}\)-sub-Gaussian
The sub-Gaussian tail bound then gives \(\exp\!\left(-\frac{\Delta_i^2}{2 \cdot 2/m}\right) = \exp\!\left(-\frac{m\Delta_i^2}{4}\right)\)

Second Inequality Explained#

\[\mathbb{P}(\text{commit to sub-optimal }i) \le \exp\!\left(-\frac{m\Delta_i^2}{4}\right)\]

“Commit to sub-optimal \(i\)” = choosing \(i\) over 1
This happens only if \(\hat{\mu}_{i,m} \ge \hat{\mu}_{1,m}\)
Thus, its probability inherits the same bound

Example#

Optimal arm 1: \(\mu_1 = 10\) (1-sub-Gaussian, e.g., \(N(10,1)\))
Suboptimal arm 2: \(\mu_2 = 8\) (\(\Delta_2 = 2\))
With \(m=5\) samples:
\(\mathbb{P}(\text{choose arm 2}) \le \exp(-\frac{5 \times 2^2}{4}) = e^{-5} \approx 6.7 \times 10^{-3}\)
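
For Gaussian arms the commitment error can be computed exactly, which gives a sanity check on the sub-Gaussian bound. The sketch below assumes both arms are Gaussian with unit variance, so the difference of the two sample means is \(N(-\Delta, 2/m)\); that variance of \(2/m\) is why the exponent in the bound is \(m\Delta^2/4\). The helper names are illustrative.

```python
import math


def exact_commit_error(gap, m):
    """P(sample mean of arm i >= sample mean of arm 1), unit-variance Gaussians."""
    # The difference of sample means is N(-gap, 2/m); its upper tail at 0
    # equals 0.5 * erfc(gap * sqrt(m) / 2).
    return 0.5 * math.erfc(gap * math.sqrt(m) / 2.0)


def sub_gaussian_bound(gap, m):
    """Tail bound exp(-m * gap^2 / 4) for a sqrt(2/m)-sub-Gaussian difference."""
    return math.exp(-m * gap ** 2 / 4.0)


exact = exact_commit_error(gap=2.0, m=5)
bound = sub_gaussian_bound(gap=2.0, m=5)
print(f"exact = {exact:.2e}, bound = {bound:.2e}")
```

The exact probability sits below the bound, as it must; the bound is loose by roughly an order of magnitude here, which is typical of exponential tail inequalities.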

Takeaway#

Larger \(m\) (samples) → smaller probability of choosing suboptimal
Larger \(\Delta_i\) (bigger gap) → smaller error probability
Exponential decay ensures reliable selection with enough samples

5. Two-Arms Case – Choosing \(m\)#

For \(k=2\) and \(\Delta = \Delta_2\):

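\[ R_n(m) \le m\Delta + (n-2m)\Delta \exp\!\left(-\frac{m\Delta^2}{4}\right) \]

Differentiating with respect to \(m\) and setting the derivative to zero gives \(m \approx \frac{4}{\Delta^2}\log\frac{n\Delta^2}{4}\). The simpler choice

\[ m^* = \left\lceil\frac{4\log n}{\Delta^2}\right\rceil \]

makes the commitment-error probability at most \(1/n\) and (assuming \(2m^* \le n\)) gives

\[ R_n \le \frac{4\log n}{\Delta} + 2\Delta = O\!\left(\frac{\log n}{\Delta}\right) \]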
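
The trade-off in \(m\) can be checked numerically. The sketch below evaluates the two-arm upper bound \(m\Delta + (n-2m)\Delta\exp(-m\Delta^2/4)\), where the exponent uses the \(\sqrt{2/m}\)-sub-Gaussian tail of the difference of two sample means, and compares \(m^*\) with a brute-force minimizer. The values of `n` and `gap` are arbitrary illustrative choices.

```python
import math


def two_arm_bound(m, n, gap):
    """Upper bound m*gap + (n - 2m)*gap*exp(-m*gap^2/4) on two-arm ETC regret."""
    return m * gap + (n - 2 * m) * gap * math.exp(-m * gap ** 2 / 4.0)


n, gap = 10_000, 0.5
# Closed-form choice from the text.
m_star = math.ceil(4.0 * math.log(n) / gap ** 2)
# Brute-force minimizer of the bound over all feasible m.
m_best = min(range(1, n // 2 + 1), key=lambda m: two_arm_bound(m, n, gap))

print(m_star, round(two_arm_bound(m_star, n, gap), 2))
print(m_best, round(two_arm_bound(m_best, n, gap), 2))
```

The brute-force minimizer explores somewhat less than \(m^*\), but both values of the bound are of the same order, \(\log n / \Delta\).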

6. General \(k\) Arms#

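A sufficient exploration length is set by the smallest positive gap \(\Delta_{\min} = \min_{i:\Delta_i>0}\Delta_i\):

\[ m = \max_{i:\Delta_i>0}\left\lceil\frac{4\log n}{\Delta_i^2}\right\rceil = \left\lceil\frac{4\log n}{\Delta_{\min}^2}\right\rceil \]

Every arm is explored \(m\) times, so the resulting regret is

\[ R_n \le \frac{4\log n}{\Delta_{\min}^2}\sum_{i:\Delta_i>0}\Delta_i + \text{small terms} \]

which reduces to \(\frac{4(k-1)\log n}{\Delta}\) plus small terms when all sub-optimal arms share the same gap \(\Delta\).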

7. Practical Caveat#

Choosing \(m\) as above requires knowing the gaps \(\Delta_i\) and the horizon \(n\), both of which are often unknown in practice.

8. Data-Dependent Fix#

Instead of fixing \(m\) in advance, keep exploring until a data-dependent stopping condition holds:

\[ t \ge \text{threshold (function of observed data)} \]

This achieves regret of order \(\sum_{i:\Delta_i>0}\frac{\log n}{\Delta_i}\) without prior knowledge of the \(\Delta_i\).

9. Scenario Comparison#

Scenario	\(\Delta\)	Relative Regret
1	0.2	smaller
2	0.1	larger

Larger \(\Delta\) \(\Rightarrow\) fewer pulls are needed to tell the arms apart, so total regret is smaller even though each mistaken pull costs more.
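
The comparison can be made concrete with the leading term \(\frac{4\log n}{\Delta}\) of the two-arm ETC bound and the matching exploration length \(\frac{4\log n}{\Delta^2}\). The horizon `n` below is an arbitrary illustrative value, and the function name is hypothetical.

```python
import math


def leading_regret(gap, n):
    """Leading term 4*log(n)/gap of the two-arm ETC regret bound."""
    return 4.0 * math.log(n) / gap


n = 10_000
for gap in (0.2, 0.1):
    exploration_pulls = 4.0 * math.log(n) / gap ** 2
    print(f"gap={gap}: explore ~{exploration_pulls:.0f} pulls per arm, "
          f"regret ~{leading_regret(gap, n):.0f}")
```

Halving the gap quadruples the exploration needed but only halves the cost per mistaken pull, so the leading regret term doubles.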

10. Motivation for UCB#

ETC weaknesses:

Needs \(\Delta_i\) (and \(n\)) to set \(m\).
Switches abruptly from exploring to exploiting.
Explores all arms equally, no matter how clearly sub-optimal an arm already looks.

UCB addresses all three issues via optimism under uncertainty.

11. UCB – Main Idea#

In each round \(t\) compute an upper confidence bound (UCB) index for every arm:

\[ \text{UCB}_{i,t-1} = \underbrace{\hat{\mu}_{i,T_i(t-1)}}_{\text{empirical mean}} + \underbrace{\sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}}}_{\text{exploration bonus}} \]

Pull arm

\[ A_t = \arg\max_i \text{UCB}_{i,t-1} \]

12. Confidence Bound Lemma#

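For 1-sub-Gaussian rewards and any fixed number of samples \(s \ge 1\):

\[ \mathbb{P}\!\left(|\hat{\mu}_s - \mu| \le \sqrt{\frac{2\log(2/\delta)}{s}}\right) \ge 1-\delta \]

The one-sided version, \(\mathbb{P}\!\left(\mu \le \hat{\mu}_s + \sqrt{2\log(1/\delta)/s}\right) \ge 1-\delta\), is what justifies the exploration bonus. A statement that holds simultaneously for all \(s\) requires an additional union bound over \(s\).

Hence choose

\[ \delta_t = \frac{1}{t} \quad\Rightarrow\quad \text{bonus} = \sqrt{\frac{2\log t}{T_i(t-1)}} \]

(For rewards bounded in \([0,1]\), which are \(\tfrac{1}{2}\)-sub-Gaussian, the same bonus corresponds to \(\delta_t = 1/t^4\).)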

13. UCB Algorithm (Pseudo-code)#

The UCB algorithm operates as follows:

Input: the number of arms \(k\) and the confidence level \(\delta\)
For each round \(t\) from 1 to \(n\):
- For each arm \(i\):
  - If the arm has never been pulled, set its UCB value to \(+\infty\)
  - Otherwise, set its UCB value to the empirical mean \(\hat{\mu}_{i,T_i(t-1)}\) plus the exploration bonus \(\sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}}\), where \(T_i(t-1)\) is the number of times arm \(i\) was pulled before round \(t\)
- Select the arm with the highest UCB value: \(A_t = \arg\max_i \text{UCB}_{i,t-1}\)
- Pull the selected arm, observe the reward \(X_t\), and update that arm's pull count and empirical mean
Repeat this process until all \(n\) rounds are completed
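
The steps above translate fairly directly into Python. This is a minimal sketch under stated assumptions: rewards are simulated as unit-variance Gaussians, and the default confidence schedule `delta(t) = 1/t` is chosen so that the bonus equals \(\sqrt{2\log t / T_i(t-1)}\). The names `ucb_index` and `run_ucb` are illustrative.

```python
import math
import random


def ucb_index(total, pulls, t, delta):
    """Empirical mean plus exploration bonus; +inf for an arm never pulled."""
    if pulls == 0:
        return math.inf
    bonus = math.sqrt(2.0 * math.log(1.0 / delta(t)) / pulls)
    return total / pulls + bonus


def run_ucb(means, n, delta=lambda t: 1.0 / t, seed=0):
    """Simulate UCB for n rounds on unit-variance Gaussian arms (assumed model)."""
    rng = random.Random(seed)
    k = len(means)
    totals = [0.0] * k  # sum of observed rewards per arm
    pulls = [0] * k     # T_i(t-1): pulls of arm i before round t
    for t in range(1, n + 1):
        # Select the arm with the highest index (ties go to the lowest arm number).
        arm = max(range(k), key=lambda i: ucb_index(totals[i], pulls[i], t, delta))
        reward = rng.gauss(means[arm], 1.0)
        totals[arm] += reward
        pulls[arm] += 1
    return pulls


pulls = run_ucb([0.8, 0.7, 0.6], n=5_000)
print(pulls)
```

Because unpulled arms have an infinite index, the first \(k\) rounds pull each arm once; no separate initialization phase is needed.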

14. Intuition – Why UCB Works#

Confidence intervals shrink as \(T_i\) grows.
Best arm’s UCB \(\approx\) its true mean (many pulls).
A sub-optimal arm \(i\) is essentially no longer pulled once its bonus drops below half its gap:

\[ \sqrt{\frac{2\log t}{T_i(t-1)}} \le \frac{\Delta_i}{2} \quad\Rightarrow\quad T_i(t-1) \approx \frac{8\log t}{\Delta_i^2} \]

Hence the number of pulls of arm \(i\) grows like \(\log t / \Delta_i^2\): logarithmic in time and inversely proportional to the squared gap.

15. Visualization – 3 Arms#

Means: \(\mu_1=0.8,\ \mu_2=0.7,\ \mu_3=0.6\)

Arm	\(\Delta_i\)	Typical pulls
1	0	\(n - O(\log n)\)
2	0.1	\(\propto\frac{\log n}{0.1^2}\)
3	0.2	\(\propto\frac{\log n}{0.2^2}\)

Confidence intervals shrink until their width matches the respective \(\Delta_i\).
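
To put numbers on the table, the heuristic \(T_i \approx \frac{8\log n}{\Delta_i^2}\) can be evaluated for the three arms. The horizon `n` below is an arbitrary illustrative value, and the counts are order-of-magnitude estimates from the heuristic, not simulation results.

```python
import math

n = 1_000_000
means = [0.8, 0.7, 0.6]
best = max(means)

approx_pulls = {}
for arm, mu in enumerate(means, start=1):
    gap = best - mu
    if gap > 0:
        # Pulls needed before the bonus sqrt(2 log n / T) falls below gap / 2.
        approx_pulls[arm] = 8.0 * math.log(n) / gap ** 2

for arm, count in approx_pulls.items():
    print(f"arm {arm}: about {count:.0f} pulls")
```

Arm 2, with half the gap of arm 3, needs about four times as many pulls before UCB rules it out.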

16. Take-Away Summary#

Algorithm	Needs \(\Delta_i\)?	Regret Bound	Abrupt Switch?
ETC	Yes (and \(n\))	\(O\!\left(\frac{k\log n}{\Delta}\right)\) when all gaps equal \(\Delta\)	Yes
UCB	No	\(O\!\left(\sum_i\frac{\log n}{\Delta_i}\right)\)	No (smooth)

UCB achieves near-optimal regret while being fully adaptive.