3. Explore-then-Commit (ETC) & Upper-Confidence-Bound (UCB)*#


Review of Lectures#

Explore-then-Commit (ETC) & Upper-Confidence-Bound (UCB)
Bandit Algorithms for Sequential Decision Making


1. Explore–then–Commit (ETC) – High-Level Idea#

  • Phase 1 – Explore: pull every arm exactly \(m\) times.

  • Phase 2 – Commit: forever after pull the arm with the largest empirical mean from the first \(m\) pulls.


2. Notation & Core Quantities#

Symbol

Meaning

\(k\)

number of arms

\(n\)

horizon (total rounds)

\(m\)

exploration length per arm

\(\Delta_i = \mu^* - \mu_i\)

sub-optimality gap of arm \(i\)

\(\hat{\mu}_{i,m}\)

empirical mean of arm \(i\) after \(m\) pulls

\(\mu_i\)

true mean reward of arm \(i\)


3. Regret Definition#

Total regret after \(n\) rounds:

\[ R_n(m) = m \sum_{i=1}^{k}\Delta_i + (n-mk)\sum_{i=1}^{k}\Delta_i \mathbb{P}\left(\text{commit to arm }i\right) \]
  • First term: regret during exploration.

  • Second term: regret due to possibly wrong commitment.


4. Bounding the Commitment Error#

When rewards are 1-sub-Gaussian:

\[ \mathbb{P}\!\left(\hat{\mu}_{i,m} \ge \hat{\mu}_{1,m}\right) \le \exp\!\left(-\frac{m\Delta_i^2}{2}\right) \]

Hence

\[ \mathbb{P}(\text{commit to sub-optimal }i) \le \exp\!\left(-\frac{m\Delta_i^2}{2}\right) \]

Understanding 1-sub-Gaussian Rewards#

Key Concepts & Probability Bounds#


Symbols & Setup#

  • \(\hat{\mu}_{i,m}\): Sample mean of \(m\) observations for option \(i\)

  • Option 1 is optimal: \(\mu_1\) (true mean) is largest

  • For suboptimal \(i\) (\(i \neq 1\)):
    \(\Delta_i = \mu_1 - \mu_i > 0\) (true mean gap)


What is 1-sub-Gaussian?#

  • A random variable with strong concentration around its mean

  • Deviations from the mean have exponentially decaying tail probabilities

  • Sample mean \(\hat{\mu}_{i,m}\) stays close to true mean \(\mu_i\)


First Inequality Explained#

\[\mathbb{P}\!\left(\hat{\mu}_{i,m} \ge \hat{\mu}_{1,m}\right) \le \exp\!\left(-\frac{m\Delta_i^2}{2}\right)\]
  • Event: Suboptimal \(i\)’s sample mean ≥ optimal 1’s sample mean

  • This implies:
    \((\hat{\mu}_{i,m} - \mu_i) - (\hat{\mu}_{1,m} - \mu_1) \ge \Delta_i\)

  • 1-sub-Gaussian property makes this event exponentially unlikely


Second Inequality Explained#

\[\mathbb{P}(\text{commit to sub-optimal }i) \le \exp\!\left(-\frac{m\Delta_i^2}{2}\right)\]
  • “Commit to sub-optimal \(i\)” = choosing \(i\) over 1

  • This happens only if \(\hat{\mu}_{i,m} \ge \hat{\mu}_{1,m}\)

  • Thus, its probability inherits the same bound


Example#

  • Optimal arm 1: \(\mu_1 = 10\) (1-sub-Gaussian, e.g., \(N(10,1)\))

  • Suboptimal arm 2: \(\mu_2 = 8\) (\(\Delta_2 = 2\))

  • With \(m=5\) samples:
    \(\mathbb{P}(\text{choose arm 2}) \le \exp(-\frac{5 \times 2^2}{2}) = e^{-10} \approx 4.5 \times 10^{-5}\)


Takeaway#

  • Larger \(m\) (samples) → smaller probability of choosing suboptimal

  • Larger \(\Delta_i\) (bigger gap) → smaller error probability

  • Exponential decay ensures reliable selection with enough samples


5. Two-Arms Case – Choosing \(m\)#

For \(k=2\) and \(\Delta = \Delta_2\):

\[ R_n(m) \approx m\Delta + (n-2m)\Delta \exp\!\left(-\frac{m\Delta^2}{2}\right) \]

Take derivative w.r.t. \(m\) and set to zero \(\Rightarrow\) optimal

\[ m^* = \left\lceil\frac{4\log n}{\Delta^2}\right\rceil \]

gives

\[ R_n = O\!\left(\frac{\log n}{\Delta}\right) \]

6. General \(k\) Arms#

Optimal exploration length:

\[ m = \max_{i:\Delta_i>0}\frac{4\log n}{\Delta_i^2} \]

Resulting regret:

\[ R_n \le \sum_{i:\Delta_i>0}\frac{4\log n}{\Delta_i} + \text{small terms} \]

7. Practical Caveat#

Choosing \(m\) above requires knowledge of \(\Delta_i\) and \(n\)
often unknown in practice.


8. Data-Dependent Fix#

Instead of a fixed \(m\), keep exploring until:

\[ t \ge 2^t \text{ threshold (function of observed data)} \]

Achieves the same \(\sum_i\frac{4\log n}{\Delta_i}\) regret without prior \(\Delta_i\).


9. Scenario Comparison#

Scenario

\(\Delta\)

Relative Regret

1

0.2

smaller

2

0.1

larger

Larger \(\Delta\) \(\Rightarrow\) fewer mistakes, even though per-mistake regret is bigger.


10. Motivation for UCB#

ETC weaknesses:

  • Needs \(\Delta_i\) to set \(m\).

  • Abrupt switch from explore to exploit.

  • All arms equally explored.

UCB addresses all three issues via optimism under uncertainty.


11. UCB – Main Idea#

In each round \(t\) compute an upper confidence bound (UCB) index for every arm:

\[ \text{UCB}_{i,t-1} = \underbrace{\hat{\mu}_{i,T_i(t-1)}}_{\text{empirical mean}} + \underbrace{\sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}}}_{\text{exploration bonus}} \]

Pull arm

\[ A_t = \arg\max_i \text{UCB}_{i,t-1} \]

12. Confidence Bound Lemma#

For 1-sub-Gaussian rewards:

\[ \mathbb{P}\!\left(\forall s\ge 1,\ |\hat{\mu}_s - \mu| \le \sqrt{\frac{2\log(2/\delta)}{s}}\right) \ge 1-\delta \]

Hence choose

\[ \delta_t = \frac{1}{t^4} \quad\Rightarrow\quad \text{bonus} = \sqrt{\frac{2\log t}{T_i(t-1)}} \]

13. UCB Algorithm (Pseudo-code)#

The UCB algorithm operates as follows:

  1. Initialization: Take the number of arms k as input

  2. For each round t from 1 to n:

    • For each arm i:

      • If the arm has never been pulled before, set its UCB value to +∞

      • Otherwise, calculate its UCB value as: empirical mean \((μ̂_i)\) plus the exploration bonus \(\sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}}\) , where \(T_i(t-1)\) is the number of times arm i has been pulled before round \(t\)

    • Select the arm with the highest UCB value \((A_t = argmax_i UCB_{i,t-1})\)

    • Pull the selected arm and observe the reward \(X_t\)

  3. Repeat this process until all n rounds are completed


14. Intuition – Why UCB Works#

  • Confidence intervals shrink as \(T_i\) grows.

  • Best arm’s UCB \(\approx\) its true mean (many pulls).

  • Sub-optimal arm stops being pulled when

\[ \sqrt{\frac{2\log t}{T_i(t-1)}} \le \frac{\Delta_i}{2} \quad\Rightarrow\quad T_i(t-1) \approx \frac{8\log t}{\Delta_i^2} \]

Hence pulls of arm \(i\) are inversely proportional to \(\Delta_i^2\).


15. Visualization – 3 Arms#

Means: \(\mu_1=0.8,\ \mu_2=0.7,\ \mu_3=0.6\)

Arm

\(\Delta_i\)

Typical pulls

1

0

\(\gg\log n\)

2

0.1

\(\propto\frac{\log n}{0.1^2}\)

3

0.2

\(\propto\frac{\log n}{0.2^2}\)

Confidence intervals shrink until their width matches the respective \(\Delta_i\).


16. Take-Away Summary#

Algorithm

Needs \(\Delta_i\)?

Regret Bound

Abrupt Switch?

ETC

Yes

\(O\!\left(\frac{k\log n}{\Delta}\right)\)

Yes

UCB

No

\(O\!\left(\sum_i\frac{\log n}{\Delta_i}\right)\)

No (smooth)

UCB achieves near-optimal regret while being fully adaptive.