Session 2: Stochastic Multi-Armed Bandits#
Exploration and Exploitation#
Dr. Fangli Ying
ECUST, Shanghai, China
Email: yfangli@ecust.edu.cn
Table of Contents#
Basic Concepts of Multi-armed Bandits
Probability Basics
Different Bandit Problem Formulations
Stochastic Stationary Bandits
Regret: Definition and Decomposition
Explore-Then-Commit (ETC) Algorithm
Exploration-Exploitation Tradeoff
UCB Motivations
1. Basic Concepts of Multi-armed Bandits#
Problem Definition: A sequential game between a learner and an environment with uncertainty in decision outcomes.
Horizon: The game is played over \(n\) rounds (\(t = 1, 2, \dots, n\)).
Actions & Rewards: In each round \(t\), the learner chooses an action \(A_t\) from \(k\) possible actions, and receives a random reward \(X_t\).
Objective: Maximize the cumulative reward:
$$ \sum_{t=1}^n X_t = X_1 + X_2 + \dots + X_n $$
1. Basic Concepts (Cont.)#
Regret: The reward lost by taking suboptimal decisions. Defined as:
$$ \text{Regret} = \sum_{t=1}^n \mu^* - \sum_{t=1}^n \mu(A_t) $$
where \(\mu^*\) is the largest mean reward among all arms, and \(\mu(a)\) is the mean reward of arm \(a\).
Exploration vs. Exploitation:
Exploration: Gain information by selecting all actions to learn their rewards.
Exploitation: Choose the action with the highest observed reward to maximize immediate reward.
1. Basic Concepts (Cont.)#
Key Note: In multi-armed bandits, the learner's actions do not change the environment or the reward distributions.
Relationships:
A special case of Reinforcement Learning (RL).
Falls under Online Machine Learning (data is obtained on the go).
RL differs: Actions may change the environment and reward distributions.
Bandits vs. Reinforcement Learning#
| Multi-armed Bandits | Reinforcement Learning |
|---|---|
| Static reward distributions | Actions change environment |
| No state transitions | Stateful decision processes |
| Special case of RL | General framework |

Bandits fall under Online ML: decisions are made sequentially with streaming feedback.
2. Probability Basics#
Probability Space: Defined by 3 components:
Sample space \(\Omega\): Set of all possible outcomes.
\(\sigma\)-algebra \(\mathcal{F}\): Collection of subsets of \(\Omega\) (events).
Probability measure \(P: \mathcal{F} \to [0,1]\) assigning probabilities to events.
Example (Coin Toss):
\(\Omega = \{H, T\}\), \(\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \{H,T\}\}\)
\(P(H) = P(T) = \frac{1}{2}\), \(P(\emptyset) = 0\), \(P(\{H,T\}) = 1\)
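A minimal sketch of this probability space in Python (the dictionary layout and the use of `Fraction` are illustrative choices, not part of the lecture):

```python
from fractions import Fraction

# Sample space for a fair coin toss.
omega = frozenset({"H", "T"})

# Probability measure on the sigma-algebra (here the full power set of omega).
P = {
    frozenset(): Fraction(0),
    frozenset({"H"}): Fraction(1, 2),
    frozenset({"T"}): Fraction(1, 2),
    omega: Fraction(1),
}

assert P[omega] == 1                                          # P(Omega) = 1
assert P[frozenset({"H"})] + P[frozenset({"T"})] == P[omega]  # additivity on disjoint events
```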
Formal Definition of Random Experiments#
Sample Space (\(\Omega\)):
Set of all possible outcomes
Example: Coin toss → \(\Omega = \{H, T\}\)
Event Space (\(\mathcal{F}\)):
Subsets of \(\Omega\) representing measurable events
Probability Measure (\(P\)):
\(P: \mathcal{F} \rightarrow [0,1]\) satisfying:
\(P(\Omega) = 1\)
Countable additivity
2. Probability Basics (Cont.)#
Independence of Events: Two events \(E_1, E_2\) are independent if:
$$ P(E_1 \cap E_2) = P(E_1) \cdot P(E_2) $$
Conditional Probability: Probability of \(E_1\) given \(E_2\):
$$ P(E_1 | E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)} $$
Note: If \(E_1, E_2\) are independent, then \(P(E_1 | E_2) = P(E_1)\).
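A quick Monte-Carlo sanity check of these definitions, assuming two independent fair coin flips (the event names `E1`, `E2` and the sample size are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independent fair coin flips; E1 = "first flip is heads", E2 = "second flip is heads".
first = rng.integers(0, 2, size=n)
second = rng.integers(0, 2, size=n)
E1, E2 = first == 1, second == 1

p_e1 = E1.mean()
p_e2 = E2.mean()
p_both = (E1 & E2).mean()

print(p_both, p_e1 * p_e2)        # product rule for independent events, approximately equal
print(p_both / p_e2, p_e1)        # P(E1 | E2) = P(E1 and E2) / P(E2) reduces to P(E1) here
```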
Random Variables & Distributions#
Formal Definition#
Random Variable: Function \(X: \Omega \rightarrow \mathbb{R}\)
Assigns numerical values to outcomes
Example (Coin Toss):
$$ X = \begin{cases} 0 & \text{if Tails} \\ 1 & \text{if Heads} \end{cases}, \quad
Y = \begin{cases} 5 & \text{if Tails} \\ 10 & \text{if Heads} \end{cases} $$
2. Probability Basics (Cont.)#
Random Variable: A mapping \(X: \Omega \to \mathbb{R}\) assigning a real number to each outcome.
Discrete RV: Takes countable values. Probability mass function (PMF) \(P(X = x_k) = p_k\), with \(\sum p_k = 1\).
Continuous RV: Takes values in an uncountable set (e.g., \([0,1]\)). Probability density function (PDF) \(f_X(x)\), where \(P(X \in B) = \int_B f_X(x) dx\).
2. Probability Basics (Cont.)#
Expected Value: For a random variable \(X\):
Discrete: \(\mathbb{E}[X] = \sum_k x_k \cdot p_k\)
Continuous: \(\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) dx\)
Linearity of Expectation: For random variables \(X_1, X_2, \dots\):
$$ \mathbb{E}\left[\sum_{i=1}^{\infty} X_i\right] = \sum_{i=1}^{\infty} \mathbb{E}[X_i] $$
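A short simulation illustrating linearity of expectation; the distributions chosen for `x1` and `x2` below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# X1 ~ Bernoulli(0.3), X2 ~ Uniform[0, 1]; linearity holds regardless of dependence.
x1 = rng.binomial(1, 0.3, size=n)
x2 = rng.uniform(0.0, 1.0, size=n)

lhs = (x1 + x2).mean()          # empirical E[X1 + X2]
rhs = x1.mean() + x2.mean()     # empirical E[X1] + E[X2]
print(lhs, rhs)                 # both close to 0.3 + 0.5 = 0.8
```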
3. Different Bandit Problem Formulations#
Stationary Bandits: Reward distributions of actions are fixed over time.
Non-stationary (Restless) Bandits: Reward distributions may change over time (but not based on actions).
Structured Bandits: Known structure in reward distributions:
Linear Bandits: The reward of action \(a\) is a linear function of an unknown parameter \(\theta\) (e.g., \(x_t = \langle \theta, a \rangle\) plus noise).
Correlated Bandits: Rewards of different actions are correlated.
3. Different Bandit Problem Formulations (Cont.)#
Contextual Bandits:
Before each round, the learner observes context (e.g., user demographics).
Goal: Learn the best action for each context.
Application: Personalized recommendations.
4. Stochastic Stationary Bandits#
Model:
Set of actions \(A\) with \(|A| = k\).
Each action \(a\) has a reward distribution \(P_a\).
In round \(t\), learner chooses \(A_t \in A\), receives \(X_t \sim P_{A_t}\).
Goal: Maximize the expected cumulative reward \(\mathbb{E}[S_n]\), where \(S_n = \sum_{t=1}^n X_t\).
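A minimal sketch of such an environment, assuming Bernoulli reward distributions (the class name `BernoulliBandit` and its interface are illustrative, not a standard API):

```python
import numpy as np

class BernoulliBandit:
    """Stationary stochastic bandit: arm a has a fixed Bernoulli(mu_a) reward distribution P_a."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)   # mean rewards mu_a (hidden from the learner)
        self.k = len(self.means)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        """Play arm a in one round and return a random reward X_t ~ P_a."""
        return float(self.rng.random() < self.means[a])

    def optimal_mean(self):
        """mu* = max_a mu_a, used to measure regret."""
        return float(self.means.max())

# One round of interaction with a 3-armed instance.
env = BernoulliBandit([0.3, 0.5, 0.7])
x = env.pull(1)
```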
4. Stochastic Stationary Bandits (Cont.)#
Environment Class: Set of all possible reward distributions \(\{P_a: a \in A\}\) (e.g., Bernoulli with unknown \(\mu_a\)).
Mean Reward: For arm \(a\), \(\mu_a = \mathbb{E}[X_t | A_t = a]\).
Discrete case: \(\mu_a = \sum_j j \cdot P(\text{reward from } a = j)\).
Best Arm: Arm with the largest mean reward: \(\arg\max_{a \in A} \mu_a\), with \(\mu^* = \max_{a \in A} \mu_a\).
5. Regret: Definition and Decomposition#
Regret for Policy \(\pi\):
$$ R_n(\pi) = \mathbb{E}\left[\sum_{t=1}^n \mu^* - \sum_{t=1}^n \mu(A_t)\right] $$
(Expected difference between the optimal cumulative reward and the policy's cumulative reward.)
Suboptimality Gap: \(\Delta_a = \mu^* - \mu_a\) (gap between the best arm and arm \(a\); \(\Delta_a = 0\) for the best arm).
5. Regret Decomposition#
Let \(T_a(n)\) = number of times arm \(a\) is chosen in first \(n\) rounds.
Expected regret decomposition:
$$ R_n = \sum_{a \in A} \Delta_a \cdot \mathbb{E}[T_a(n)] $$
(Sum of the expected pull counts of each arm, weighted by its suboptimality gap.)
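A small numerical illustration of the decomposition, with made-up means and pull counts (a single run's \(T_a(n)\) stands in for \(\mathbb{E}[T_a(n)]\)):

```python
import numpy as np

# Illustrative numbers: true mean rewards and how often each arm was pulled over n = 1000 rounds.
means = np.array([0.3, 0.5, 0.7])
pull_counts = np.array([20, 30, 950])          # T_a(n) from one run (stand-in for E[T_a(n)])

gaps = means.max() - means                     # Delta_a = mu* - mu_a
regret = float(np.sum(gaps * pull_counts))     # R_n = sum_a Delta_a * E[T_a(n)]
print(regret)                                  # 0.4 * 20 + 0.2 * 30 + 0.0 * 950 = 14.0
```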
5. Regret Performance#
Sublinear Regret: \(R_n = o(n)\) (algorithm chooses best action almost always as \(n \to \infty\)).
Logarithmic Regret: \(R_n = O(\log n)\) (optimal in most cases).
Achieved when \(\mathbb{P}(\text{choosing suboptimal arm in round } t) \propto \frac{1}{t}\).
6. Explore-Then-Commit (ETC) Algorithm#
Idea: Separate exploration and exploitation phases.
Steps:
Exploration Phase: Play each arm a fixed number of times \(m\).
Exploitation Phase: Commit to the arm with the largest average reward from exploration.
Example: With \(k\) arms, each arm is explored \(m\) times. From round \(mk + 1\), the best observed arm is played.
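A minimal, self-contained sketch of ETC along these lines; the function name and the Bernoulli-arm simulator are illustrative, not the lecture's reference implementation:

```python
import numpy as np

def explore_then_commit(pull, k, n, m):
    """ETC: explore each of k arms m times (round-robin), then commit to the best empirical arm.

    `pull(a)` returns one random reward from arm a.
    """
    counts = np.zeros(k)
    sums = np.zeros(k)
    best = None
    total = 0.0
    for t in range(n):
        if t < m * k:
            a = t % k                                 # exploration phase: cycle through arms
        else:
            if best is None:
                best = int(np.argmax(sums / counts))  # commit once, after exploration ends
            a = best                                  # exploitation phase
        x = pull(a)
        counts[a] += 1
        sums[a] += x
        total += x
    return total

# Example run with three Bernoulli arms (means known only to the simulator).
rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.7]
print(explore_then_commit(lambda a: float(rng.random() < means[a]), k=3, n=1000, m=20))
```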
6. ETC Regret Analysis#
Regret During Exploration:
$$ \sum_{a \in A} m \cdot \Delta_a $$
(Each arm is chosen \(m\) times; the regret per selection is \(\Delta_a\).)
Regret During Exploitation:
If the wrong arm is chosen, the regret is \((n - mk) \cdot \Delta_{\hat{a}}\), where \(\hat{a}\) is the suboptimal arm selected.
Total Expected Regret: Sum of the exploration and exploitation regrets.
7. Exploration-Exploitation Tradeoff#
Exploration: Necessary to learn reward distributions but incurs regret from suboptimal actions.
Exploitation: Maximizes immediate reward but may miss better arms.
ETC Tradeoff: Choose \(m\) to balance exploration (larger \(m\) → better estimates, but more forced suboptimal pulls) against exploitation (smaller \(m\) → more rounds left for exploiting, but a higher risk of committing to the wrong arm).
Regret Bound of ETC Process#
High-Probability Bounds#
For 1-subgaussian rewards, the empirical mean \(\hat{\mu}_k(m)\) satisfies:
$$ P\left( |\hat{\mu}_k(m) - \mu_k| \geq \epsilon \right) \leq 2 \exp\left( -\frac{m \epsilon^2}{2} \right) $$
With \(m = \Omega\left( \frac{\log n}{\Delta_k^2} \right)\), the probability of misidentifying the optimal arm is negligible.
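A Monte-Carlo check of this tail bound, assuming unit-variance Gaussian rewards (which are 1-subgaussian); the values of \(m\), \(\epsilon\), and \(\mu\) below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, mu, trials = 50, 0.5, 0.7, 100_000

# Rewards ~ N(mu, 1), which is 1-subgaussian; estimate P(|mu_hat - mu| >= eps) over many repeats.
samples = rng.normal(mu, 1.0, size=(trials, m))
emp_means = samples.mean(axis=1)

empirical = np.mean(np.abs(emp_means - mu) >= eps)   # observed deviation frequency
bound = 2 * np.exp(-m * eps ** 2 / 2)                # 2 exp(-m eps^2 / 2) from the slide
print(empirical, bound)                              # empirical frequency should sit below the bound
```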
Regret Bound of ETC Process#
Theorem 6.1#
For ETC with \(1 \leq m \leq n/K\) in 1-subgaussian bandits:
$$ R_n \leq C \cdot \sum_{k=1}^K \frac{\log n}{\Delta_k} + O(K m) $$
where \(C\) is a constant. The optimal \(m\) balances exploration cost against misidentification risk.
Variants of ETC Policies: \(m\) and \(n\)#
Exploration Parameter \(m\)#
\(m\) controls exploration effort: Larger \(m\) reduces misidentification risk but increases exploration regret.
Optimal \(m\) depends on gaps \(\Delta_k\): \(m = \max\left\{ 1, \frac{\log n}{\Delta_k^2} \right\}\).
Expected regret of ETC for different \(m\) values#

Variants of ETC Policies: \(m\) and \(n\)#
Anytime Policy and \(n\)#
For unknown horizon \(n\), use doubling trick: Phase \(i\) has \(n_i = 2^i\) trials, with \(m_i\) adapted to \(n_i\). This ensures regret bounds hold for all \(n\).
ETC for Unknown Horizon: The Doubling Trick#
Adapting Explore-Then-Commit When \(n\) is Unknown#
Problem: Standard ETC Needs Known \(n\)#
ETC relies on knowing total rounds \(n\) to set exploration trials \(m\) (e.g., \(m \propto \log n\))
If \(n\) is unknown:
Too small \(m\) → misidentify optimal arm (high exploitation regret)
Too large \(m\) → waste rounds on exploration (high exploration regret)
Solution: The Doubling Trick#
Convert ETC into an anytime policy using phases with doubling lengths:
Phases with Doubling Lengths
Phase \(i = 1, 2, 3, ...\)
Trials per phase: \(n_i = 2^i\)
(Phase 1: 2 trials, Phase 2: 4 trials, Phase 3: 8 trials, …)
Adapt Exploration per Phase
For phase \(i\), treat \(n_i\) as current guess for \(n\)
Set \(m_i\) (exploration trials per arm) based on \(n_i\):
\(m_i \propto \frac{\log n_i}{\Delta^2}\)
(\(\Delta\) = smallest gap between best and suboptimal arms)
Run Sequential Phases
In each phase \(i\):
Explore: Try each arm \(m_i\) times
Exploit: Play best observed arm for remaining \(n_i - K \cdot m_i\) trials
Continue until experiment stops (unknown \(n\))
Example: 2-Arm Bandit (\(K=2\))#
| Phase \(i\) | \(n_i\) (trials) | \(m_i\) (exploration/arm) | Explore Trials | Exploit Trials |
|---|---|---|---|---|
| 1 | 2 | 1 | 2 (1/arm) | 0 |
| 2 | 4 | 1 | 2 (1/arm) | 2 |
| 3 | 8 | 2 | 4 (2/arm) | 4 |
If the experiment stops after 5 trials, the learner has used all of Phase 1 (2 trials) plus the first 3 trials of Phase 2
Why It Works#
Guaranteed Calibration: Whatever the unknown \(n\) is, it falls inside some phase \(i\) whose length \(n_i\) is within a constant factor of \(n\)
Balanced Regret: \(m_i\) scales with phase length to avoid over/under-exploration
Anytime Property: Works for any stopping time without prior knowledge of \(n\)
Key Takeaway#
The doubling trick transforms ETC into a robust algorithm for unknown horizons by:
Using adaptive exploration per phase
Ensuring regret remains bounded across all possible \(n\)
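A rough sketch of ETC wrapped in the doubling trick, assuming the smallest gap \(\Delta\) is known so that \(m_i\) can be set as on the slide; the names and constants are illustrative:

```python
import numpy as np

def etc_with_doubling(pull, k, total_rounds, delta):
    """Doubling trick: run a fresh ETC instance in phases of length n_i = 2^i.

    `pull(a)` samples one reward from arm a; `delta` is an assumed smallest gap,
    used only to set the per-phase exploration budget m_i ~ log(n_i) / delta^2.
    """
    phase, played, total = 1, 0, 0.0
    while played < total_rounds:
        n_i = 2 ** phase                                   # phase length doubles each time
        m_i = max(1, int(np.log(n_i) / delta ** 2))        # exploration per arm this phase
        counts, sums, best = np.zeros(k), np.zeros(k), None
        for t in range(n_i):
            if played >= total_rounds:                     # stopping time reached mid-phase
                break
            if t < m_i * k:
                a = t % k                                  # explore round-robin
            else:
                if best is None:
                    best = int(np.argmax(sums / counts))   # commit within this phase
                a = best
            x = pull(a)
            counts[a] += 1
            sums[a] += x
            total += x
            played += 1
        phase += 1
    return total

rng = np.random.default_rng(0)
means = [0.2, 0.7]
print(etc_with_doubling(lambda a: float(rng.random() < means[a]),
                        k=2, total_rounds=500, delta=0.5))
```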
Motivation of UCB Policy#
Limitation of ETC#
ETC's fixed exploration phase leads to suboptimal regret in non-stationary settings or for large \(n\).
UCB Idea: Optimism Under Uncertainty#
Select the arm with the highest upper confidence bound:
$$ \mathrm{UCB}_t(k) = \hat{\mu}_k(t) + \sqrt{\frac{2 \log t}{T_k(t)}} $$
Balances exploration (uncertain arms with large bounds) and exploitation (high empirical means).
Motivation of UCB Policy#
Algorithm 3: UCB(\(\delta\))#
For \(t = 1, ..., n\):
Choose \(A_t = \arg\max_k \mathrm{UCB}_k(t-1, \delta)\)
Observe \(X_t\) and update confidence bounds
End for
UCB adapts to data dynamically, achieving better regret bounds than ETC in many scenarios.
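A minimal sketch of the UCB index from the slide above, initialising by playing each arm once; the helper names are illustrative:

```python
import numpy as np

def ucb(pull, k, n):
    """UCB: in each round play the arm with the highest upper confidence bound."""
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for t in range(1, n + 1):
        if t <= k:
            a = t - 1                                # play each arm once to initialise
        else:
            means = sums / counts
            bonus = np.sqrt(2 * np.log(t) / counts)  # exploration bonus sqrt(2 log t / T_k(t))
            a = int(np.argmax(means + bonus))
        x = pull(a)
        counts[a] += 1
        sums[a] += x
        total += x
    return total

rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.7]
print(ucb(lambda a: float(rng.random() < means[a]), k=3, n=1000))
```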
UCB vs ETC#
| Aspect | ETC | UCB |
|---|---|---|
| Exploration | Fixed phase | Adaptive (ongoing) |
| Horizon Requirement | Known \(n\) | Works with unknown \(n\) |
| Regret Bound | \(O(\sqrt{Kn \log n})\) | \(O(\sqrt{Kn \log n})\) |
| Flexibility | Low (fixed \(m\)) | High (adapts to data) |
Key Takeaways#
Multi-armed bandits balance exploration (learning) and exploitation (maximizing reward).
Regret measures performance against the optimal arm.
Different problem formulations (stationary, non-stationary, structured) require tailored algorithms.
ETC is a simple algorithm with clear exploration-exploitation separation, achieving sublinear regret.