6. Thompson Sampling#


Multi-armed Bandits: Core Concepts and Algorithms#

  • UCB Algorithm

  • Bayesian Learning Framework

  • Thompson Sampling


Fundamentals of Multi-armed Bandits#

  • Problem Definition: Sequential decision-making model with multiple “arms” (options/strategies)

  • Goal: Maximize cumulative reward (or minimize “regret”)

  • Key Challenge: Exploration-Exploitation Tradeoff

    • Exploitation: Choose arms with highest observed reward

    • Exploration: Try other arms to gain more information
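To make the model concrete, here is a minimal bandit environment sketch in Python; the class name, arm means, and unit-variance Gaussian rewards are illustrative assumptions, not fixed by these notes (the Gaussian case is used later for Thompson Sampling).

```python
import numpy as np

class GaussianBandit:
    """K-armed bandit: arm i pays rewards drawn from N(means[i], 1)."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)  # hidden from the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        """Play one arm and observe a noisy reward."""
        return self.rng.normal(self.means[arm], 1.0)

# Example instance: three arms, arm 2 optimal (mu* = 0.9).
bandit = GaussianBandit([0.2, 0.5, 0.9])
reward = bandit.pull(2)
```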


UCB (Upper Confidence Bound) Algorithm#

  • Balances exploration and exploitation using confidence intervals

  • Core idea: Select arm with highest upper confidence bound of reward mean

    • Considers current average reward (exploitation)

    • Accounts for estimation uncertainty (exploration)


UCB Decision Rule#

At round \(t\), select arm:

$$A_t = \arg\max_i \left( \hat{\mu}_i(t-1) + \sqrt{\frac{4 \log n}{T_i(t-1)}} \right)$$

  • \(A_t\): Selected arm at round \(t\)

  • \(\hat{\mu}_i(t-1)\): Average reward of arm \(i\) over the first \(t-1\) rounds

  • \(T_i(t-1)\): Number of times arm \(i\) was selected in the first \(t-1\) rounds

  • \(n\): Total number of rounds

  • Second term: Uncertainty penalty (larger for less explored arms)
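The rule translates directly into code. A sketch, assuming every arm has already been pulled at least once so each \(T_i \geq 1\) (function and variable names are mine, not from the notes):

```python
import numpy as np

def ucb_select(sums, counts, n):
    """Return the arm maximizing mu_hat_i + sqrt(4 log n / T_i).

    sums[i]   -- total reward collected from arm i so far
    counts[i] -- T_i, times arm i was pulled (assumed >= 1)
    n         -- total number of rounds
    """
    mu_hat = sums / counts                     # exploitation term
    bonus = np.sqrt(4.0 * np.log(n) / counts)  # exploration term
    return int(np.argmax(mu_hat + bonus))
```

Rarely pulled arms receive a large bonus, so the deterministic argmax still explores.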


Regret Analysis for UCB#

Regret is defined as:

$$R(n) = \sum_{t=1}^n (\mu^* - \mu_{A_t})$$

  • \(\mu^*\): True mean of optimal arm (\(\mu^* = \max_i \mu_i\))

  • \(\mu_{A_t}\): True mean of selected arm at round \(t\)


Suboptimal Arm Contribution#

For suboptimal arm \(i\) (\(\mu_i < \mu^*\)), define gap \(\Delta_i = \mu^* - \mu_i\)

Since each pull of arm \(i\) costs regret \(\Delta_i\), regret decomposes as \(R(n) = \sum_i \Delta_i \, \mathbb{E}[T_i(n)]\), giving the UCB regret upper bound:

$$R(n) \leq \sum_{i: \Delta_i > 0} \frac{16 \log n}{\Delta_i}$$


Bound Justification#

  • Intuition: Limit selection frequency of suboptimal arms

  • When \(T_i\) (selection count) satisfies: \(2\sqrt{\frac{4 \log n}{T_i}} \leq \Delta_i\)

  • Suboptimal arm \(i\) will then no longer be selected

  • Solving the inequality (worked out below) shows arm \(i\) is selected at most \(T_i \leq \frac{16 \log n}{\Delta_i^2}\) times
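One step left implicit above: squaring both sides turns the condition into a threshold on the selection count,

$$2\sqrt{\frac{4 \log n}{T_i}} \leq \Delta_i \quad\Longleftrightarrow\quad \frac{16 \log n}{T_i} \leq \Delta_i^2 \quad\Longleftrightarrow\quad T_i \geq \frac{16 \log n}{\Delta_i^2}$$

so once arm \(i\) has been pulled that many times, its confidence interval is too narrow for it to overtake the optimal arm.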


Asymptotically Optimal UCB#

For large \(n\), regret approaches:

$$R(n) \approx \sum_{i: \Delta_i > 0} \frac{2 \log n}{\Delta_i}$$

  • Achieves theoretical lower bound for multi-armed bandits

  • Near-ideal performance as \(n \to \infty\)


Bayesian Learning Framework#

  • Contrasts with frequentist UCB approach

  • Models uncertainty through probability distributions

  • Updates beliefs using observed data

  • Selects strategies to minimize expected loss


Uncertainty Components#

  • Environment: Unknown scenario (\(v\)) - e.g., reward distributions of arms

  • Policy: Decision rule (\(\pi\)) - e.g., arm selection method

  • Loss Function: Measures policy performance in environment


Policy Properties#

  • Dominance: Policy \(\pi_1\) is dominated by \(\pi_2\) if:

    • \(\pi_1\) has greater or equal loss in all environments

    • \(\pi_1\) has strictly greater loss in at least one environment

  • Admissibility (Pareto Optimality):

    • Not dominated by any other policy

    • Minimizes loss in at least one environment


Bayesian Decision Core Idea#

  • Prior distribution \(q(v)\): Initial belief about environment \(v\)

  • Choose policy minimizing expected loss: \(\pi_{\text{Bayes}} = \arg\min_{\pi} \mathbb{E}_v [\text{loss}(\pi, v)]\)

  • Expectation \(\mathbb{E}_v\) computed using prior \(q(v)\)


Sequential Bayesian Learning#

Update beliefs after each observation:

$$p(v \mid \text{data}) = \frac{p(\text{data} \mid v) \cdot q(v)}{p(\text{data})}$$

  • \(p(\text{data} | v)\): Likelihood of data under environment \(v\)

  • \(p(\text{data})\): Total probability (normalization constant)

  • Updates from prior to posterior distribution with new data


Thompson Sampling Algorithm#

  • Bayesian approach to multi-armed bandits

  • Balances exploration and exploitation via stochastic sampling

  • Selects arm with highest sample from posterior distributions


Core Idea#

  • Maintain posterior distribution \(F_i(t)\) for each arm \(i\)’s mean \(\mu_i\)

  • Sample \(\hat{\mu}_i^{(s)}\) from each \(F_i(t)\)

  • Select arm with maximum sample: \(A_t = \arg\max_i \hat{\mu}_i^{(s)}\)

  • Update posterior of selected arm using observed reward


Algorithm Steps#

  1. Input: Prior cumulative distribution functions (CDFs) \(F_i(0)\) for each arm

  2. For each round \(t=1,2,...,n\):

    • Sample \(\hat{\mu}_i^{(s)}\) from \(F_i(t-1)\) for each arm \(i\)

    • Select \(A_t = \arg\max_i \hat{\mu}_i^{(s)}\)

    • Observe reward \(X_t\) and update: \(F_{A_t}(t) = \text{UPDATE}(F_{A_t}(t-1), X_t)\)

  3. Repeat until completion
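A compact Python sketch of this loop, specialized to the Gaussian posteriors derived on the next slide; the wide prior standard deviation for unpulled arms is an illustrative choice:

```python
import numpy as np

def thompson_sampling(pull, K, n, seed=0, prior_std=10.0):
    """Thompson Sampling for N(mu_i, 1) rewards with a flat prior.

    pull(arm) -- environment callback returning one reward
    K, n      -- number of arms, number of rounds
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)  # T_i(t)
    sums = np.zeros(K)    # running reward totals
    for t in range(n):
        # Posterior of mu_i is N(mu_hat_i, 1/T_i); unpulled arms keep
        # a wide prior so each gets sampled at least once early on.
        safe = np.maximum(counts, 1)
        mu_hat = np.where(counts > 0, sums / safe, 0.0)
        std = np.where(counts > 0, 1.0 / np.sqrt(safe), prior_std)
        samples = rng.normal(mu_hat, std)  # one draw per arm
        arm = int(np.argmax(samples))      # play the largest sample
        x = pull(arm)                      # observe reward X_t
        counts[arm] += 1                   # posterior update: new T_i
        sums[arm] += x                     # and new sample mean
    return sums, counts
```

Paired with the `GaussianBandit` sketch above, `thompson_sampling(bandit.pull, K=3, n=1000)` runs the full algorithm.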


Update Rule (Gaussian Example)#

For rewards \(\sim \mathcal{N}(\mu_i, 1)\):

  • After \(T_i(t)\) selections with rewards \(x_1,...,x_{T_i(t)}\)

  • Sample mean: \(\hat{\mu}_i = \frac{1}{T_i(t)} \sum_{k=1}^{T_i(t)} x_k\)

  • Posterior distribution: \(\mathcal{N}(\hat{\mu}_i, \frac{1}{T_i(t)})\)

  • Variance decreases with more selections (increased certainty)
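Where this posterior comes from (the notes implicitly assume a flat prior on \(\mu_i\)): multiplying the unit-variance Gaussian likelihoods and completing the square in \(\mu_i\) gives

$$p(\mu_i \mid x_1, \dots, x_{T_i}) \propto \prod_{k=1}^{T_i} \exp\left(-\frac{(x_k - \mu_i)^2}{2}\right) \propto \exp\left(-\frac{T_i (\mu_i - \hat{\mu}_i)^2}{2}\right)$$

which is exactly the density of \(\mathcal{N}(\hat{\mu}_i, \frac{1}{T_i})\).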


Performance Analysis#

  • Asymptotic Optimality in Gaussian bandits: \(R(n) \approx \sum_{i: \Delta_i > 0} \frac{2 \log n}{\Delta_i}\)

    • Matches theoretical lower bound

  • Characteristics:

    • Exploration through random sampling

    • Often outperforms UCB in experiments

    • Higher performance variance (more result fluctuation)


Supplementary Derivations#

Gaussian Sample Mean Properties#

For i.i.d. Gaussian variables \(x_1, \dots, x_t \sim \mathcal{N}(\mu, \sigma^2)\):

  • Sample mean: \(\hat{\mu}(t) = \frac{x_1+...+x_t}{t}\)

  • Expectation: \(\mathbb{E}[\hat{\mu}(t)] = \mu\) (unbiased)

  • Variance: \(\text{Var}(\hat{\mu}(t)) = \frac{\sigma^2}{t}\) (decreases with \(t\))
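The variance line follows from independence:

$$\text{Var}(\hat{\mu}(t)) = \frac{1}{t^2} \sum_{k=1}^{t} \text{Var}(x_k) = \frac{t \sigma^2}{t^2} = \frac{\sigma^2}{t}$$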


Bayesian Formula Application#

Example: A bag contains two balls, either both white (WW) or one white and one black (WB)

  • Prior: \(P(WW) = P(WB) = 0.5\)

  • After observing “white ball”:

    • Likelihoods: \(P(\text{white}|WW)=1\), \(P(\text{white}|WB)=0.5\)

    • Total probability: \(P(\text{white}) = 0.5 \cdot 1 + 0.5 \cdot 0.5 = 0.75\)

    • Posterior: \(P(WW | \text{white}) = \frac{1 \cdot 0.5}{0.75} = \frac{2}{3}\)
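The same computation as a few lines of Python, just to check the arithmetic (variable names are mine):

```python
prior = {"WW": 0.5, "WB": 0.5}        # initial belief about the bag
likelihood = {"WW": 1.0, "WB": 0.5}   # P(draw white | hypothesis)

p_white = sum(prior[h] * likelihood[h] for h in prior)  # = 0.75
posterior = {h: prior[h] * likelihood[h] / p_white for h in prior}
# posterior == {'WW': 0.666..., 'WB': 0.333...}
```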


Summary#

| Aspect | UCB Algorithm | Thompson Sampling |
| --- | --- | --- |
| Approach | Frequentist (confidence intervals) | Bayesian (posterior sampling) |
| Selection | Deterministic (max upper bound) | Stochastic (max sample) |
| Regret | Provable bounds, stable performance | Asymptotically optimal, better average performance |
| Variance | Lower | Higher |
| Use Case | Stability-critical scenarios | Average performance priority |

Both balance exploration-exploitation to minimize regret
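To see the comparison empirically, here is a self-contained sketch that runs both algorithms on the same Gaussian instance; the arm means, horizon, and seeds are arbitrary choices, not from the notes:

```python
import numpy as np

def run(policy, means, n, seed):
    """Play one bandit episode; return cumulative (pseudo-)regret."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, n + 1):
        if t <= K:                        # initialize: pull each arm once
            arm = t - 1
        else:
            arm = policy(sums, counts, n, rng)
        counts[arm] += 1
        sums[arm] += rng.normal(means[arm], 1.0)
        regret += max(means) - means[arm]
    return regret

def ucb(sums, counts, n, rng):
    """Deterministic: maximize the upper confidence bound."""
    return int(np.argmax(sums / counts + np.sqrt(4 * np.log(n) / counts)))

def thompson(sums, counts, n, rng):
    """Stochastic: maximize one draw from each posterior N(mu_hat, 1/T)."""
    return int(np.argmax(rng.normal(sums / counts, 1.0 / np.sqrt(counts))))

means = [0.2, 0.5, 0.9]
for name, policy in [("UCB", ucb), ("Thompson", thompson)]:
    regrets = [run(policy, means, n=5000, seed=s) for s in range(10)]
    print(f"{name}: mean regret {np.mean(regrets):.1f}, "
          f"std {np.std(regrets):.1f}")
```

Consistent with the table, one would typically expect Thompson Sampling to show lower mean regret but a larger spread across seeds.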