Multi-armed bandit


A multi-armed bandit, also sometimes called a K-armed bandit, is a simple machine learning problem based on an analogy with a traditional slot machine (one-armed bandit) but with more than one lever. When pulled, each lever provides a reward drawn from a distribution associated with that specific lever. The objective of the gambler is to maximize the sum of rewards collected through iterative pulls. It is classically assumed that the gambler has no initial knowledge about the levers. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the lever that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other levers.

Empirical motivation

The multi-armed bandit problem, originally described by Robbins in 1952, is a simple model of an agent that simultaneously attempts to acquire new knowledge and to optimize its decisions based on existing knowledge. Practical examples include clinical trials where the effects of different experimental treatments need to be investigated while minimizing patient losses, and adaptive routing efforts for minimizing delays in a network. The questions arising in these cases are related to the problem of balancing reward maximization based on the knowledge already acquired with attempting new actions to further increase knowledge. This is known as the exploitation vs. exploration tradeoff in reinforcement learning.

The multi-armed bandit model

The multi-armed bandit (bandit for short) can be seen as a set of real distributions B = \{R_1, \dots, R_K\}, each distribution being associated with the rewards delivered by one of the K levers. Let \mu_1, \dots, \mu_K be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards. The horizon H is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov decision process. The regret \rho after T rounds is defined as the difference between the reward sum associated with an optimal strategy and the sum of the collected rewards: \rho = T \mu^* - \sum_{t=1}^T \widehat{r}_t, where \mu^* is the maximal reward mean, \mu^* = \max_k \{\mu_k\}, and \widehat{r}_t is the reward at time t. A strategy whose average regret per round \rho / T tends to zero with probability 1 when the number of played rounds tends to infinity is a zero-regret strategy. Intuitively, zero-regret strategies are guaranteed to converge to an optimal strategy, not necessarily unique, if enough rounds are played.
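The regret definition can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the levers are taken to pay Bernoulli rewards (1 with probability equal to the lever's mean, otherwise 0), and the function names regret and play_uniform are hypothetical, not taken from any library.

```python
import random


def regret(means, rewards):
    """Regret after T rounds: T * mu_star minus the sum of collected rewards."""
    mu_star = max(means)  # maximal reward mean over all levers
    return len(rewards) * mu_star - sum(rewards)


def play_uniform(means, rounds, seed=0):
    """Pull a uniformly random lever each round and return the observed rewards.

    Assumption: each lever k pays 1.0 with probability means[k], else 0.0.
    """
    rng = random.Random(seed)
    rewards = []
    for _ in range(rounds):
        k = rng.randrange(len(means))
        rewards.append(1.0 if rng.random() < means[k] else 0.0)
    return rewards


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]
    rewards = play_uniform(means, rounds=1000)
    print("average regret per round:", regret(means, rewards) / len(rewards))
```

For example, four rounds on levers with means 0.2, 0.5 and 0.8 that yield rewards 1, 0, 1, 1 give a regret of 4 × 0.8 − 3 = 0.2. A purely uniform player such as play_uniform is not expected to be zero-regret, since it never concentrates its pulls on the best lever.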

Common bandit strategies

Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the three broad categories detailed below.

Semi-uniform strategies

Semi-uniform strategies were the earliest (and simplest) strategies discovered to approximately solve the bandit problem. All of these strategies share a greedy behavior: the best lever (based on previous observations) is always pulled, except when a (uniformly) random action is taken.

  • Epsilon-greedy strategy: The best lever is selected for a proportion 1 − ε of the trials, and another lever is randomly selected (with uniform probability) for a proportion ε. A typical parameter value might be ε = 0.1, but this can vary widely depending on circumstances and predilections.
  • Epsilon-first strategy: A pure exploration phase is followed by a pure exploitation phase. For N trials in total, the exploration phase occupies εN trials and the exploitation phase (1 − ε)N trials. During the exploration phase a lever is randomly selected (with uniform probability); during the exploitation phase the best lever is always selected.
  • Epsilon-decreasing strategy: Similar to the epsilon-greedy strategy, except that the value of ε decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish.
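The three semi-uniform strategies differ only in how often they take a random action, which the following Python sketch tries to make explicit. Several details are assumptions rather than part of the descriptions above: rewards are Bernoulli, the "best" lever is the one with the highest empirical mean so far, a random action picks uniformly among all levers (including the current best), and the epsilon-decreasing schedule shrinks ε linearly to zero over the run. The names explore_probability and run are hypothetical.

```python
import random


def explore_probability(strategy, t, total_rounds, epsilon):
    """Probability of taking a uniformly random action at round t (0-based)."""
    if strategy == "greedy":
        # Epsilon-greedy: constant exploration rate.
        return epsilon
    if strategy == "first":
        # Epsilon-first: explore during the first epsilon * N rounds only.
        return 1.0 if t < epsilon * total_rounds else 0.0
    if strategy == "decreasing":
        # Epsilon-decreasing: assumed linear decay from epsilon down to 0.
        return epsilon * (1.0 - t / total_rounds)
    raise ValueError(f"unknown strategy: {strategy}")


def run(strategy, means, total_rounds, epsilon, seed=0):
    """Simulate one run; return (total reward, number of pulls per lever)."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    estimates = [0.0] * len(means)  # empirical mean reward of each lever
    total = 0.0
    for t in range(total_rounds):
        if rng.random() < explore_probability(strategy, t, total_rounds, epsilon):
            k = rng.randrange(len(means))  # random action
        else:
            k = max(range(len(means)), key=lambda i: estimates[i])  # greedy action
        reward = 1.0 if rng.random() < means[k] else 0.0
        counts[k] += 1
        estimates[k] += (reward - estimates[k]) / counts[k]  # incremental mean
        total += reward
    return total, counts


if __name__ == "__main__":
    for name in ("greedy", "first", "decreasing"):
        print(name, run(name, [0.2, 0.5, 0.8], total_rounds=1000, epsilon=0.1))
```

With this linear schedule, the epsilon-decreasing variant only starts out "highly explorative" if the initial ε is large (for instance 1.0); other decay schedules are equally plausible.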

Probability matching strategies

Probability matching strategies reflect the idea that the number of pulls for a given lever should match its actual probability of being the optimal lever.
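One well-known strategy in this family is Thompson sampling, which the text above does not name; it is used here only as an illustrative instance. The sketch assumes Bernoulli rewards and a uniform Beta(1, 1) prior on each lever's mean. At every round it draws one sample per lever from that lever's posterior and pulls the lever with the largest sample, so each lever is chosen with the posterior probability that it is the best one. The function name thompson_sampling is hypothetical.

```python
import random


def thompson_sampling(means, rounds, seed=0):
    """Simulate Thompson sampling; return (successes, failures) per lever.

    Assumptions: lever k pays 1 with probability means[k], else 0, and each
    lever's unknown mean starts with a uniform Beta(1, 1) prior.
    """
    rng = random.Random(seed)
    successes = [0] * len(means)
    failures = [0] * len(means)
    for _ in range(rounds):
        # One posterior draw per lever: Beta(successes + 1, failures + 1).
        samples = [
            rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)
        ]
        k = max(range(len(means)), key=lambda i: samples[i])
        if rng.random() < means[k]:
            successes[k] += 1
        else:
            failures[k] += 1
    return successes, failures


if __name__ == "__main__":
    s, f = thompson_sampling([0.2, 0.5, 0.8], rounds=1000)
    print("pulls per lever:", [a + b for a, b in zip(s, f)])
```

As the posteriors sharpen, the share of pulls given to a lever tracks the current belief that it is optimal, which is the matching behavior described above.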

Pricing strategies

Pricing strategies establish a price for each lever. The lever with the highest price is always pulled.

References

  • H. Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, volume 58, pages 527–535, 1952.
  • Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. (available online)
  • Bandit project (bandit.sourceforge.net), open source implementation of many bandit strategies.

See Also

  • Gittins index: a powerful, general strategy for analyzing bandit problems.
