The exploration-exploitation trade-off in sequential decision making problems

Sykulski, Adam and Adams, Niall M. and Jennings, Nicholas R. (2011) The exploration-exploitation trade-off in sequential decision making problems. PhD thesis, Imperial College London.

Abstract

Sequential decision making problems require an agent to repeatedly choose between a series of actions. Common to such problems is the exploration-exploitation trade-off, where an agent must choose between taking the action expected to yield the best reward (exploitation) and trying an alternative action for potential future benefit (exploration). The main focus of this thesis is to understand in more detail the role this trade-off plays in various important sequential decision making problems, in terms of maximising finite-time reward. The most common and best studied abstraction of the exploration-exploitation trade-off is the classic multi-armed bandit problem. In this thesis we study several important extensions that are more suitable than the classic problem to real-world applications. These extensions include scenarios in which the rewards for actions change over time or in which the presence of other agents must be repeatedly considered. In these contexts, the exploration-exploitation trade-off has a more complicated role in terms of maximising finite-time performance. For example, in a dynamic decision problem the amount of exploration required changes constantly; in multi-agent problems, agents can explore by communicating; and in repeated games, the exploration-exploitation trade-off must be considered jointly with game-theoretic reasoning. Existing techniques for balancing exploration and exploitation are focused on achieving desirable asymptotic behaviour and are in general only applicable to basic decision problems. The most flexible state-of-the-art approaches, ε-greedy and ε-first, require exploration parameters to be set a priori, the optimal values of which are highly dependent on the problem faced. To overcome this, we construct a novel algorithm, ε-ADAPT, which has no exploration parameters and can adapt exploration on-line for a wide range of problems. ε-ADAPT is built on newly proven theoretical properties of the ε-first policy, and we demonstrate that ε-ADAPT can accurately learn not only how much to explore, but also when and which actions to explore.
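
For readers unfamiliar with the two baseline policies named in the abstract, the following is a minimal sketch of ε-greedy and ε-first on a Bernoulli multi-armed bandit. It is not taken from the thesis and it does not implement ε-ADAPT. The function names (run_bandit, greedy_arm, epsilon_greedy, epsilon_first), the Bernoulli reward model, and the convention that unplayed arms have an estimate of 0 are all assumptions made for illustration. It shows why both policies need an exploration parameter (epsilon) fixed in advance.

```python
import random


def run_bandit(means, horizon, choose, seed=0):
    """Play a Bernoulli bandit for `horizon` steps and return the total reward.

    `means` holds the (hypothetical) success probability of each arm.
    `choose` is a policy: (t, counts, totals, rng) -> arm index.
    """
    rng = random.Random(seed)
    counts = [0] * len(means)    # number of pulls per arm
    totals = [0.0] * len(means)  # summed reward per arm
    reward_sum = 0.0
    for t in range(horizon):
        arm = choose(t, counts, totals, rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        reward_sum += reward
    return reward_sum


def greedy_arm(counts, totals):
    """Return the arm with the highest empirical mean (exploitation).

    Assumption: arms that have never been played are given an estimate of 0.
    Ties are broken in favour of the lowest arm index.
    """
    estimates = [totals[i] / counts[i] if counts[i] else 0.0
                 for i in range(len(counts))]
    return max(range(len(counts)), key=lambda i: estimates[i])


def epsilon_greedy(epsilon):
    """Explore uniformly at random with probability epsilon at every step."""
    def choose(t, counts, totals, rng):
        if rng.random() < epsilon:
            return rng.randrange(len(counts))
        return greedy_arm(counts, totals)
    return choose


def epsilon_first(epsilon, horizon):
    """Explore uniformly for the first epsilon * horizon steps, then exploit."""
    explore_steps = int(epsilon * horizon)

    def choose(t, counts, totals, rng):
        if t < explore_steps:
            return rng.randrange(len(counts))
        return greedy_arm(counts, totals)
    return choose
```

A call such as run_bandit([0.2, 0.8], 1000, epsilon_first(0.1, 1000)) plays 100 exploratory steps followed by 900 greedy steps. The reward obtained depends on the chosen epsilon and on the arm means, which is the parameter sensitivity the abstract refers to.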

Item Type:

Thesis (PhD)

Departments:

Faculty of Science and Technology > Mathematics and Statistics

ID Code:

87354

Deposited By:

ep_importer_pure

Deposited On:

14 Aug 2017 10:18

Refereed?:

Published?:

Published

Last Modified:

10 Dec 2025 14:21

URI:

https://eprints.lancs.ac.uk/id/eprint/87354