Abstract
We present a continuous-time master-equation formulation of reinforcement learning. Both non-associative (stochastic learning automaton) and associative (neural network) cases are considered. A Fokker–Planck equation for the stochastic dynamics of the learning process is derived using a small-fluctuation expansion of the master equation. We then show how the Fokker–Planck approximation can be used to determine the global asymptotic behaviour of ergodic learning schemes, such as linear reward–penalty (LR−P) and associative reward–penalty (AR−P), in the limit of small learning rates. A simple example of reinforcement learning in a non-stationary environment is also studied.