Basics of Reinforcement Learning

2023-03-24

Reinforcement Learning is my favourite area of machine learning, and in this post I will teach you the basics of Reinforcement Learning (RL): the core notation and the standard formulations you’ll see everywhere. Let’s get started!

First, the general RL setting. In RL, we have an agent that interacts with an environment, and the goal of the agent is to collect as much reward as possible. At each time step, the agent observes the current state of the environment, performs an action and receives a reward for that action. The figure below illustrates this.

*Figure: the RL setting (the agent–environment interaction loop).*

When the agent interacts with the environment, it generates a so-called trajectory. That trajectory is a list of (state, action, reward) tuples, i.e.:

$$\tau = ((s_0, a_0, r_0), \ldots, (s_i, a_i, r_i), \ldots, (s_T, a_T, r_T))$$

A trajectory runs from timestep $t = 0$ until $T$, the terminal timestep (i.e. the last timestep of the episode).
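To make this concrete, here’s a minimal sketch of collecting one trajectory, assuming a Gymnasium-style environment (CartPole-v1 here) and, just for illustration, a random policy:

```python
import gymnasium as gym

def collect_trajectory(env):
    """Roll out one episode and return the trajectory as a list of (s, a, r) tuples."""
    trajectory = []
    state, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()          # random policy, just for illustration
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))  # (s_t, a_t, r_t)
        state = next_state
        done = terminated or truncated
    return trajectory

env = gym.make("CartPole-v1")
tau = collect_trajectory(env)
print(f"episode length (T + 1) = {len(tau)}")
```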

Given a trajectory, you can compute its return: how much reward did you collect over that trajectory? There are two common types of return: the undiscounted return and the discounted return. The former is simply the sum of the rewards over all timesteps:

$$R(\tau) = \sum_{t=0}^{T} r_t$$

This is also called the finite-horizon undiscounted return. In the other case, you discount the rewards by some factor $\gamma \in [0, 1)$, and the result is referred to as the infinite-horizon discounted return. It’s defined as

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$$
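As a quick sketch, both returns can be computed directly from the rewards along a trajectory (the function names are mine, just for illustration):

```python
def undiscounted_return(rewards):
    """Finite-horizon undiscounted return: sum of all rewards."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """Discounted return: sum of gamma^t * r_t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]
print(undiscounted_return(rewards))      # 4.0
print(discounted_return(rewards, 0.9))   # 1.0 + 0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```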

At each timestep, the agent sees the current state and decides what action to take. The policy can be either deterministic or stochastic. If it’s deterministic, it’s written as

$$a_t = \mu_\theta(s_t)$$

and if it’s stochastic, then the action is sampled from a distribution, i.e.

$$a_t \sim \pi_\theta(\cdot \mid s_t)$$

The $\theta$ is there to emphasise that these policies, whether deterministic or stochastic, are usually parameterised functions, such as neural networks, and $\theta$ represents their parameters.

The action itself can be either discrete (e.g. press button A) or continuous (e.g. actuate a motor by 4.2). If the actions are discrete, the logits produced by the policy can be passed through a softmax to turn them into probabilities, which are then sampled from. If the actions are continuous, the policy usually outputs the parameters of a Gaussian distribution (i.e. its mean and standard deviation), and you sample your action from that Gaussian.
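Here’s a rough sketch of both cases using PyTorch’s distribution classes; the random tensors are just stand-ins for what a policy network would output:

```python
import torch
from torch.distributions import Categorical, Normal

# Discrete: the policy outputs one logit per action; the softmax and the
# sampling are handled by Categorical.
logits = torch.randn(4)                  # pretend these came from a policy network
discrete_action = Categorical(logits=logits).sample()

# Continuous: the policy outputs a mean and a standard deviation, and the
# action is drawn from the resulting Gaussian.
mean = torch.zeros(2)                    # e.g. a 2-dimensional motor command
std = torch.ones(2)
continuous_action = Normal(mean, std).sample()

print(discrete_action.item(), continuous_action)
```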

Let’s go back to our trajectory. What are the odds of getting a particular trajectory? It depends on the current policy of the agent of course, so let’s write that down:

$$P(\tau \mid \pi_\theta) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$

The equation above can be read as “the probability of getting trajectory $\tau$ given $\pi_\theta$ equals the probability of the initial state $s_0$, times, for every timestep, the probability of reaching state $s_{t+1}$ given state $s_t$ and action $a_t$, times the probability of picking action $a_t$ given the current state $s_t$.” As a useful shorthand, we can write $s' \sim P(\cdot \mid s, a)$, which reads as sampling the next state from the environment, given the current state and action.
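For a small tabular MDP you can compute this probability directly. A minimal numpy sketch, where `p0`, `P` and `pi` are made-up tables standing in for the initial-state distribution, the transition probabilities and the policy:

```python
import numpy as np

# Toy tabular MDP: 2 states, 2 actions (all numbers are invented).
p0 = np.array([0.8, 0.2])                    # p0(s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],                   # pi[s, a] = pi(a | s)
               [0.4, 0.6]])

def trajectory_prob(states, actions):
    """P(tau | pi) = p0(s0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)."""
    prob = p0[states[0]]
    for t in range(len(actions)):
        s, a, s_next = states[t], actions[t], states[t + 1]
        prob *= pi[s, a] * P[s, a, s_next]
    return prob

print(trajectory_prob(states=[0, 1, 1], actions=[1, 0]))
```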

With this, we can now define our objective function:

$$J(\pi_\theta) = \int_{\mathcal{T}} P(\tau \mid \pi_\theta)\, R(\tau)\, d\tau$$

The objective function is also called the expected return:

$$J(\pi_\theta) = \int_{\mathcal{T}} P(\tau \mid \pi_\theta)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

And we’re trying to maximise the expected return! By the way, the best possible policy (and also the best value functions - we’ll get to those) is written with a star, i.e. $\pi^*$.
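Written as an optimisation problem, finding the best policy then simply means maximising this objective:

$$\pi^* = \arg\max_{\pi} J(\pi)$$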

There are two more questions to answer: how much return can we expect if we start in a state $s$ and follow the policy, and how much return can we expect if we start in state $s$, take an arbitrary action $a$, and only then follow the policy?

The former is answered by the on-policy value function $V^\pi(s)$:

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$$

It’s called on-policy because the agent keeps following the same policy $\pi$ for the whole trajectory, i.e. it never “leaves” the policy. The latter question is answered by the on-policy action-value function $Q^\pi(s, a)$:

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$$

Like the optimal policy, both value functions have an optimal counterpart, i.e.

$$V^*(s) = \max_\pi \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$$

(which is the maximum expected return over all policies $\pi$, given the initial state $s$)

and

$$Q^*(s, a) = \max_\pi \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$$

(which is the maximum expected return for taking action $a$ in state $s$ and thereafter following the best policy $\pi^*$)

The optimal action-value function is a bit more useful, because we can simply do this to get the optimal action:

$$a^*(s) = \arg\max_a Q^*(s, a)$$
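As a tiny illustration with a tabular $Q^*$ (the numbers are invented), picking the optimal action is a single argmax per state:

```python
import numpy as np

# Made-up optimal action-values for 3 states and 2 actions: Q_star[s, a].
Q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 4.2]])

def optimal_action(s):
    """a*(s) = argmax_a Q*(s, a)."""
    return np.argmax(Q_star[s])

print([optimal_action(s) for s in range(3)])   # [1, 0, 1]
```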

The last bit of RL basics are the Bellman equations. They express a recursive property of the value functions: the value of a state is the immediate reward plus the discounted value of the next state. In other words:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s, a) + \gamma V^\pi(s')\right]$$

$$V^*(s) = \max_a \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma V^*(s')\right]$$

And the same goes for the action-value function:

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}[Q^\pi(s', a')]\right]$$

$$Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma \max_{a'} Q^*(s', a')\right]$$
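The Bellman optimality equation for $Q^*$ is exactly what value iteration turns into an update rule. Here’s a minimal sketch on a made-up tabular MDP (the tables `P` and `R` are invented for illustration):

```python
import numpy as np

# Toy MDP: 2 states, 2 actions (all numbers invented).
# P[s, a, s'] = P(s' | s, a), R[s, a] = r(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Repeatedly apply the Bellman optimality backup:
# Q(s, a) <- r(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')
Q = np.zeros((2, 2))
for _ in range(500):
    Q = R + gamma * P @ Q.max(axis=1)

V_star = Q.max(axis=1)    # V*(s) = max_a Q*(s, a)
print(Q)
print(V_star)
```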

The last important piece is the advantage function. It describes how much better it is to pick action $a$ in state $s$ compared to acting according to the policy on average (which is exactly what $V^\pi(s)$ measures). It’s defined as

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
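With a tabular $Q^\pi$ and a policy table, the advantage is just a subtraction, using $V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$. A small sketch with invented numbers:

```python
import numpy as np

# Made-up action-values Q[s, a] and policy pi[s, a] for 2 states, 2 actions.
Q = np.array([[1.0, 3.0],
              [0.5, 0.5]])
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])

V = (pi * Q).sum(axis=1)    # V^pi(s) = sum_a pi(a | s) * Q^pi(s, a)
A = Q - V[:, None]          # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
print(A)                    # state 0: [-1.0, +1.0], state 1: [0.0, 0.0]
```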

And those are the basics of RL. In the next blog post, we will go one step further and implement REINFORCE (a policy gradient algorithm) and actually watch some agents in action! Stay tuned!