Reinforcement Learning / MDP (CS221)

Summary Table

Topic Algorithm Input Output
MDP Policy Evaluation $S$, $A$, $T$, $R$, $\pi$ $V_\pi$, $Q_\pi$
MDP Policy Iteration / Value Iteration $S$, $A$, $T$, $R$ $\pi_\Mr{opt}$, $V_\Mr{opt}$, $Q_\Mr{opt}$
RL Model-free Monte Carlo / SARSA $S$, $A$, $\pi$, simulation by $\pi$ $V_\pi$, $Q_\pi$
RL Model-based Monte Carlo $S$, $A$, simulation by any $\pi$ $T$, $R$ (becomes MDP → can now find $\pi_\Mr{opt}$, $V_\Mr{opt}$, $Q_\Mr{opt}$)
RL Q-Learning $S$, $A$, simulation by any $\pi$ $\pi_\Mr{opt}$, $V_\Mr{opt}$, $Q_\Mr{opt}$

Markov Decision Process


Definition A Markov decision process (MDP) consists of

  • $S$ = states, with start state $s_{\text{start}}\in S$
  • $A(s)$ = actions from state $s$
  • $T(s,a,s')$ = probability of $s'$ if take action $a$ in state $s$
  • $0 \leq \gamma \leq 1$ = discount factor
  • $R(s,a,s')$ = reward of the transition $(s,a,s')$
  • $\Mr{IsEnd}(s)$ = whether end of game

Definition A policy $\pi$ is a deterministic map from each state $s\in S$ to an action $a\in A$.

Definition The utility of a random path generated by $\pi$ is the discounted sum of the rewards: for a path (episode) $$s_0, a_1, r_1, s_1, a_2, r_2, s_2, \dots$$ the utility is $$u = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$$ The value of $\pi$ is the expected utility.

Definition For a given $\pi$,

  • $V_\pi(s)$ (value) = expected utility by following $\pi$ from state $s$ (The term "value" is overloaded.)
  • $Q_\pi(s,a)$ (Q-value) = expected utility by taking action $a$ from state $s$ and then following $\pi$

Policy Evaluation

Goal Given an MDP $(S,A,T,R)$ and a policy $\pi$, compute $V_\pi(\cdot)$

Recurrence $$V_\pi(s) = \cases{ 0 & \Mr{IsEnd}(s) \\ Q_\pi(s,\pi(s)) & \cotherw}$$ $$Q_\pi(s,a) = \sum_{s'}T(s,a,s')\crab{R(s,a,s') + \gamma V_\pi(s')}$$

Algorithm (Policy Evaluation)

  • Initialize $V_\pi(s) \gets 0$ for all $s$
  • Until convergence:
    • For each $s \in S$, Update $$V_\pi(s) \gets \sum_{s'}T(s,\pi(s),s')\crab{R(s,\pi(s),s') + \gamma V_\pi(s')}$$ where the term in the sum is just the current estimate of $Q_\pi(s,\pi(s))$.

Policy Optimization

Goal Given an MDP $(S,A,T,R)$, find a policy $\pi$ that maximizes the value.

We give 2 algorithms: Policy Iteration and Value Iteration

Algorithm (Policy Iteration) Update $\pi$ directly.

  • Initialize $\pi$ arbitrarily
  • Until convergence:
    • Run Policy Evaluation to compute $V_\pi$
    • Compute $Q_\pi$ from $V_\pi$ using the recurrence
    • For each $s\in S$, update $$\pi(s) \gets \argmax_{a\in A(s)} Q_\pi(s,a)$$

Algorithm (Value Iteration) Optimizes the optimal value $V_\Mr{opt}(s)$, and then derive $\pi_\Mr{opt}$ from $V_\Mr{opt}(s)$.

  • Initialize $V_\Mr{opt}(s) = 0$ for all $s$
  • Until convergence:
    • For each state $s$, update $$V_\Mr{opt}(s) = \max_{a\in A(s)} \sum_{s'}T(s,a,s')\crab{R(s,a,s') + \gamma V_\Mr{opt}(s')}$$ where the term inside max is just the current estimate of $Q_\Mr{opt}(s,a)$.

Theorem If either $\gamma < 1$ or MDP graph is acyclic, then both algorithms above will converge to the correct answer. (These are sufficient but not necessary conditions.)

Reinforcement Learning


If we don't know $T$ and $R$ (i.e., we know only $S$ and $A$), then we get reinforcement learning.

Since we don't know $T$ and $R$, we have to "learn" them by having an agent interacting with the MDP. The general algorithm template is

  • For each iteration $t$:
    • Let the agent choose an action $a_t = \pi_\Mr{act}(s_{t-1})$ somehow
    • Receive reward $r_t$ and new state $s_t$
    • Update parameters somehow

On-Policy Methods

Goal Given simulations $$s_0, a_1, r_1, s_1, a_2, r_2, s_2, \dots, a_n, r_n, s_n$$ from a specific policy $\pi$, estimate $Q_\pi(s,a)$.

One way is to compute the expected utility from the simulations: $$u_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$$ $$\hat Q_\pi(s,a) = \text{average of } u_t \text{ where } (s_{t-1}, a_t) = (s,a)$$ Equivalently, we can use the following algorithm to compute $\hat Q_\pi(s,a)$:

Algorithm (Model-free Monte Carlo)

  • For each $(s,a,u)$ from the simulation:
    • Let $$\eta = \frac{1}{1 + \text{number of updates to }(s,a)}$$
    • Update using interpolation $$\hat Q_\pi(s,a) \gets (1-\eta)\hat Q_\pi(s,a) + \eta u$$ or equivalently, update using "gradient" update $$\hat Q_\pi(s,a) \gets \hat Q_\pi(s,a) - \eta\crab{\hat Q_\pi(s,a) - u}$$

In Model-free Monte Carlo, we have to compute $u$, which is expensive. As an alternative, SARSA approximates $u$ using the current reward plus the estimate of future reward.

Algorithm (SARSA)

  • For each $(s,a,r,s',a')$ from the simulation:
    • Let $$\eta = \frac{1}{1 + \text{number of updates to }(s,a)}$$
    • Update using interpolation $$\hat Q_\pi(s,a) \gets (1-\eta)\hat Q_\pi(s,a) + \eta\crab{r + \gamma \hat Q_\pi(s',a')}$$

Off-Policy Methods

Goal Given simulations $$s_0, a_1, r_1, s_1, a_2, r_2, s_2, \dots, a_n, r_n, s_n$$ from some policy, find the optimal policy and values.

There are 2 ways to do this. One way is model-based Monte Carlo, which recovers the MDP before computing $\pi_\Mr{opt}$ and $V_\Mr{opt}$. The $T$ and $R$ of the MDP can be approximated as $$\hat T(s,a,s') = \frac{\Mr{count}(s,a,s')}{\Mr{count}(s,a)}$$ $$\hat R(s,a,s') = \text{average of } r \text{ in sequences } (s,a,r,s')$$ Then we can use value iteration or policy iteration to find the optimal policy and values.

The other way, Q-learning, approximates the optimal values directly without finding $T$ and $R$.

Algorithm (Q-Learning)

  • For each $(s,a,r,s')$:
    • Update $$\hat Q_\Mr{opt}(s,a) \gets (1-\eta)\hat Q_\Mr{opt}(s,a) + \eta\crab{r + \gamma \hat V_\Mr{opt}(s')}$$ where $$\hat V_\Mr{opt}(s') = \max_{a'\in A(s')} \hat Q_\Mr{opt}(s',a')$$

Again, $V_\Mr{opt}$ and $\pi_\Mr{opt}$ can be recovered from $Q_\Mr{opt}$.

Exploration vs. Exploitation

Consider the policy $\pi_\Mr{act}$, which is used to generate simulations.

If we use $\pi_\Mr{act} = \pi$ to generate all simulations (exploitation), we may not be covering all states. To discover new states, we should make our agent perform random actions from time to time (exploration). But if we are too random, we may not get good states very often.

One common solution is epsilon-greedy policy: $$\pi_\Mr{act}(s) = \cases{\argmax_{a\in A(s)}\hat Q_\Mr{opt}(s,a) & \text{with probability } 1-\epsilon \\ \text{random action from } A(s) & \text{with probability } \epsilon}$$

Function Approximation

To generalize better on large state spaces, we can make an assumption that $$\hat Q_\Mr{opt}(s,a;w) = w\cdot \phi(s,a)$$ for some weight vector $w$ and feature function $\phi$. This is called function approximation. Q-learning becomes gradient descent:

  • For each $(s,a,r,s')$:
    • Update $$w \gets w - \eta\crab{\hat Q_\Mr{opt}(s,a;w) - \nail{r + \gamma\hat V_\Mr{opt}(s')}}\phi(s,a)$$ where $$\hat V_\Mr{opt}(s') = \max_{a'\in A(s')} \hat Q_\Mr{opt}(s',a')$$

This is basically gradient descent on $w$.

The function approximation can be replaced with any prediction function $f:S\times A \to \RR$ (e.g., neural network).


  • CS221 slides
Exported: 2016-07-13T02:05:22.238723