# Summary Table

| Topic | Algorithm | Input | Output |
|---|---|---|---|
| MDP | Policy Evaluation | $S$, $A$, $T$, $R$, $\pi$ | $V_\pi$, $Q_\pi$ |
| MDP | Policy Iteration / Value Iteration | $S$, $A$, $T$, $R$ | $\pi_\Mr{opt}$, $V_\Mr{opt}$, $Q_\Mr{opt}$ |
| RL | Model-free Monte Carlo / SARSA | $S$, $A$, $\pi$, simulation by $\pi$ | $V_\pi$, $Q_\pi$ |
| RL | Model-based Monte Carlo | $S$, $A$, simulation by any $\pi$ | $T$, $R$ (becomes an MDP → can then find $\pi_\Mr{opt}$, $V_\Mr{opt}$, $Q_\Mr{opt}$) |
| RL | Q-Learning | $S$, $A$, simulation by any $\pi$ | $\pi_\Mr{opt}$, $V_\Mr{opt}$, $Q_\Mr{opt}$ |

# Markov Decision Process

## Setup

**Definition** A **Markov decision process** (**MDP**) consists of

- $S$ = states, with start state $s_{\text{start}}\in S$
- $A(s)$ = actions from state $s$
- $T(s,a,s')$ = probability of reaching $s'$ after taking action $a$ in state $s$
- $0 \leq \gamma \leq 1$ = discount factor
- $R(s,a,s')$ = reward of the transition $(s,a,s')$
- $\Mr{IsEnd}(s)$ = whether $s$ is an end state

**Definition** A **policy** $\pi$ is a deterministic map from each state $s\in S$ to an action $a\in A(s)$.

**Definition** The **utility** of a random path generated by $\pi$ is the discounted sum of the rewards: for a path (**episode**)
$$s_0, a_1, r_1, s_1, a_2, r_2, s_2, \dots$$
the utility is
$$u = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$$
The **value** of $\pi$ is the expected utility.

**Definition** For a given $\pi$,

- $V_\pi(s)$ (**value**) = expected utility of following $\pi$ from state $s$ (the term "value" is overloaded)
- $Q_\pi(s,a)$ (**Q-value**) = expected utility of taking action $a$ from state $s$ and then following $\pi$

## Policy Evaluation

**Goal** Given an MDP $(S,A,T,R)$ and a policy $\pi$, compute $V_\pi(\cdot)$.

**Recurrence**
$$V_\pi(s) = \cases{
0 & \Mr{IsEnd}(s) \\
Q_\pi(s,\pi(s)) & \cotherw}$$
$$Q_\pi(s,a) = \sum_{s'}T(s,a,s')\crab{R(s,a,s') + \gamma V_\pi(s')}$$

**Algorithm** (**Policy Evaluation**)

- Initialize $V_\pi(s) \gets 0$ for all $s$
- Until convergence:
  - For each $s \in S$, update $$V_\pi(s) \gets \sum_{s'}T(s,\pi(s),s')\crab{R(s,\pi(s),s') + \gamma V_\pi(s')}$$ where the term in the sum is just the current estimate of $Q_\pi(s,\pi(s))$.
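
As a concrete illustration, here is a minimal Python sketch of Policy Evaluation on a made-up two-state MDP; the states, transition probabilities, and rewards below are invented for the example, not part of the notes.

```python
# A minimal sketch of Policy Evaluation; `states`, `T`, `R`, and
# `policy` encode a made-up two-state MDP for illustration.
GAMMA = 0.9

states = ["s0", "s1"]
# T[(s, a)] = list of (s', probability); R[(s, a, s')] = reward
T = {("s0", "stay"): [("s0", 0.8), ("s1", 0.2)],
     ("s1", "stay"): [("s1", 1.0)]}
R = {("s0", "stay", "s0"): 1.0, ("s0", "stay", "s1"): 0.0,
     ("s1", "stay", "s1"): 0.0}
policy = {"s0": "stay", "s1": "stay"}

V = {s: 0.0 for s in states}  # initialize V_pi(s) <- 0 for all s
while True:
    new_V = {}
    for s in states:
        a = policy[s]
        # Current estimate of Q_pi(s, pi(s))
        new_V[s] = sum(p * (R[(s, a, s2)] + GAMMA * V[s2])
                       for s2, p in T[(s, a)])
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-9:
        break  # converged
    V = new_V

print(V)  # V_pi for each state
```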

## Policy Optimization

**Goal** Given an MDP $(S,A,T,R)$, find a policy $\pi$ that maximizes the value.

We give two algorithms: **Policy Iteration** and **Value Iteration**.

**Algorithm** (**Policy Iteration**) Updates $\pi$ directly.

- Initialize $\pi$ arbitrarily
- Until convergence:
  - Run Policy Evaluation to compute $V_\pi$
  - Compute $Q_\pi$ from $V_\pi$ using the recurrence
  - For each $s\in S$, update $$\pi(s) \gets \argmax_{a\in A(s)} Q_\pi(s,a)$$
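
A minimal sketch of Policy Iteration, again on a made-up two-state MDP (`states`, `A`, `T`, `R` are invented for the example):

```python
# A minimal sketch of Policy Iteration; the MDP encoding mirrors the
# Policy Evaluation sketch above and is invented for illustration.
GAMMA = 0.9
states = ["s0", "s1"]
A = {"s0": ["stay", "go"], "s1": ["stay"]}
T = {("s0", "stay"): [("s0", 1.0)], ("s0", "go"): [("s1", 1.0)],
     ("s1", "stay"): [("s1", 1.0)]}
R = {("s0", "stay", "s0"): 0.0, ("s0", "go", "s1"): 1.0,
     ("s1", "stay", "s1"): 2.0}

def q(s, a, V):
    """Q_pi(s, a) computed from V via the recurrence."""
    return sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in T[(s, a)])

def evaluate(pi):
    """Policy Evaluation (a fixed number of sweeps for simplicity)."""
    V = {s: 0.0 for s in states}
    for _ in range(500):
        V = {s: q(s, pi[s], V) for s in states}
    return V

pi = {s: A[s][0] for s in states}  # initialize pi arbitrarily
while True:
    V = evaluate(pi)
    # Greedy improvement: pi(s) <- argmax_a Q_pi(s, a)
    new_pi = {s: max(A[s], key=lambda a: q(s, a, V)) for s in states}
    if new_pi == pi:
        break  # policy stable: converged
    pi = new_pi

print(pi)  # {'s0': 'go', 's1': 'stay'}
```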

**Algorithm** (**Value Iteration**) Computes the optimal value $V_\Mr{opt}(s)$ directly, and then derives $\pi_\Mr{opt}$ from $V_\Mr{opt}(s)$.

- Initialize $V_\Mr{opt}(s) \gets 0$ for all $s$
- Until convergence:
  - For each state $s$, update $$V_\Mr{opt}(s) \gets \max_{a\in A(s)} \sum_{s'}T(s,a,s')\crab{R(s,a,s') + \gamma V_\Mr{opt}(s')}$$ where the term inside the max is just the current estimate of $Q_\Mr{opt}(s,a)$.
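
A matching sketch of Value Iteration on the same made-up MDP as in the Policy Iteration sketch; the only change from the Policy Evaluation update is the $\max$ over actions:

```python
# A minimal sketch of Value Iteration; the two-state MDP is invented
# for illustration (same encoding as the sketches above).
GAMMA = 0.9
states = ["s0", "s1"]
A = {"s0": ["stay", "go"], "s1": ["stay"]}
T = {("s0", "stay"): [("s0", 1.0)], ("s0", "go"): [("s1", 1.0)],
     ("s1", "stay"): [("s1", 1.0)]}
R = {("s0", "stay", "s0"): 0.0, ("s0", "go", "s1"): 1.0,
     ("s1", "stay", "s1"): 2.0}

def q(s, a, V):
    """Current estimate of Q_opt(s, a) given the current V."""
    return sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in T[(s, a)])

V = {s: 0.0 for s in states}  # initialize V_opt(s) <- 0 for all s
while True:
    new_V = {s: max(q(s, a, V) for a in A[s]) for s in states}
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-9:
        break  # converged
    V = new_V

# Derive pi_opt as the argmax of the final Q_opt estimates
pi = {s: max(A[s], key=lambda a: q(s, a, V)) for s in states}
print(V, pi)  # V ~ {'s0': 19.0, 's1': 20.0}, pi = {'s0': 'go', 's1': 'stay'}
```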

**Theorem** If either $\gamma < 1$ or the MDP graph is acyclic, then both algorithms above converge to the correct answer. (These conditions are sufficient but not necessary.)

# Reinforcement Learning

## Setup

If we don't know $T$ and $R$ (i.e., we know only $S$ and $A$), then we get **reinforcement learning**.

Since we don't know $T$ and $R$, we have to "learn" them by having an agent interact with the MDP. The general algorithm template is

- For each iteration $t$:
  - Let the agent choose an action $a_t = \pi_\Mr{act}(s_{t-1})$ *somehow*
  - Receive reward $r_t$ and new state $s_t$
  - Update parameters *somehow*
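
A minimal sketch of this loop, with both "somehow"s left as stubs; `env_step` and `pi_act` are hypothetical placeholders for a real simulator and action-selection rule, not an API from the notes:

```python
# A minimal sketch of the agent-environment loop; dynamics and action
# choice are stand-in stubs for illustration only.
import random

def env_step(s, a):
    """Hypothetical environment: returns (reward, next state)."""
    return random.random(), s  # placeholder dynamics

def pi_act(s):
    """Placeholder action choice ("somehow")."""
    return "stay"

s = "s0"  # start state
for t in range(100):
    a = pi_act(s)        # choose an action somehow
    r, s = env_step(s, a)  # receive reward r_t and new state s_t
    # ... update parameters somehow (see the algorithms below) ...
```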

## On-Policy Methods

**Goal** Given simulations
$$s_0, a_1, r_1, s_1, a_2, r_2, s_2, \dots, a_n, r_n, s_n$$
from a specific policy $\pi$, estimate $Q_\pi(s,a)$.

One way is to compute the expected utility from the simulations: $$u_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$$ $$\hat Q_\pi(s,a) = \text{average of } u_t \text{ where } (s_{t-1}, a_t) = (s,a)$$ Equivalently, we can use the following algorithm to compute $\hat Q_\pi(s,a)$:

**Algorithm** (**Model-free Monte Carlo**)

- For each $(s,a,u)$ from the simulation:
  - Let $$\eta = \frac{1}{1 + \text{number of updates to }(s,a)}$$
  - Update using interpolation $$\hat Q_\pi(s,a) \gets (1-\eta)\hat Q_\pi(s,a) + \eta u$$ or equivalently, using the "gradient" form $$\hat Q_\pi(s,a) \gets \hat Q_\pi(s,a) - \eta\crab{\hat Q_\pi(s,a) - u}$$
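
A minimal sketch of Model-free Monte Carlo on one invented episode; the utilities $u_t$ are the discounted suffix sums, computed here with a single backward pass:

```python
# A minimal sketch of Model-free Monte Carlo; `episode` is a made-up
# list of (s_{t-1}, a_t, r_t) triples from one simulation.
from collections import defaultdict

GAMMA = 0.9
episode = [("s0", "go", 1.0), ("s1", "stay", 2.0), ("s1", "stay", 0.0)]

# u_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
utilities, u = [], 0.0
for s, a, r in reversed(episode):
    u = r + GAMMA * u
    utilities.append((s, a, u))
utilities.reverse()

Q_hat = defaultdict(float)      # Q_pi estimates, initialized to 0
num_updates = defaultdict(int)  # update counts per (s, a)
for s, a, u in utilities:
    eta = 1.0 / (1 + num_updates[(s, a)])  # step size 1/(1 + #updates)
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * u
    num_updates[(s, a)] += 1

print(dict(Q_hat))  # with this eta schedule, each entry is the average of its u's
```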

In **Model-free Monte Carlo**, we have to compute $u$, which is expensive. As an alternative, **SARSA** approximates $u$ using the current reward plus the estimate of future reward.

**Algorithm** (**SARSA**)

- For each $(s,a,r,s',a')$ from the simulation:
  - Let $$\eta = \frac{1}{1 + \text{number of updates to }(s,a)}$$
  - Update using interpolation $$\hat Q_\pi(s,a) \gets (1-\eta)\hat Q_\pi(s,a) + \eta\crab{r + \gamma \hat Q_\pi(s',a')}$$
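
A minimal SARSA sketch; the `transitions` below are invented. Note that the update bootstraps from $\hat Q_\pi(s',a')$ instead of waiting for the full utility $u$:

```python
# A minimal sketch of SARSA; `transitions` is a made-up sequence of
# (s, a, r, s', a') tuples taken from a single simulation of pi.
from collections import defaultdict

GAMMA = 0.9
Q_hat = defaultdict(float)
num_updates = defaultdict(int)

transitions = [("s0", "go", 1.0, "s1", "stay"),
               ("s1", "stay", 2.0, "s1", "stay")]
for s, a, r, s2, a2 in transitions:
    eta = 1.0 / (1 + num_updates[(s, a)])
    target = r + GAMMA * Q_hat[(s2, a2)]  # bootstrap instead of full utility u
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * target
    num_updates[(s, a)] += 1

print(dict(Q_hat))
```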

## Off-Policy Methods

**Goal** Given simulations
$$s_0, a_1, r_1, s_1, a_2, r_2, s_2, \dots, a_n, r_n, s_n$$
from some policy, find the optimal policy and values.

There are two ways to do this. One way is **model-based Monte Carlo**, which recovers the MDP before computing $\pi_\Mr{opt}$ and $V_\Mr{opt}$. The $T$ and $R$ of the MDP can be approximated as
$$\hat T(s,a,s') = \frac{\Mr{count}(s,a,s')}{\Mr{count}(s,a)}$$
$$\hat R(s,a,s') = \text{average of } r \text{ in sequences } (s,a,r,s')$$
Then we can use value iteration or policy iteration to find the optimal policy and values.
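
A minimal sketch of the counting step; the `transitions` list is invented for illustration:

```python
# A minimal sketch of estimating T-hat and R-hat from counts;
# `transitions` is a made-up list of (s, a, r, s') tuples.
from collections import defaultdict

count_sas = defaultdict(int)     # count(s, a, s')
count_sa = defaultdict(int)      # count(s, a)
reward_sum = defaultdict(float)  # running sum of r for each (s, a, s')

transitions = [("s0", "go", 1.0, "s1"), ("s0", "go", 0.5, "s1"),
               ("s0", "go", 1.0, "s0")]
for s, a, r, s2 in transitions:
    count_sas[(s, a, s2)] += 1
    count_sa[(s, a)] += 1
    reward_sum[(s, a, s2)] += r

T_hat = {k: c / count_sa[k[:2]] for k, c in count_sas.items()}
R_hat = {k: reward_sum[k] / c for k, c in count_sas.items()}
print(T_hat)  # {('s0','go','s1'): 0.667, ('s0','go','s0'): 0.333}
print(R_hat)  # {('s0','go','s1'): 0.75,  ('s0','go','s0'): 1.0}
```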

The other way, **Q-learning**, approximates the optimal values directly without finding $T$ and $R$.

**Algorithm** (**Q-Learning**)

- For each $(s,a,r,s')$:
  - Update $$\hat Q_\Mr{opt}(s,a) \gets (1-\eta)\hat Q_\Mr{opt}(s,a) + \eta\crab{r + \gamma \hat V_\Mr{opt}(s')}$$ where $$\hat V_\Mr{opt}(s') = \max_{a'\in A(s')} \hat Q_\Mr{opt}(s',a')$$

Again, $V_\Mr{opt}$ and $\pi_\Mr{opt}$ can be recovered from $Q_\Mr{opt}$.
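
A minimal Q-Learning sketch; the transitions and action sets are invented, and $\eta$ is held at a fixed constant for simplicity (the $1/(1+\text{number of updates})$ schedule from the earlier algorithms would also work):

```python
# A minimal sketch of the Q-Learning update; `transitions` and the
# action sets `A` are made-up for illustration.
from collections import defaultdict

GAMMA, ETA = 0.9, 0.5
A = {"s0": ["stay", "go"], "s1": ["stay"]}
Q_hat = defaultdict(float)

transitions = [("s0", "go", 1.0, "s1"), ("s1", "stay", 2.0, "s1")]
for s, a, r, s2 in transitions:
    # V_opt(s') = max over next actions of Q_opt(s', a')
    V_opt_s2 = max(Q_hat[(s2, a2)] for a2 in A[s2])
    Q_hat[(s, a)] = (1 - ETA) * Q_hat[(s, a)] + ETA * (r + GAMMA * V_opt_s2)

print(dict(Q_hat))
```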

## Exploration vs. Exploitation

Consider the policy $\pi_\Mr{act}$, which is used to generate simulations.

If we use $\pi_\Mr{act} = \pi$ to generate all simulations (**exploitation**), we may not cover all states. To discover new states, we should make the agent take random actions from time to time (**exploration**). But if we act too randomly, we will rarely reach good states.

One common solution is the **epsilon-greedy policy**:
$$\pi_\Mr{act}(s) = \cases{\argmax_{a\in A(s)}\hat Q_\Mr{opt}(s,a) & \text{with probability } 1-\epsilon \\ \text{random action from } A(s) & \text{with probability } \epsilon}$$
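
A minimal sketch of the epsilon-greedy choice; `A` and `Q_hat` below are made-up stand-ins for the action sets and current Q estimates:

```python
# A minimal sketch of an epsilon-greedy action choice; the action set
# and Q values are invented for illustration.
import random

EPSILON = 0.1
A = {"s0": ["stay", "go"]}
Q_hat = {("s0", "stay"): 0.3, ("s0", "go"): 0.7}

def pi_act(s):
    if random.random() < EPSILON:
        return random.choice(A[s])                 # explore: random action
    return max(A[s], key=lambda a: Q_hat[(s, a)])  # exploit: greedy action

print(pi_act("s0"))  # "go" with probability 0.95 here
```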

## Function Approximation

To generalize better over large state spaces, we can assume that
$$\hat Q_\Mr{opt}(s,a;w) = w\cdot \phi(s,a)$$
for some weight vector $w$ and feature function $\phi$. This is called **function approximation**. Q-learning then becomes gradient descent:

- For each $(s,a,r,s')$:
  - Update $$w \gets w - \eta\crab{\hat Q_\Mr{opt}(s,a;w) - \nail{r + \gamma\hat V_\Mr{opt}(s')}}\phi(s,a)$$ where $$\hat V_\Mr{opt}(s') = \max_{a'\in A(s')} \hat Q_\Mr{opt}(s',a')$$

Each update is a gradient step on the squared prediction error $\frac{1}{2}\nail{\hat Q_\Mr{opt}(s,a;w) - \nail{r + \gamma\hat V_\Mr{opt}(s')}}^2$ with respect to $w$, treating the target as a constant.
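
A minimal sketch of this update using NumPy; the 3-dimensional feature function `phi`, the action sets, and the transitions are invented for the example:

```python
# A minimal sketch of Q-Learning with linear function approximation;
# features, actions, and transitions are made-up illustrations.
import numpy as np

GAMMA, ETA = 0.9, 0.1
A = {"s0": ["stay", "go"], "s1": ["stay"]}

def phi(s, a):
    """Hypothetical 3-dimensional feature vector for (s, a)."""
    return np.array([1.0, float(s == "s1"), float(a == "go")])

w = np.zeros(3)

def Q_hat(s, a):
    return w @ phi(s, a)  # Q_opt(s, a; w) = w . phi(s, a)

for s, a, r, s2 in [("s0", "go", 1.0, "s1"), ("s1", "stay", 2.0, "s1")]:
    V_opt_s2 = max(Q_hat(s2, a2) for a2 in A[s2])
    # Gradient step: w <- w - eta * (prediction - target) * phi(s, a)
    w = w - ETA * (Q_hat(s, a) - (r + GAMMA * V_opt_s2)) * phi(s, a)

print(w)
```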

The linear function approximation can be replaced with any prediction function $f:S\times A \to \RR$ (e.g., a neural network).

# Reference

- CS221 slides