Policy Iteration and Value Iteration

In reinforcement learning, I am trying to understand the difference between policy iteration and value iteration. There are some general answers to this question, but I have two specific questions that I cannot find an answer to.

1) I've heard that policy iteration "works forwards" whereas value iteration "works backwards". What does that mean? I thought both methods just take each state, look at all the states it can reach, and compute the state's value from those - either by marginalizing over the policy's action distribution (policy iteration), or by taking the argmax over the action values (value iteration). So why is there any notion of a "direction" in which each method "moves"?

2) Policy iteration requires an iterative inner loop during policy evaluation to find the value function of the current policy, whereas value iteration performs only a single evaluation sweep per update. Why the difference? Why does value iteration converge with just one evaluation step?

Thanks!





2 answers


The answer provided by @Nick Walker is correct and quite complete; however, I would like to add a graphical explanation of the difference between Value Iteration and Policy Iteration that might help answer the second part of your question.

Both PI and VI follow the same operating principle, based on Generalized Policy Iteration. This basically means that they alternate between improving the policy (which requires knowing its value function) and computing the value function of the new, improved policy.

[Figure: the generalized policy iteration cycle, alternating between policy evaluation and policy improvement]

At the end of this iterative process, both the values and the policies converge towards the optimum.
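To make that alternation concrete, here is a minimal sketch of the GPI loop in Python. The helpers evaluate_policy and improve_policy are hypothetical placeholders I am assuming for illustration (the first returns a value function for the current policy, the second returns a policy that is greedy with respect to it); they are not part of any particular library.

    # Rough sketch of Generalized Policy Iteration (GPI).
    # `evaluate_policy` and `improve_policy` are hypothetical helpers supplied
    # by the caller; `initial_policy` maps each state to an action.
    def generalized_policy_iteration(states, evaluate_policy, improve_policy, initial_policy):
        policy = initial_policy
        while True:
            V = evaluate_policy(policy)           # policy evaluation
            new_policy = improve_policy(V)        # policy improvement (greedy w.r.t. V)
            if all(new_policy[s] == policy[s] for s in states):
                return policy, V                  # policy is stable: both are optimal
            policy = new_policy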



However, as has been noted, it is not necessary to compute the full value function exactly; a single evaluation sweep is enough to guarantee convergence. In the following figure, (b) shows the operations performed by Policy Iteration, where the full value function is computed at each step, while (d) shows how Value Iteration works.

[Figure: (b) Policy Iteration, computing the full value function before each improvement; (d) Value Iteration, using a single evaluation sweep between improvements]

Obviously, this presentation of both methods is simplistic, but it highlights the difference between the core ideas of the two algorithms.
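To spell out that difference as code, here is a rough sketch of the two inner updates (my own illustration, assuming a tabular model where P[s][a] is a list of (probability, next_state, reward) triples and policy maps each state to an action). Policy iteration repeats the first sweep until the values stop changing before improving the policy; value iteration performs the second sweep just once between improvements.

    GAMMA = 0.9

    def expected_return(P, V, s, a):
        # One-step lookahead: expected immediate reward plus discounted value
        # of the successor state, for taking action a in state s.
        return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

    def evaluation_sweep(P, V, policy):
        # Policy iteration repeats this sweep until V converges for the fixed policy.
        return {s: expected_return(P, V, s, policy[s]) for s in P}

    def value_iteration_sweep(P, V):
        # Value iteration does a single sweep that maximizes over actions directly.
        return {s: max(expected_return(P, V, s, a) for a in P[s]) for s in P}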





I've heard that policy iteration "works forwards" and value iteration "works backwards". What does that mean?

I cannot find anything online that describes policy iteration and value iteration in terms of direction, and as far as I know this is not a common way to explain the difference between the two.

One possibility is that someone was referring to a visual representation of values propagating in value iteration. After the first sweep, the values are correct at a 1-timestep horizon: each value correctly tells you what to do to maximize your cumulative reward if you have one timestep to live. This means that transitions into a terminal state that yield a reward have positive values, and most of the rest are 0. With each sweep, the values become correct at a horizon one timestep longer. So the values propagate backwards from the terminal states toward the start states as the horizon expands. In policy iteration, instead of just pushing values back one step at a time, you compute the complete value function for the current policy; then you improve the policy and repeat. I can't say this has a forward flavor, but it certainly doesn't have the same backwards look. You may want to see Pablo's answer to a similar question for another explanation of the differences, which might help you contextualize what you have heard.
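As an illustration of that backward propagation (my own toy example, not from the original answer): in a deterministic 5-state chain where only the transition into the terminal state on the right yields reward 1, each value-iteration sweep pushes the nonzero values one state further back from the goal.

    # Toy chain: states 0..4, state 4 is terminal.
    # The only action moves one state to the right; entering state 4 yields reward 1.
    GAMMA = 0.9
    N = 5
    V = [0.0] * N

    for sweep in range(1, 5):
        new_V = V[:]
        for s in range(N - 1):                        # terminal state keeps value 0
            reward = 1.0 if s + 1 == N - 1 else 0.0
            new_V[s] = reward + GAMMA * V[s + 1]      # single action, so the max is trivial
        V = new_V
        print(f"after sweep {sweep}: {[round(v, 2) for v in V]}")
    # after sweep 1: [0.0, 0.0, 0.0, 1.0, 0.0]
    # after sweep 2: [0.0, 0.0, 0.9, 1.0, 0.0]
    # after sweep 3: [0.0, 0.81, 0.9, 1.0, 0.0]
    # after sweep 4: [0.73, 0.81, 0.9, 1.0, 0.0]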

It is also possible that you heard this forward/backward contrast in relation to something related but different: the implementation of temporal-difference learning algorithms. In that case, the direction refers to the direction you look in when updating state values: forwards means you need information about the outcomes of future actions, and backwards means you only need information about what happened earlier. You can read about this in Chapter 12 of Reinforcement Learning: An Introduction, 2nd edition.
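For what it's worth, here is a minimal sketch of the backward view being referred to (my own illustration, not code from the book): TD(lambda) keeps an eligibility trace per state, so each new TD error is credited to previously visited states, and no look-ahead beyond the single next transition is ever required.

    from collections import defaultdict

    def td_lambda_episode(transitions, V, alpha=0.1, gamma=0.9, lam=0.8):
        # Backward view of TD(lambda): each update uses only the current TD error
        # and the eligibility traces of states visited earlier in the episode.
        # `transitions` is a list of (state, reward, next_state) tuples;
        # `V` is a dict mapping every state (including terminal ones) to a value.
        eligibility = defaultdict(float)
        for s, r, s_next in transitions:
            td_error = r + gamma * V[s_next] - V[s]
            eligibility[s] += 1.0                     # accumulating trace
            for state in eligibility:
                V[state] += alpha * td_error * eligibility[state]
                eligibility[state] *= gamma * lam     # traces decay over time
        return V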



Why does policy iteration have to do a bunch of value function computations, when value iteration seems to do only the one that ends up being optimal? Why does value iteration converge in just one evaluation step?

In policy evaluation, we already have a policy and we are simply calculating the value of taking the actions it dictates. Policy evaluation visits each state many times and shifts the state's value toward the values of the states that the policy's action leads to, until the values stop changing and we consider them converged. This value function is not optimal. It is useful only because we can combine it with the policy improvement theorem to improve the policy. The expensive step of extracting a new policy, which requires maximizing over actions, happens only rarely, and policies tend to converge quite quickly. So while the policy evaluation step looks like it would take a long time, PI is actually quite fast.
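Here is a minimal sketch of that iterative policy evaluation loop (my own illustration, assuming a tabular model where P[s][a] is a list of (probability, next_state, reward) triples, every next state appears as a key of P, and policy maps each state to an action).

    def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
        # Sweep the states repeatedly, shifting each value toward the one-step
        # lookahead under the fixed policy, until the largest change is tiny.
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V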

Value iteration is simply policy iteration in which you do exactly one iteration of policy evaluation and extract the new policy at the same time (maximizing over actions is an implicit policy extraction). Then you repeat this evaluate-and-extract procedure over and over until the values stop changing. The fact that these steps are bundled together makes it simpler on paper, but maximizing on every sweep is expensive, and it means that value iteration is often slower than policy iteration.
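And a matching sketch of value iteration under the same tabular assumptions, where the max over actions plays the role of the implicit policy extraction; the explicit greedy policy only needs to be extracted once, at the very end.

    def value_iteration(P, gamma=0.9, theta=1e-6):
        # One evaluation sweep and the greedy improvement are fused into a
        # single max over actions; repeat until the values stop changing.
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Extract the greedy policy from the converged values.
        policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                                 for p, s2, r in P[s][a]))
                  for s in P}
        return policy, V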









