Selecting a function for on-policy prediction with approximation

I am currently reading Sutton's introduction to reinforcement learning. Having reached Chapter 10 (On-policy Prediction with Approximation), I am now wondering how to choose the form of the function $\hat{q}$ whose optimal weights $\mathbf{w}$ are to be approximated.

I mean the first line of the pseudocode below from Sutton: how do I pick a good differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$? Are there standard strategies for choosing one?

[Image: pseudocode from Sutton & Barto (episodic semi-gradient Sarsa)]



1 answer


You can choose any function approximator that is differentiable. Two commonly used classes of value function approximators are:

  • Linear function approximators: linear combinations of features

     For approximating q (the action-value function):
     1. Find features that are functions of the state and action.
     2. Represent $\hat{q}$ as a weighted combination of these features.
    
          

    $$\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(s, a) = \sum_{i=1}^{d} w_i\, \phi_i(s, a)$$

    where $\boldsymbol{\phi}(s, a) \in \mathbb{R}^d$ is the feature vector whose $i$-th component is $\phi_i(s, a)$, and $\mathbf{w} \in \mathbb{R}^d$ is the weight vector whose $i$-th component is $w_i$. (A minimal code sketch of this linear case appears after the figure further below.)

  • Neural network

    Represent $\hat{q}(s, a, \mathbf{w})$ with a neural network. You can use either the action-in architecture (left in the picture below) or the action-out architecture (right in the picture below). The difference is that the first takes representations of both the state and the action as input and produces a single value (the Q-value) as output, while the second takes only a state representation $s$ as input and produces one output value for each action $a$ in the action space (the latter is easier to implement if the action space is discrete and finite).

    [Image: action-in (left) and action-out (right) neural network architectures for Q-value approximation]

    Using the first type (action-in) as the example, since it is closest to the linear case above, you can create a Q-value approximator with a neural network along the following lines (a minimal code sketch follows below):

      Represent the state-action pair as a normalized vector
      (or as a one-hot vector encoding the state and the action), then:
      1. Input layer: size = number of input features
      2. `n` hidden layers with `m` neurons each
      3. Output layer: a single output neuron
      Use a sigmoid activation function.
      Update the weights using gradient descent as per the *episodic semi-gradient Sarsa algorithm*.
    
          

    You can also use raw visual observations (if available) directly as inputs and add convolutional layers, as in the DQN paper. But read the note below regarding convergence and the additional tricks needed to stabilize such a nonlinear approximator.
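
Below is a minimal sketch of the action-in recipe above, assuming PyTorch (my choice of library here, not one the book prescribes); the sizes `n_state_features`, `n_action_features`, `n_hidden` and the use of two hidden layers are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Action-in Q approximator: input = concatenated (state, action) feature vectors,
# output = a single scalar Q-value. All sizes below are illustrative.
n_state_features = 8    # e.g. a normalized or one-hot state encoding
n_action_features = 4   # e.g. a one-hot action encoding
n_hidden = 32           # "m" neurons per hidden layer

q_net = nn.Sequential(
    nn.Linear(n_state_features + n_action_features, n_hidden),
    nn.Sigmoid(),                   # sigmoid activations, as in the recipe above
    nn.Linear(n_hidden, n_hidden),  # "n" hidden layers (here n = 2)
    nn.Sigmoid(),
    nn.Linear(n_hidden, 1),         # single output neuron: q_hat(s, a, w)
)

def q_hat(state, action):
    """Evaluate q_hat(s, a, w) for batches of state/action feature vectors."""
    return q_net(torch.cat([state, action], dim=-1)).squeeze(-1)

def semi_gradient_sarsa_update(optimizer, s, a, r, s_next, a_next, gamma=0.99):
    """One semi-gradient Sarsa step:
    w <- w + alpha * [R + gamma * q_hat(S',A',w) - q_hat(S,A,w)] * grad_w q_hat(S,A,w)."""
    with torch.no_grad():                     # freeze the bootstrapped target -> "semi"-gradient
        target = r + gamma * q_hat(s_next, a_next)
    loss = 0.5 * (target - q_hat(s, a)) ** 2  # squared TD error
    optimizer.zero_grad()
    loss.mean().backward()
    optimizer.step()
```

The `optimizer` would be something like `torch.optim.SGD(q_net.parameters(), lr=alpha)`.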


Graphically, the linear function approximator looks like this:

[Image: linearFA — diagram of the linear function approximator]



Note that $\varphi \equiv \phi$ here is an elementary (atomic) basis function and $x_i$ denotes the elements of the state vector. You can use any elementary function in place of $\phi_i$; common choices include linear regressors, radial basis functions, etc.
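
As a concrete illustration of the linear case, here is a minimal NumPy sketch; the hashed one-hot `phi` is only a stand-in for a real feature map (tile coding, polynomial or radial basis features, etc.), and `d`, `alpha`, `gamma` are illustrative values:

```python
import numpy as np

d = 16           # number of features (illustrative)
w = np.zeros(d)  # weight vector to be learned

def phi(state, action):
    """Stand-in feature map phi(s, a) in R^d: a crude hashed one-hot encoding.
    Replace with tile coding, polynomial features, RBFs, etc."""
    features = np.zeros(d)
    features[hash((state, action)) % d] = 1.0
    return features

def q_hat(state, action, w):
    # Linear approximator: q_hat(s, a, w) = w . phi(s, a) = sum_i w_i * phi_i(s, a)
    return w @ phi(state, action)

def semi_gradient_sarsa_update(w, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Semi-gradient Sarsa step; for a linear q_hat, grad_w q_hat(s, a, w) = phi(s, a)."""
    td_error = r + gamma * q_hat(s_next, a_next, w) - q_hat(s, a, w)
    return w + alpha * td_error * phi(s, a)
```

Because $\nabla_{\mathbf{w}} \hat{q}(s, a, \mathbf{w}) = \boldsymbol{\phi}(s, a)$, the gradient needs no extra computation, which is one reason the linear case is attractive (see the convergence notes below).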

Which differentiable function counts as "good" depends on the context. In the reinforcement learning setting, however, convergence properties and error bounds matter. The episodic semi-gradient Sarsa algorithm discussed in the book has convergence properties similar to those of TD(0) under a fixed policy.

Since you specifically asked about on-policy prediction, a linear function approximator is recommended, because it is guaranteed to converge. The following additional properties also make linear function approximators a good fit:

  • With a mean squared error objective, the error surface is quadratic with a single minimum. This makes the solution robust, since gradient descent is guaranteed to find the global optimum.
  • Error bound (as proven by Tsitsiklis and Van Roy, 1997, for the general TD(λ) case):

    $$\overline{\mathrm{VE}}(\mathbf{w}_{TD}) \;\le\; \frac{1}{1-\gamma}\, \min_{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w})$$

    This means that the asymptotic error will be at most $\frac{1}{1-\gamma}$ times the smallest possible error, where $\gamma$ is the discount factor. The gradient is also easy to calculate!
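
    For example, with a discount factor of $\gamma = 0.9$ this factor is $\frac{1}{1 - 0.9} = 10$, so the asymptotic error of linear semi-gradient TD is at most ten times the smallest error achievable by any weight vector in the same linear feature class.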

Using a nonlinear approximator (such as a (deep) neural network), however, does not guarantee convergence. The gradient-TD methods use the true gradient of the projected Bellman error for their updates, instead of the semi-gradient used in episodic semi-gradient Sarsa, and are known to converge even with nonlinear function approximators (and even for off-policy prediction) if certain conditions are met.
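
As a concrete illustration, here are the updates of one member of the gradient-TD family in the linear case, TDC (Sutton et al., 2009), written for on-policy prediction (in the off-policy case each update is additionally weighted by the importance-sampling ratio):

$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, \mathbf{w}_t^\top \boldsymbol{\phi}_{t+1} - \mathbf{w}_t^\top \boldsymbol{\phi}_t \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\, \delta_t\, \boldsymbol{\phi}_t - \alpha\, \gamma\, \boldsymbol{\phi}_{t+1} \left(\boldsymbol{\phi}_t^\top \mathbf{v}_t\right) \\
\mathbf{v}_{t+1} &= \mathbf{v}_t + \beta\, \left(\delta_t - \boldsymbol{\phi}_t^\top \mathbf{v}_t\right) \boldsymbol{\phi}_t
\end{aligned}
$$

Here $\mathbf{v}$ is a second learned weight vector with its own step size $\beta$; the extra correction term is what distinguishes this from the plain semi-gradient TD update.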
