Selecting a function for on-policy prediction with approximation

I am currently reading Sutton's introduction to reinforcement learning. Having reached Chapter 10 (On-policy Prediction with Approximation), I am now wondering how to choose the form of the function $\hat{q}$ whose optimal weights $\mathbf{w}$ are to be approximated.

I mean the first line of the pseudocode below from Sutton: how do I pick a good differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$? Are there standard strategies for choosing one?

[Image: pseudocode from Sutton & Barto (episodic semi-gradient Sarsa)]



1 answer


You can choose any function approximator that is differentiable. Two commonly used classes of value function approximators are:

  • Linear function approximators: linear combinations of features

     For approximating q (the action-value function):
     1. Find features that are functions of the state and action.
     2. Represent $\hat{q}$ as a weighted combination of these features.
    
          

    $$\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(s, a) = \sum_{i=1}^{d} w_i\, \phi_i(s, a)$$

    where $\boldsymbol{\phi}(s, a) \in \mathbb{R}^d$ is the feature vector whose $i$-th component is $\phi_i(s, a)$, and $\mathbf{w} \in \mathbb{R}^d$ is the weight vector whose $i$-th component is $w_i$. (A minimal code sketch of this linear case appears after the figure further below.)

  • Neural network

    Represent $\hat{q}(s, a, \mathbf{w})$ with a neural network. You can use either the action-in architecture (left in the picture below) or the action-out architecture (right in the picture below). The difference is that the first takes representations of both the state and the action as input and produces a single value (the Q-value) as output, while the second takes only a state representation $s$ as input and produces one output value for each action $a$ in the action space (the latter is easier to implement if the action space is discrete and finite).

    [Image: action-in (left) and action-out (right) neural network architectures for Q-value approximation]

    Using the first type (action-in) as the example, since it is closest to the linear case above, you can create a Q-value approximator with a neural network along the following lines (a minimal code sketch follows below):

      Represent the state-action pair as a normalized vector
      (or as a one-hot vector encoding the state and the action), then:
      1. Input layer: size = number of input features
      2. `n` hidden layers with `m` neurons each
      3. Output layer: a single output neuron
      Use a sigmoid activation function.
      Update the weights using gradient descent as per the *episodic semi-gradient Sarsa algorithm*.
    
          

    You can also use raw visual observations (if available) directly as inputs and add convolutional layers, as in the DQN paper. But read the note below regarding convergence and the additional tricks needed to stabilize such a nonlinear approximator.
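
Below is a minimal sketch of the action-in recipe above, assuming PyTorch (my choice of library here, not one the book prescribes); the sizes `n_state_features`, `n_action_features`, `n_hidden` and the use of two hidden layers are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Action-in Q approximator: input = concatenated (state, action) feature vectors,
# output = a single scalar Q-value. All sizes below are illustrative.
n_state_features = 8    # e.g. a normalized or one-hot state encoding
n_action_features = 4   # e.g. a one-hot action encoding
n_hidden = 32           # "m" neurons per hidden layer

q_net = nn.Sequential(
    nn.Linear(n_state_features + n_action_features, n_hidden),
    nn.Sigmoid(),                   # sigmoid activations, as in the recipe above
    nn.Linear(n_hidden, n_hidden),  # "n" hidden layers (here n = 2)
    nn.Sigmoid(),
    nn.Linear(n_hidden, 1),         # single output neuron: q_hat(s, a, w)
)

def q_hat(state, action):
    """Evaluate q_hat(s, a, w) for batches of state/action feature vectors."""
    return q_net(torch.cat([state, action], dim=-1)).squeeze(-1)

def semi_gradient_sarsa_update(optimizer, s, a, r, s_next, a_next, gamma=0.99):
    """One semi-gradient Sarsa step:
    w <- w + alpha * [R + gamma * q_hat(S',A',w) - q_hat(S,A,w)] * grad_w q_hat(S,A,w)."""
    with torch.no_grad():                     # freeze the bootstrapped target -> "semi"-gradient
        target = r + gamma * q_hat(s_next, a_next)
    loss = 0.5 * (target - q_hat(s, a)) ** 2  # squared TD error
    optimizer.zero_grad()
    loss.mean().backward()
    optimizer.step()
```

The `optimizer` would be something like `torch.optim.SGD(q_net.parameters(), lr=alpha)`.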


Graphically, the linear function approximator looks like this:

[Image: linearFA — diagram of the linear function approximator]



Note that $\varphi \equiv \phi$ here is an elementary (atomic) basis function and $x_i$ denotes the elements of the state vector. You can use any elementary function in place of $\phi_i$; common choices include linear regressors, radial basis functions, etc.
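
As a concrete illustration of the linear case, here is a minimal NumPy sketch; the hashed one-hot `phi` is only a stand-in for a real feature map (tile coding, polynomial or radial basis features, etc.), and `d`, `alpha`, `gamma` are illustrative values:

```python
import numpy as np

d = 16           # number of features (illustrative)
w = np.zeros(d)  # weight vector to be learned

def phi(state, action):
    """Stand-in feature map phi(s, a) in R^d: a crude hashed one-hot encoding.
    Replace with tile coding, polynomial features, RBFs, etc."""
    features = np.zeros(d)
    features[hash((state, action)) % d] = 1.0
    return features

def q_hat(state, action, w):
    # Linear approximator: q_hat(s, a, w) = w . phi(s, a) = sum_i w_i * phi_i(s, a)
    return w @ phi(state, action)

def semi_gradient_sarsa_update(w, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Semi-gradient Sarsa step; for a linear q_hat, grad_w q_hat(s, a, w) = phi(s, a)."""
    td_error = r + gamma * q_hat(s_next, a_next, w) - q_hat(s, a, w)
    return w + alpha * td_error * phi(s, a)
```

Because $\nabla_{\mathbf{w}} \hat{q}(s, a, \mathbf{w}) = \boldsymbol{\phi}(s, a)$, the gradient needs no extra computation, which is one reason the linear case is attractive (see the convergence notes below).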

Which differentiable function counts as "good" depends on the context. In the reinforcement learning setting, however, convergence properties and error bounds matter. The episodic semi-gradient Sarsa algorithm discussed in the book has convergence properties similar to those of TD(0) under a fixed policy.

Since you specifically asked about on-policy prediction, a linear function approximator is recommended, because it is guaranteed to converge. The following additional properties also make linear function approximators a good fit:

  • With a mean squared error objective, the error surface is quadratic with a single minimum. This makes the solution robust, since gradient descent is guaranteed to find the global optimum.
  • Error bound (as proven by Tsitsiklis and Van Roy, 1997, for the general TD(λ) case):

    $$\overline{\mathrm{VE}}(\mathbf{w}_{TD}) \;\le\; \frac{1}{1-\gamma}\, \min_{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w})$$

    This means that the asymptotic error will be at most $\frac{1}{1-\gamma}$ times the smallest possible error, where $\gamma$ is the discount factor. The gradient is also easy to calculate!
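
    For example, with a discount factor of $\gamma = 0.9$ this factor is $\frac{1}{1 - 0.9} = 10$, so the asymptotic error of linear semi-gradient TD is at most ten times the smallest error achievable by any weight vector in the same linear feature class.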

Using a nonlinear approximator (such as a (deep) neural network), however, does not guarantee convergence. The gradient-TD methods use the true gradient of the projected Bellman error for their updates, instead of the semi-gradient used in episodic semi-gradient Sarsa, and are known to converge even with nonlinear function approximators (and even for off-policy prediction) if certain conditions are met.
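
As a concrete illustration, here are the updates of one member of the gradient-TD family in the linear case, TDC (Sutton et al., 2009), written for on-policy prediction (in the off-policy case each update is additionally weighted by the importance-sampling ratio):

$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, \mathbf{w}_t^\top \boldsymbol{\phi}_{t+1} - \mathbf{w}_t^\top \boldsymbol{\phi}_t \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\, \delta_t\, \boldsymbol{\phi}_t - \alpha\, \gamma\, \boldsymbol{\phi}_{t+1} \left(\boldsymbol{\phi}_t^\top \mathbf{v}_t\right) \\
\mathbf{v}_{t+1} &= \mathbf{v}_t + \beta\, \left(\delta_t - \boldsymbol{\phi}_t^\top \mathbf{v}_t\right) \boldsymbol{\phi}_t
\end{aligned}
$$

Here $\mathbf{v}$ is a second learned weight vector with its own step size $\beta$; the extra correction term is what distinguishes this from the plain semi-gradient TD update.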
