Reinforcement learning algorithms for continuous states, discrete actions

I am trying to find the optimal policy in an environment with continuous states (dim. = 20) and discrete actions (3 possible actions). There is one particular feature: under the optimal policy, one action (call it "action 0") should be chosen much more often than the other two (about 100 times more often, since those two actions are riskier).

I have tried Q-learning with a neural network approximating the Q-value function. The results were pretty bad: the NN learns to always choose "action 0". I think policy gradient methods (on the NN weights) could help, but I do not understand how to apply them with discrete actions.

Could you advise on what to try (algorithms, papers to read)? What are the modern RL algorithms when the state space is continuous and the action space is discrete?

Thanks.

1 answer


Applying Q-learning in continuous (state and/or action) spaces is not trivial. This is especially true when you try to combine Q-learning with a global function approximator such as a NN (I understand you are referring to a generic multilayer perceptron trained with backpropagation). You can read more about this on Rich Sutton's page. A better (or at least simpler) solution is to use local approximators such as Radial Basis Function networks (there is a good explanation of why in Section 4.1 of this document).
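For illustration, here is a minimal sketch of what Q-learning on fixed RBF features can look like; the number of features, the centers, the width, and the environment interface are all placeholders of my own, not something prescribed above.

```python
# Hypothetical sketch: semi-gradient Q-learning on fixed Radial Basis Function features.
# Centers, width, and learning rates are placeholder values.
import numpy as np

n_features, state_dim, n_actions = 200, 20, 3
rng = np.random.default_rng(0)
centers = rng.uniform(-1.0, 1.0, size=(n_features, state_dim))  # placeholder centers
sigma = 0.5                                                      # placeholder width
weights = np.zeros((n_actions, n_features))                      # one linear model per action

def rbf_features(state):
    # Gaussian RBF activations of the state w.r.t. each center
    dists = np.linalg.norm(centers - state, axis=1)
    return np.exp(-dists ** 2 / (2 * sigma ** 2))

def q_values(state):
    return weights @ rbf_features(state)

def q_learning_step(state, action, reward, next_state, done, alpha=0.05, gamma=0.99):
    phi = rbf_features(state)
    target = reward + (0.0 if done else gamma * q_values(next_state).max())
    td_error = target - weights[action] @ phi
    weights[action] += alpha * td_error * phi   # semi-gradient TD update
```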

On the other hand, the dimension of your state space is probably too large for local approximators. Thus, my recommendation is to use other algorithms instead of Q-learning. A very competitive algorithm for continuous states and discrete actions is Fitted Q Iteration, which is usually combined with tree-based methods to approximate the Q-function.
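As a rough sketch of Fitted Q Iteration with a tree ensemble (here scikit-learn's ExtraTreesRegressor; the `transitions` dataset format, the number of trees, and the number of iterations are my own assumptions):

```python
# Hypothetical sketch of Fitted Q Iteration with an extra-trees regressor.
# `transitions` is an assumed dataset of (state, action, reward, next_state, done) tuples.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

n_actions, gamma, n_iterations = 3, 0.99, 50

def fitted_q_iteration(transitions):
    states = np.array([t[0] for t in transitions])
    actions = np.array([t[1] for t in transitions])
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])
    dones = np.array([t[4] for t in transitions], dtype=float)

    X = np.hstack([states, actions.reshape(-1, 1)])  # regress on (state, action) pairs
    model = None
    for _ in range(n_iterations):
        if model is None:
            targets = rewards                        # first iteration: Q ~ immediate reward
        else:
            # max over actions of the previous iteration's Q estimate at the next state
            next_q = np.column_stack([
                model.predict(np.hstack([next_states,
                                         np.full((len(next_states), 1), a)]))
                for a in range(n_actions)])
            targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
        model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return model
```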



Finally, when the number of actions is small, as in your case, it is common practice to use an independent approximator for each action, i.e., instead of a single approximator that takes a state-action pair as input and returns a Q-value, use three approximators, one per action, each taking only the state as input. You can find an example of this in Example 3.1 of the book Reinforcement Learning and Dynamic Programming Using Function Approximators.
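To make the one-approximator-per-action idea concrete (the details below are my own sketch, not taken from the book), the fitted-Q example above can be restructured so that each action gets its own regressor trained only on state inputs:

```python
# Hypothetical sketch: one independent Q-regressor per action, each taking only the state.
# Assumes every action appears at least once in the dataset.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

n_actions, gamma = 3, 0.99

def fit_per_action_models(states, actions, rewards, next_states, dones, models=None):
    # One Bellman-backup fitting pass; `models` holds the previous iteration's regressors.
    if models is None:
        targets = rewards
    else:
        next_q = np.column_stack([m.predict(next_states) for m in models])
        targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
    new_models = []
    for a in range(n_actions):
        mask = actions == a            # fit each model only on transitions using its action
        new_models.append(ExtraTreesRegressor(n_estimators=50).fit(states[mask], targets[mask]))
    return new_models

def greedy_action(models, state):
    return int(np.argmax([m.predict(state.reshape(1, -1))[0] for m in models]))
```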
