Q-learning: Re-learning after the environment changes

I have implemented Q-learning on an n x n grid with a single reward of 100 in the middle. The agent trains for 1000 epochs to reach the goal with the following policy: with probability 0.8 it chooses the move with the highest state-action value, and with probability 0.2 it chooses a random move. After each move, the state-action value is updated with the Q-learning rule.
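Roughly, the setup looks like the following sketch (simplified, with illustrative hyper-parameters such as alpha and gamma; this is not my exact code):

    import numpy as np

    n = 9                                          # grid size (n x n)
    goal = (n // 2, n // 2)                        # single reward of 100 in the middle
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    rewards = np.zeros((n, n))
    rewards[goal] = 100

    Q = np.zeros((n, n, len(actions)))             # state-action values
    alpha, gamma, epsilon = 0.1, 0.9, 0.2          # illustrative values

    def step(state, a):
        """Apply action a, clipping at the grid border."""
        r = min(max(state[0] + actions[a][0], 0), n - 1)
        c = min(max(state[1] + actions[a][1], 0), n - 1)
        return (r, c)

    for epoch in range(1000):
        state = (0, 0)
        while state != goal:
            # best move with probability 0.8, random move with probability 0.2
            if np.random.rand() < epsilon:
                a = np.random.randint(len(actions))
            else:
                a = int(np.argmax(Q[state]))
            nxt = step(state, a)
            # Q-learning update rule
            Q[state][a] += alpha * (rewards[nxt] + gamma * np.max(Q[nxt]) - Q[state][a])
            state = nxt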

Then I ran the following experiment: all the fields adjacent to the target were given a reward of -100, except the one below it. After training for 1000 epochs, the agent clearly avoids the upper path and almost always reaches the target from below.

After that training, I set the reward of the lower neighbor to -100 and the upper neighbor back to 0, kept the learned state-action value map, and trained again for 1000 epochs. The result is really awful! The agent takes a very long time to find the target (up to 3 minutes on a 9x9 grid). Looking at the paths, I saw that the agent jumps between two states for a long time, e.g. (0,0) → (1,0) → (0,0) → (1,0) ...
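In code, the only thing that changes between the two phases is the reward grid; the learned Q table is kept (again a sketch using the names above; I only mark the direct neighbors here):

    # phase 1: punish all neighbors of the goal except the one below it
    for dr, dc in [(-1, 0), (0, -1), (0, 1)]:      # above, left, right
        rewards[goal[0] + dr, goal[1] + dc] = -100
    # ... train for 1000 epochs ...

    # phase 2: swap the open side, but keep the learned Q table
    rewards[goal[0] - 1, goal[1]] = 0              # upper neighbor back to 0
    rewards[goal[0] + 1, goal[1]] = -100           # lower neighbor now punished
    # ... train for another 1000 epochs with the same Q ...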

I find it hard to see why it behaves like this. Has anyone come across a similar situation?

+3




4 answers


Q-learning depends on exploration.

If you are using ε-greedy and you have reduced epsilon significantly, it is unlikely that the agent will be able to adapt.



If your changes to the environment are far from the trajectories followed by the learned policy, it can be hard for the agent to ever reach those areas.

I would suggest that you take a look at your epsilon values and how quickly you decrease them over time.
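To illustrate the point (made-up numbers, not taken from the question): with a typical exponential decay, epsilon is already tiny long before epoch 1000, so after the change the agent almost never tries the newly opened side.

    # illustrative epsilon decay schedule
    epsilon_start, decay = 0.2, 0.99
    for epoch in range(1000):
        epsilon = epsilon_start * decay ** epoch
        if epoch % 250 == 0:
            print(epoch, round(epsilon, 4))
    # prints: 0 0.2 / 250 0.0162 / 500 0.0013 / 750 0.0001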

+2




I suppose more information would help me give a better answer, but what you describe is about what I would expect. The agent learned (and learned well) one specific path to the goal. Now you have changed that. My gut tells me this will be harder for the agent than simply moving the goal, because what you changed is the way you want it to reach the goal.



You can increase the randomness of the action-selection policy for a number of iterations after moving the wall. This can reduce the time it takes the agent to find a new path to the goal.
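One simple way to do that is to bump epsilon back up right after the change and anneal it down again (a sketch; the function name and numbers are arbitrary):

    def epsilon_after_change(epochs_since_change,
                             boosted=0.8, normal=0.2, anneal_epochs=200):
        """Boost exploration after the environment changes, then anneal back."""
        if epochs_since_change >= anneal_epochs:
            return normal
        frac = epochs_since_change / anneal_epochs
        return boosted + frac * (normal - boosted)   # linear anneal 0.8 -> 0.2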

0




This is fairly typical behavior for the standard Q-learning algorithm. As stated in Concurrent Q-Learning: Reinforcement Learning for Dynamic Goals and Environments:

Reinforcement learning methods, such as temporal difference learning, have been shown to perform well on tasks involving navigation to a fixed goal. However, if the goal location moves, the previously learned information interferes with finding the new goal location, and performance suffers accordingly.

There are, however, other algorithms, for example the one described in the article above, that do much better in such a situation.

0




Can you provide the code? This behavior looks surprising to me.

IMHO, the agent must be able to unlearn previously acquired knowledge, and there shouldn't be anything like "confidence" in reinforcement learning. The grid looks like

00000
00--0
0-+-0
0---0
00000

      

in the last attempt. The probability of randomly hitting the target via the shortest path is 0.2*1/3 * (0.8+0.2*1/9): basically, walk diagonally at random and then go down. Hence the algorithm will be slow to update the Q value of state (1,1); the chance of actually updating that value is only about 5%. If your learning rate isn't too low, it will eventually get updated. Note that any other path that reaches the target will slowly pull the values along the old path towards zero.
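For what it is worth, plugging the numbers in gives roughly five percent:

    p = 0.2 * (1/3) * (0.8 + 0.2 * (1/9))
    print(p)   # ~0.0548, i.e. about 5%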

You stated that the agent jumps back and forth between the first two states. This suggests that you don't have a discount factor. That can lead to a situation where the two states (0,0) and (1,0) both have fairly high Q values, but those values only reinforce each other. Alternatively, you might have forgotten to subtract the old value in the update rule.

0








