Saturday, September 25, 2021

Running Q-Learning on your palm (Part 3)


Abstract
In Part 1 and Part 2 of this series, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot fetching a gem in a straight corridor). This time, I revised the app so that the robot acts on a 2D grid. However, as the grid size grows, the conventional method of holding a Q-table and updating its Q-values breaks down due to the rapid increase in memory and computation. As a promising alternative, I examined the use of a neural network, and here I present the results of implementing it in Python on a PC (not a smartphone).

Revised version of the Q-Learning app
In the revised version of the smartphone app, as shown in Fig. 1(a), the robot is trained to reach the gem while avoiding a barrier on a 4x4 grid. The Q-Learning algorithm is basically the same as last time. There is one gem and one barrier, and their positions change randomly for each play (i.e., each episode). The robot also starts from a random position. After sufficient training, the robot always reaches the gem by the shortest route. If not well trained, on the other hand, the robot often gets lost and hits a wall, as shown in Fig. 1(b).


Memory capacity required for Q-Learning
The size of the Q-table required for this learning can be calculated from the grid size and the numbers of gems and barriers (see Fig. 2). The number of Q-table entries (i.e., the number of keys) equals the total number of possible states, which in Case 1 (4x4, one gem, one barrier) is 3,360. At this level, the table can easily be held even on a smartphone, and the amount of computation stays within an acceptable range. In Case 2 (5x5), however, the total number of states jumps to over 6,000,000, even though only one gem and one barrier have been added. At that scale, whether on a smartphone or a PC, processing becomes practically impossible in terms of both memory and computation.
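For reference, these state counts can be reproduced with a few lines of Python. This is my own illustration (the helper num_states is hypothetical), assuming the robot and every gem/barrier occupy distinct cells and the object positions are counted as ordered:

def num_states(cells, objects):
    # The robot plus each gem/barrier occupies its own distinct cell,
    # so the state count is a falling product: cells * (cells-1) * ...
    n = 1
    for i in range(objects + 1):  # +1 accounts for the robot itself
        n *= cells - i
    return n

print(num_states(16, 2))  # Case 1: 4x4 grid, 1 gem + 1 barrier -> 3,360
print(num_states(25, 4))  # Case 2: 5x5 grid, 2 gems + 2 barriers -> 6,375,600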

Calculating Q-values with a neural network (without holding a Q-table)
For cases like Case 2 above, one way forward is to compute the required Q-values with a neural network instead of holding a Q-table. To do this, the Q-value update formula is transformed into a form suited to a neural network, as shown in Fig. 3. This makes it possible to compare an output with a target ([1]), so the problem can be solved with ordinary supervised machine learning. Iterating this learning brings the output closer to the target and, as a result, the Q-value closer to its exact value.
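Concretely, the standard tabular update with learning rate $\alpha$ and discount factor $\gamma$ (my rendering; Fig. 3 may use different notation) is

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right].$$

Reading the bracketed term as an error signal yields the supervised pair: the network output is $Q(s,a;\theta)$, the target is $r + \gamma \max_{a'} Q(s',a';\theta)$, and training minimizes the squared loss $(\mathrm{target} - \mathrm{output})^2$.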

Fig. 4 illustrates how to use this neural network for Case 1. Note that in this example the action "W (west)" is taken in the current state S. In this way, one learning step covers only one action in one state, so the learning should be repeated for as many actions as possible, in as many states as possible.
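As a concrete illustration, one such update step could be written with Keras roughly as below. This is a minimal sketch, not the actual program from [1]; the model is assumed to map a state vector to four Q-values ordered N, E, S, W, so the action "W" corresponds to index 3:

import numpy as np

def train_one_step(model, s, a, r, s_next, done, gamma=0.9):
    # Predict the current Q-values for state s (shape (4,): one per action)
    q = model.predict(s[np.newaxis], verbose=0)[0]
    # The target vector equals the output except at the action taken,
    # so this single fit() call adjusts only that one Q-value
    target = q.copy()
    if done:
        target[a] = r
    else:
        q_next = model.predict(s_next[np.newaxis], verbose=0)[0]
        target[a] = r + gamma * np.max(q_next)
    model.fit(s[np.newaxis], target[np.newaxis], epochs=1, verbose=0)

Because the target differs from the output only at the chosen action, each call performs exactly one learning step for one action in one state, as described above.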

Example of Q-value calculation by a neural network
I implemented this neural-network-based learning method in Python and ran it on a PC. The program is based on the Python program (using TensorFlow/Keras) published by Dr. Makoto Ito in reference [1]. Fig. 5 shows the learning process for Case 1 (4x4), trained over 10,000 randomly generated episodes. In the upper graph, the average sum of rewards per episode has reached about 0.8. Without the neural network, as shown on the right of Fig. 1(a) (the characters are small and hard to read), the corresponding value is 0.8305, so the two results are almost the same. The lower graph shows that the robot takes about 2.9 steps on average to reach the gem, which is also plausible given the situation in Fig. 1.
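For reference, a minimal Keras model sufficient for such an experiment could be defined as follows. This is my own sketch, not Dr. Ito's code from [1]; in particular, the input encoding (STATE_DIM) and the layer size are assumptions:

from tensorflow import keras

STATE_DIM = 48  # assumed encoding: three 4x4 one-hot planes (robot, gem, barrier)

model = keras.Sequential([
    keras.layers.Input(shape=(STATE_DIM,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(4)  # linear outputs: one Q-value per action
])
model.compile(optimizer='adam', loss='mse')

Each episode then places the robot, gem, and barrier at random and repeatedly applies the update step sketched above until the gem is reached or a step limit is hit.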


Although I have omitted the details, for Case 2 (5x5) I was also able to train well with this neural network. The run took only about 3 minutes on an ordinary PC, which confirms the usefulness of this method. This time I used only the most basic neural network; for more complex problems (for example, those that require remembering the location of an object), other networks such as LSTMs may be needed.

Acknowledgments
I was able to create the Q-value calculation program using a neural network by referring to the Python program published in reference [1]. I would like to thank its author, Dr. Makoto Ito.

References
[1] Makoto Ito's blog "M-note Program and Electronic Work" (in Japanese),
     http://itoshi.main.jp/tech/
