Saturday, September 25, 2021

Running Q-Learning on your palm (Part3)

Japanese summary: In Part 1 and Part 2 of this series, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot obtains a gem in a straight corridor). This time, I revised the app so that the robot can act on a two-dimensional grid, and confirmed that, as the grid size grows, the previous approach of holding a Q-table and updating its Q-values breaks down because of the sharp increase in memory and computation. As a promising alternative, I examined the use of a neural network and present the results of implementing it in Python on a PC (rather than on the smartphone).

Abstract
In Part 1 and Part 2 of this series, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot gets a gem in a straight corridor). This time, I revised the app so that the robot can act on a 2D grid. However, as the grid size increases, the conventional method of holding a Q-table and updating its Q-values becomes impractical because memory usage and computation grow rapidly. As a promising alternative, I considered using a neural network; here are the results of doing that with Python on a PC (not a smartphone).

 Revised version of the Q-Learning app
In the revised version of the smartphone app, as shown in Fig.1(a), the robot is trained to reach the gem while avoiding the barrier on a 4x4 grid. The Q-Learning algorithm is basically the same as last time. There is one gem and one barrier, and their positions change randomly for each play (i.e., every episode). The robot also starts from a random position. After sufficient training, the robot always reaches the gem by the shortest route. If not well trained, on the other hand, the robot often gets lost or hits a wall, as shown in Fig.1(b).
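For reference, here is a minimal sketch of what such a tabular Q-Learning update looks like. It is my reconstruction, not the app's actual code: the state is assumed to be the triple of robot, gem, and barrier cells, and the learning parameters are assumed values.

import random
from collections import defaultdict

ACTIONS = ["N", "E", "S", "W"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1       # assumed learning parameters
Q = defaultdict(float)                      # Q[(state, action)] -> value, created on demand

def choose_action(state):
    # Epsilon-greedy selection over the four moves.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, done):
    # Standard Q-Learning backup for one observed transition.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])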


 Memory capacity required for Q-Learning
The size of the Q-table required for this learning can be calculated from the grid size and the numbers of gems and barriers. See Fig.2. The number of Q-table entries (i.e., the number of keys) is the total number of possible states, which in Case 1 (4x4) is 3,360. At this scale the table fits comfortably even on a smartphone, and the amount of computation stays within an acceptable range. In Case 2 (5x5), however, the total number of states jumps to over 6,000,000, even though only one gem and one barrier have been added. At that scale, whether on a smartphone or a PC, processing becomes practically impossible in terms of both memory and computation.
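These entry counts can be reproduced with a simple permutation count, assuming (my reading of Fig.2) that the robot, gems, and barriers always occupy distinct cells and are placed in order:

from math import perm

case1 = perm(16, 3)   # 4x4 grid: robot + 1 gem + 1 barrier -> 16*15*14 = 3,360
case2 = perm(25, 5)   # 5x5 grid: robot + 2 gems + 2 barriers -> 25*24*23*22*21 = 6,375,600
print(case1, case2)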

 Calculating Q-values with a neural network (without holding a Q-table)
For cases like Case 2 above, one way forward is to calculate the required Q-values with a neural network instead of holding a Q-table. To do this, the Q-value update formula is rearranged into a form suited to a neural network, as shown in Fig.3. This makes it possible to compare an output (the network's current Q-value estimate) with a target ([1]), so the problem can be solved with ordinary supervised machine learning. Iterating this training brings the output closer to the target and, as a result, the Q-value closer to its exact value.
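In code form, the rearrangement looks roughly like this. It is a sketch under the standard Q-Learning formulation, and the discount factor of 0.9 is an assumed value:

import numpy as np

# Tabular update:    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
# Network training:  push the output Q(s,a) toward target = r + gamma * max_a' Q(s',a')

def td_target(reward, q_next, gamma=0.9, done=False):
    # q_next holds the network's Q-value estimates for all actions in the next state.
    if done:                       # terminal transition: no future value
        return reward
    return reward + gamma * np.max(q_next)

# The supervised loss for one transition is then simply (td_target(...) - Q(s, a)) ** 2.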

Fig.4 illustrates how this neural network is used in Case 1. Note that in this example the action "W (west)" is taken in the current state S. In this way, one training step covers only one action in one state, so the training must be repeated for as many actions, in as many states, as possible.
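Below is a minimal Keras sketch of one such training step (not the actual program from [1]); the state encoding, network size, and parameter values are assumptions for illustration only.

import numpy as np
import tensorflow as tf

N_ACTIONS = 4        # N, E, S, W
STATE_DIM = 16 * 3   # assumption: one-hot planes for robot / gem / barrier on the 4x4 grid

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(N_ACTIONS, activation="linear"),   # one Q-value per action
])
model.compile(optimizer="adam", loss="mse")

def train_one_step(state_vec, action, reward, next_state_vec, done, gamma=0.9):
    # One supervised update for a single (state, action) pair.
    q = model.predict(state_vec[np.newaxis], verbose=0)[0]            # current outputs
    q_next = model.predict(next_state_vec[np.newaxis], verbose=0)[0]  # outputs for next state
    target = reward if done else reward + gamma * np.max(q_next)
    q[action] = target                     # correct only the taken action's output
    model.fit(state_vec[np.newaxis], q[np.newaxis], epochs=1, verbose=0)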

 Example calculation of Q-values by the neural network
I implemented this neural-network-based learning method in Python and ran it on a PC. The program is based on the Python program (using TensorFlow/Keras) published by Dr. Makoto Ito in reference [1]. Fig.5 shows the learning process for Case 1 (4x4), where 10,000 randomly generated episodes were trained. In the upper graph, the average sum of rewards per episode reaches about 0.8. When the neural network is not used, the corresponding value shown on the right of Fig.1(a) (the characters are small and hard to read) is 0.8305, so the two results are almost the same. The lower graph shows that the average number of steps the robot takes to reach the gem is about 2.9, which is also reasonable given the situation in Fig.1.


I omit the details, but for Case 2 (5x5) I was also able to train well with this neural network. The run took only about three minutes on an ordinary PC, which confirms the usefulness of this method. I have used only the most basic neural network here; for more complex problems (for example, when the robot needs to remember the location of an object), other networks such as LSTMs may be needed.

Acknowledgments
I was able to create a Q-value calculation program using a neural network by referring to the Python program published in reference [1]. I would like to thank its author, Dr. Makoto Ito.

References
[1] Makoto Ito's Blog Article: M-note Program and Electronic Work (in Japanese)
     http://itoshi.main.jp/tech/

Sunday, September 12, 2021

Running Q-Learning on your palm (Part2)

 Japanese summary: In the previous article, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot obtains a gem). This time, I gave the robot several variations of behavior. Even so, the smartphone app required almost no changes other than adding descriptions of the new actions. In this example as well, as a result of Q-Learning, the robot was able to discover by itself the optimal procedure for obtaining the gem.

Abstract
In the previous article, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot gets a gem). This time, I gave the robot some variations in its behavior. Even so, I changed very little of the smartphone app, other than adding descriptions of the new actions. In this example as well, as a result of Q-Learning, the robot was able to discover the optimal procedure for obtaining the gem.

# For the case where the robot moves on a 2D grid, please see this revised version.

 New examples (two cases)
As last time, and as shown in the figure below, the task is for the robot to move along the corridor and get the gem. The actions the robot can take are different from last time, but learning the best steps to successfully acquire the gem is the same.



Consider the following two cases for the robot's actions and their rewards (a minimal sketch of these step/reward rules appears after the list). In both cases, an episode ends when "Take" is performed (regardless of success or failure) or when the robot leaves the corridor.

Case1:
  • Take: Take the gem (reward = +5 if successful, otherwise -1)
  • Forward: Move forward one block in the corridor (reward = -1)
  • Jump: Move forward two blocks in the corridor (reward = -1)
Case2:
  • Take: Take the gem (reward = +5 if successful, otherwise -1)
  • Back: Go back one block in the corridor (reward = -1)
  • Skip2: Skip two blocks in the corridor (reward = -1)
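The step/reward rules of both cases can be sketched as follows. This is my reconstruction: it assumes a corridor of six cells and that "Skip2" lands three cells ahead (i.e., it skips over two blocks), which matches the example sequences shown below.

CORRIDOR_LEN = 6   # cells indexed 0..5, e.g. "R . . G . ."

MOVES = {
    "Forward": +1,   # Case 1
    "Jump":    +2,   # Case 1
    "Back":    -1,   # Case 2
    "Skip2":   +3,   # Case 2 (lands three cells ahead, skipping over two blocks)
}

def step(robot_pos, gem_pos, action):
    # Returns (new_robot_pos, reward, episode_done).
    if action == "Take":
        reward = 5 if robot_pos == gem_pos else -1
        return robot_pos, reward, True                  # Take always ends the episode
    new_pos = robot_pos + MOVES[action]
    if not (0 <= new_pos < CORRIDOR_LEN):
        return new_pos, -1, True                        # left the corridor: episode ends
    return new_pos, -1, False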

 Learning results in Case 1 and an example of robot movement
As a result of fully executing Q-Learning for Case 1, we obtained a highly accurate Q-table. Using it, the robot was able to discover the optimal procedure for obtaining the gem, as shown in Fig.1. In the initial state of this example, the positions of R (Robot) and G (Gem) are expressed as "R . . G . .". The maximum Q-table value for this state is given by both "Forward" and "Jump" (both values are 3.0, consistent with an undiscounted return of -1 - 1 + 5 = 3 for either two-move route to the gem). Whichever is adopted, the outcome is the same; here, "Jump" was taken. At each subsequent state, the action that maximizes the Q-table value was again taken, so the gem was successfully acquired. This is the optimal procedure.


 Learning results in Case 2 and an example of robot movement
The actions available to the robot in Case 2 differ from those in Case 1, but the robot was similarly able to discover a procedure for obtaining the gem. The situation is shown in Fig.2. In the initial state of this example, the positions of R (Robot) and G (Gem) are expressed as "R . G . . .". The procedure, which combines "Skip2" and "Back", is optimal.


Here is another, slightly more complicated example in Case 2. See the GIF animation below. In this example, the robot found the best steps to get the gem:

“R . . . G .”  → Skip2 → Back → Back → Skip2 → Take [success]

 Setting rewards according to purpose
The reward values shown above can be changed depending on the purpose. For example, unlike the above, suppose you want the robot to get the gem in the best way no matter where it starts. In such cases, change the reward design of Case 2 as follows and call it Case 3 (a minimal sketch of this change follows the list).

Case3:
  • Take: Take the gem (reward = +5 if successful, otherwise -1)
  • Back: Go back one block in the corridor (reward = -1 if it is in the corridor after moving, otherwise -2)
  • Skip2: Skip two blocks in the corridor (reward = -1 if it is in the corridor after moving, otherwise -2)
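The changed part can be sketched as follows (again my reconstruction, assuming a six-cell corridor): moves that leave the corridor are now penalized more heavily than moves that stay inside it.

def move_reward(new_pos, corridor_len=6):
    # Reward for Back / Skip2 in Case 3, evaluated after the move.
    return -1 if 0 <= new_pos < corridor_len else -2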

This reward design will serve your purpose, as in the examples below: