Sunday, September 12, 2021

Running Q-Learning on your palm (Part2)

 Japanese summary 前回の記事では、スマホ向けのQ-Learningアプリを開発し、それを簡単な例(ロボットが宝石を得る)に適用しました。今回は、ロボットの行動にいくつかのバリエーションを与えてみました。その場合でも、新しい行動の記述を追加する以外は、このスマホアプリをほとんど変更していません。今回の例題でも、Q-Learningによる学習の結果、ロボットは宝石を得るための最適な手順を自ら発見できました。

Abstract
In the previous article, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot gets a gem). This time, I varied the actions available to the robot. Even so, I barely changed the smartphone app, other than adding the descriptions of the new actions. In these examples as well, as a result of Q-Learning, the robot was able to discover the optimal procedure for obtaining the gem.

# For the case where the robot moves on a 2D grid, please see this revised version.

 New examples (two cases)
As last time, the task, shown in the figure below, is for the robot to move along the corridor and get the gem. The actions the robot can take are different from last time, but the goal of learning the best steps to successfully acquire the gem is the same.



Consider the following two cases of robot actions and their rewards (a minimal environment sketch, under stated assumptions, follows the two lists). In both cases, an episode ends when a "Take" is performed (regardless of success or failure) or the robot moves outside the corridor boundary.

Case1:
  • Take: Take the gem (reward = +5 if successful, otherwise -1)
  • Forward: Move forward one block in the corridor (reward = -1)
  • Jump: Move forward two blocks in the corridor (reward = -1)
Case2:
  • Take: Take the gem (reward = +5 if successful, otherwise -1)
  • Back: Go back one block in the corridor (reward = -1)
  • Skip2: Skip over two blocks in the corridor, landing three blocks ahead (reward = -1)
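To make the two cases concrete, here is a minimal environment sketch in Python. This is not the app's actual code: the 6-block corridor indexed 0 to 5, the robot's position as the state, "Jump" advancing two blocks, and "Skip2" landing three blocks ahead (skipping over two) are assumptions read off the figures and traces in this post.

```python
# Minimal corridor environment sketch for Case1 / Case2 (not the app's code).
# Assumptions: a 6-block corridor indexed 0..5, the state is the robot's
# position, "Jump" advances 2 blocks, and "Skip2" lands 3 blocks ahead.

CORRIDOR_LEN = 6

ACTIONS = {
    "Case1": {"Take": 0, "Forward": +1, "Jump": +2},
    "Case2": {"Take": 0, "Back": -1, "Skip2": +3},
}

def step(pos, action, gem_pos, case="Case1"):
    """Apply one action; return (next_pos, reward, done)."""
    if action == "Take":
        # "Take" always ends the episode, whether it succeeds or not.
        return pos, (5 if pos == gem_pos else -1), True
    next_pos = pos + ACTIONS[case][action]
    if not 0 <= next_pos < CORRIDOR_LEN:
        # Leaving the corridor also ends the episode.
        return next_pos, -1, True
    return next_pos, -1, False
```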

 Learning results in Case1 and an example of the robot moving
As a result of fully executing Q-Learning for Case1, we obtained a highly accurate Q-table. Using it, the robot was able to discover the optimal procedure for obtaining the gem, as shown in Fig.1. In the initial state of this example, the positions of R (Robot) and G (Gem) are expressed as "R . . G . .". The maximum Q-value for this state is shared by "Forward" and "Jump" (both are 3.0). Either choice leads to the same result; here, "Jump" was taken. At each subsequent state, the action with the maximum Q-value was taken as well, so the gem was acquired successfully. This is the optimal procedure.
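The app's internals are not shown in this post, but a result like Fig.1 can be reproduced with ordinary tabular Q-Learning followed by a greedy rollout over the hypothetical sketch above (reusing its CORRIDOR_LEN, ACTIONS, and step). The learning rate, exploration rate, and episode count below are illustrative guesses, and gamma = 1.0 is assumed only because the reported Q-value of 3.0 matches the undiscounted return.

```python
import random

def train_q_table(gem_pos, case="Case1", episodes=5000,
                  alpha=0.1, gamma=1.0, epsilon=0.2):
    """Plain tabular Q-learning over the sketch above; hyperparameters are guesses."""
    actions = list(ACTIONS[case])
    q = {(s, a): 0.0 for s in range(CORRIDOR_LEN) for a in actions}
    for _ in range(episodes):
        pos, done = random.randrange(CORRIDOR_LEN), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: q[(pos, x)])
            nxt, r, done = step(pos, a, gem_pos, case)
            # Q-learning update; terminal transitions contribute only the reward
            target = r if done else r + gamma * max(q[(nxt, b)] for b in actions)
            q[(pos, a)] += alpha * (target - q[(pos, a)])
            pos = nxt
    return q

def greedy_rollout(q, start, gem_pos, case="Case1", max_steps=20):
    """Follow the action with the largest Q-value at each state."""
    actions = list(ACTIONS[case])
    pos, done, trace = start, False, []
    while not done and len(trace) < max_steps:
        a = max(actions, key=lambda x: q[(pos, x)])
        trace.append(a)
        pos, _, done = step(pos, a, gem_pos, case)
    return trace

# Fig.1's initial state "R . . G . ." corresponds to start=0, gem_pos=3:
q1 = train_q_table(gem_pos=3, case="Case1")
print(greedy_rollout(q1, start=0, gem_pos=3))
# A well-trained table yields a 3-action procedure such as
# ['Jump', 'Forward', 'Take'] (or the equally good ['Forward', 'Jump', 'Take']).
```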


 Learning results in Case2 and an example of the robot moving
The actions available to the robot in Case2 differ from those in Case1, but, similarly, the robot was able to discover a procedure for obtaining the gem. The situation is shown in Fig.2. In the initial state of this example, the positions of R (Robot) and G (Gem) are expressed as "R . G . . .". This procedure, which combines "Skip2" and "Back", is optimal.


Here is another, slightly more complicated example in Case2. See the GIF animation below. In this example, the robot found the best steps to get the gem:

“R . . . G .”  → Skip2 → Back → Back → Skip2 → Take [success]
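In terms of the same hypothetical sketch and the train_q_table / greedy_rollout functions above, this example corresponds to start index 0 with the gem at index 4:

```python
# "R . . . G ." : robot at index 0, gem at index 4, Case2 actions
q2 = train_q_table(gem_pos=4, case="Case2")
print(greedy_rollout(q2, start=0, gem_pos=4, case="Case2"))
# A well-trained table gives a 4-move path ending in a successful Take,
# e.g. ['Skip2', 'Back', 'Back', 'Skip2', 'Take'].
```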

 Setting rewards according to purpose
The reward values shown above can be changed depending on your purpose. For example, unlike the above, suppose you want the robot to obtain the gem in the best way no matter which position it starts from. In that case, change the reward design of Case2 as follows and call it Case3 (the corresponding change to the hypothetical step function is sketched after the list).

Case3:
  • Take: Take the gem (reward = +5 if successful, otherwise -1)
  • Back: Go back one block in the corridor (reward = -1 if the robot is still in the corridor after moving, otherwise -2)
  • Skip2: Skip over two blocks in the corridor (reward = -1 if the robot is still in the corridor after moving, otherwise -2)
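In terms of the hypothetical sketch above, Case3 amounts to changing only the penalty for ending up outside the corridor; everything else is identical to Case2:

```python
def step_case3(pos, action, gem_pos):
    """Case3 sketch: same moves as Case2, but leaving the corridor costs -2."""
    if action == "Take":
        return pos, (5 if pos == gem_pos else -1), True
    next_pos = pos + ACTIONS["Case2"][action]   # Back / Skip2, as in Case2
    if not 0 <= next_pos < CORRIDOR_LEN:
        return next_pos, -2, True               # heavier penalty than a normal move
    return next_pos, -1, False
```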

This reward design, which penalizes leaving the corridor more heavily than an ordinary move, will serve that purpose, as in the examples below:

