Abstract
In the previous article, I developed a Q-Learning app for smartphones and applied it to a simple example (a robot gets a gem). This time, I introduced some variations in the actions the robot can take. Even so, I changed very little of the smartphone app, other than adding the descriptions of the new actions. In this example as well, as a result of Q-Learning, the robot was able to discover the optimal procedure for obtaining the gem.
# For the case where the robot moves on a 2D grid, please see this revised version.
● New examples (two cases)
As last time, as shown in the figure below, the task is for the robot to move along the corridor and get the gem. The actions the robot can take differ from last time, but the goal of learning the best steps to successfully acquire the gem is the same. The two action sets are listed below (a code sketch of these actions follows the lists).
Case 1:
- Take: Take the gem (reward = +5 if successful, otherwise -1)
- Forward: Move forward one block in the corridor (reward = -1)
- Jump: Move forward two blocks in the corridor (reward = -1)

Case 2:
- Take: Take the gem (reward = +5 if successful, otherwise -1)
- Back: Go back one block in the corridor (reward = -1 if the robot is still in the corridor after moving, otherwise -2)
- Skip2: Skip over two blocks in the corridor (reward = -1 if the robot is still in the corridor after moving, otherwise -2)
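Since the article does not show the app's internal code, here is a minimal Python sketch of how the corridor and the two action sets might be encoded. All names (N_CELLS, step, ACTIONS_CASE1, ...) are hypothetical, and the displacement of Skip2 (three blocks, i.e., skipping over two) is inferred from the Case 2 example in Fig.2.

```python
# Hypothetical sketch of the corridor task; names and details are assumptions,
# not the app's actual code.
N_CELLS = 6  # corridor length, matching the "R . . G . ." figures

# Each action maps to a displacement; "Take" (None) tries to pick up the gem.
ACTIONS_CASE1 = {"Take": None, "Forward": +1, "Jump": +2}
ACTIONS_CASE2 = {"Take": None, "Back": -1, "Skip2": +3}  # Skip2 skips over two blocks

def step(robot, gem, action, actions):
    """Apply one action; return (new robot position, reward, done)."""
    if actions[action] is None:              # "Take"
        if robot == gem:
            return robot, +5, True           # gem acquired
        return robot, -1, False              # failed take
    new_pos = robot + actions[action]
    if 0 <= new_pos < N_CELLS:
        return new_pos, -1, False            # moved within the corridor
    return robot, -2, False                  # left the corridor: penalty of -2
                                             # (stated for Case 2; the robot is
                                             # assumed to stay in place)
```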
● Learning results in Case 1 and Case 2, with examples of the robot's moves
As a result of fully executing Q-Learning for Case 1, we obtained a highly accurate Q-table. Using it, the robot was able to discover the optimal procedure for obtaining the gem, as shown in Fig.1. In the initial state of this example, the positions of R (Robot) and G (Gem) are expressed as "R . . G . .". The maximum Q-table value in this state is attained by both "Forward" and "Jump" (both values are 3.0), and either choice leads to the same result in the end; here, "Jump" was taken. In each subsequent state, the action that maximizes the Q-table value was likewise taken, so the gem was successfully acquired. This is the optimal procedure.
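Again as an illustrative sketch rather than the app's actual implementation, tabular Q-Learning for this task could look like the following (continuing the code above). The hyperparameters are assumptions, though gamma = 1.0 is consistent with the reported value of 3.0, which equals the undiscounted return of the optimal sequence: -1 - 1 + 5 = 3.

```python
import random

def q_learning(gem, actions, episodes=3000, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular Q-Learning on the corridor task (uses step/N_CELLS from above)."""
    names = list(actions)
    Q = {(s, a): 0.0 for s in range(N_CELLS) for a in names}
    for _ in range(episodes):
        s = 0                                  # robot starts at the left end
        for _ in range(50):                    # cap episode length: in Case 1 the
                                               # robot can get stuck past the gem
            a = (random.choice(names) if random.random() < eps
                 else max(names, key=lambda n: Q[(s, n)]))   # epsilon-greedy
            s2, r, done = step(s, gem, a, actions)
            target = r if done else r + gamma * max(Q[(s2, n)] for n in names)
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # TD update
            s = s2
            if done:
                break
    return Q
```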
The actions available to the robot in Case 2 differ from those in Case 1, but similarly, the robot was able to discover a procedure for obtaining the gem. The situation is shown in Fig.2. In the initial state of this example, the positions of R (Robot) and G (Gem) are expressed as "R . G . . .". Since "Skip2" carries the robot past the gem, the optimal procedure combines "Skip2" and "Back".
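To see both procedures emerge, a greedy rollout over the learned Q-table can be added to the sketch. The expected traces below follow from the figures, though the exact tie-breaking between "Forward" and "Jump" in Case 1 depends on the learned values.

```python
def greedy_rollout(Q, gem, actions, max_steps=10):
    """Follow the learned Q-table greedily; return the chosen action sequence."""
    names = list(actions)
    s, done, trace = 0, False, []
    while not done and len(trace) < max_steps:
        a = max(names, key=lambda n: Q[(s, n)])
        s, _, done = step(s, gem, a, actions)
        trace.append(a)
    return trace

# Case 1: "R . . G . ." -> gem at cell 3; Case 2: "R . G . . ." -> gem at cell 2.
print(greedy_rollout(q_learning(3, ACTIONS_CASE1), 3, ACTIONS_CASE1))
# e.g. ['Jump', 'Forward', 'Take'] (or 'Forward' first; both are optimal)
print(greedy_rollout(q_learning(2, ACTIONS_CASE2), 2, ACTIONS_CASE2))
# expected: ['Skip2', 'Back', 'Take']
```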